Knowledge Discovery in Big Data from Astronomy and Earth Observation: Astrogeoinformatics

About this ebook

Knowledge Discovery in Big Data from Astronomy and Earth Observation: Astrogeoinformatics bridges the gap between astronomy and geoscience in the context of applications, techniques and key principles of big data. Machine learning and parallel computing are increasingly becoming cross-disciplinary as the phenomenon of Big Data becomes commonplace. This book provides insight into the common workflows and data science tools used for big data in astronomy and geoscience. After establishing similarity in data gathering, pre-processing and handling, the data science aspects are illustrated in the context of both fields. Software, hardware and algorithms of big data are addressed.

Finally, the book offers insight into the emerging science which combines data and expertise from both fields in studying the effect of cosmos on the earth and its inhabitants.

  • Addresses both astronomy and geosciences in parallel, from a big data perspective
  • Includes introductory information, key principles, applications and the latest techniques
  • Well-supported by computing and information science-oriented chapters to introduce the necessary knowledge in these fields
Language: English
Release date: Apr 10, 2020
ISBN: 9780128191552

    Book preview

    Knowledge Discovery in Big Data from Astronomy and Earth Observation - Petr Skoda


    Part I

    Data

    Outline

    Chapter 1. Methodologies for Knowledge Discovery Processes in Context of AstroGeoInformatics

    Chapter 2. Historical Background of Big Data in Astro and Geo Context

    Chapter 1

    Methodologies for Knowledge Discovery Processes in Context of AstroGeoInformatics

    Peter Butka, PhD; Peter Bednár, PhD; Juliana Ivančáková, MSc

    Abstract

    Successful data science projects usually follow some methodology which provides the data scientist with basic guidelines on how to approach the problem and how to work with data, algorithms, or models. Such a methodology is a structured way to describe the knowledge discovery process. Without a flexible structure of steps, data science projects can be unsuccessful, or at least it will be hard to achieve a result that can be easily applied and shared. A better understanding of these processes is beneficial both to data scientists and to anyone who needs to discuss the results or steps of the process. Moreover, in some domains, including those working with data from astronomy and geophysics, the steps used in preprocessing and analysis of data are crucial to understanding the provided data products. In this chapter, we provide an overview of knowledge discovery processes, selected methodologies, and their standardization and sharing using process languages and ontologies. At the end of the chapter, we also discuss these aspects in the context of the astro/geo data domain.

    Keywords

    Knowledge discovery process; Data mining; Methodology; Process modeling; Ontology; AstroGeoInformatics

    1.1 Introduction

    Whenever someone wants to apply data mining techniques to a specific problem or dataset, it is useful to view the work in a broader and more organized way. Therefore, successful data science projects usually follow some methodology which provides the data scientist with basic guidelines on how to approach the problem and how to work with data, algorithms, or models. Such a methodology is a structured way to describe the knowledge discovery process. Without a flexible structure of steps, data science projects can be unsuccessful, or at least it will be hard to achieve a result that can be easily applied and shared. A better understanding of at least an overview of the process is beneficial both to the data scientist and to anyone who needs to discuss the results or steps of the process (such as data engineers, customers, or managers). Moreover, in some domains, including those working with data from astronomy and geophysics, the steps used in preprocessing and analysis of data are crucial to understanding the provided data products.

    From the 1990s, research in this area started to define its terms more precisely, with the definition of knowledge discovery (or knowledge discovery in databases [KDD]) (Fayyad et al., 1996) as a synonym for the knowledge discovery process (KDP). It included data mining as one of the steps in the knowledge acquisition effort. KDD (or KDP) and data mining are even today often treated as equal terms, but data mining is a subpart (step) of the whole process, dedicated to the application of algorithms able to extract patterns from data. Moreover, KDD also became the first description of a KDP as a formalized methodology. In the following years, new efforts led to other methodologies and their applications. We describe selected cases in more detail later.

    For a better understanding of KDPs, we can briefly describe how the basic terms data, information, and knowledge are defined. There have been many attempts to define them more precisely. One example is the DIKW pyramid (Rowley, 2007). This model represents and characterizes levels of increasing informativeness, known in information engineering as the Data–Information–Knowledge–Wisdom chain (see Fig. 1.1). Similar models often apply such chains, even if some parts are removed or combined. For example, such a model is often simplified to Data–Information–Knowledge or even Data–Knowledge, but the semantics are usually the same as or similar to those of the DIKW pyramid. Moreover, there are many models which describe not only the objects but also the processes of their transitions, e.g., Bloom's taxonomy (Anderson and Krathwohl, 2001), decision process models (Bouyssou et al., 2010), or knowledge management – SECI models (Nonaka et al., 2000). The description of a methodology usually defines what we understand under the data, information, and knowledge levels.

    Fig. 1.1 DIKW pyramid – understanding the difference between Data, Information, Knowledge, and Wisdom.

    While methodologies started from a more general view, they logically evolved toward more structured forms, and many of them became more tool-specific. Looking at the evolution of the KDP, two main directions followed the creation of the more general methodologies. First, in order to have more precise and formalized processes, many of them were transformed into standardized process-based definitions with automation of their steps. Such an effort is naturally easier to achieve in specific domains (such as industry, medicine, or science), with clear standards for exchanging documents and often with the support of specific tools used for the automation of processes. Second, when we have several standardized processes in different domains, it is often not easy to apply methods from one area directly in another. One solution is to support better cross-domain understanding of the steps using some shared terminology. This leads to the creation of formalized semantic models like ontologies, which help align terminology between domains. A further step toward a new view of methodologies and the sharing of information about them was the proposal of KDP ontologies, like OntoDM (Panov et al., 2013).

    To summarize, generalized methodologies are the basic concepts related to KDPs. More specific versions of them provide standards and automation in specific domains, while cross-domain models share domain-specific knowledge between different domains. This overview also reflects the structure of this chapter. In the next section, we provide some details on data–information–knowledge definitions and KDPs. In the following section, we describe existing, more general methodologies. In Section 1.4 we look at methodologies in a more precise way, through standardization and automation efforts, as well as attempts to share knowledge across domains. In the following section, the astro/geo context is discussed, focusing mainly on its specifics, shared aspects, and the possible transfer of knowledge.

    1.2 Knowledge Discovery Processes

    Currently, we can store and access large amounts of data. One of the main problems is to transform raw data into some useful artifacts. Hence, the real benefit is in our ability to extract such useful artifacts, which can be in the form of reports, policies, decisions, or recommended actions. Before we provide more details on processes that transform raw data into these artifacts, we can start with the basic notion of data, information, or knowledge.

    As we already mentioned in the previous section, there are different definitions with different scopes, from the DIKW pyramid with its more granular view to simpler definitions with only two levels of data–knowledge relations. For our purposes we stay with a simpler version of DIKW and define the Data–Information–Knowledge relations as follows, adapted from the broader definitions of Beckman (1997):

    •  Data – facts, numbers, pictures, recorded sound, or another raw source usually describing real-world objects and their relations;

    •  Information – data with added interpretation and meaning, i.e., formatted, filtered, and summarized data;

    •  Knowledge – information with actions and applications, i.e., ideas, rules, and procedures, which lead to decisions and actions.

    While there are also extended versions of such relations, this basic view is quite sufficient for all methodologies for KDPs. This is because raw data gathering (Data part), their processing and manipulation (Information part), and the creation of models suitable for supporting decisions and further actions (Knowledge part) are all necessary aspects of standard data analytical tasks. Hence, the transformations in this Data–Information–Knowledge chain represent a very general understanding of the KDP, or a simple version of a methodology. We have the input dataset (raw sources – Data part), which is transformed in several steps (often including data manipulation to get more interpreted and meaningful data – Information part) into knowledge (models containing rules or patterns – Knowledge part).

    For example, data from a customer survey are in the raw form of Yes/No answers, values on an ordinal scale, or numbers. If we put these data about customers in the context of the questions, combine them in infographics, and analyze their relations with each other, we transform raw data into information. In practice, we mine rules on how these customers and their subgroups usually react in the specific cases discussed in the survey. We can try to understand their behavior (what they prefer, what they buy), predict their future reactions in similar cases (whether they will be interested in a new product), and provide actionable knowledge in the form of a recommendation to the responsible actor (apply these rules to get higher income).
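
    To make this chain concrete, the following minimal sketch (in Python, with purely illustrative column names and values that are not taken from the text) shows how raw answers (Data) can be aggregated into interpreted summaries (Information) and turned into a simple actionable rule (Knowledge):

    # Minimal Data -> Information -> Knowledge sketch on a hypothetical survey.
    import pandas as pd

    # Data: raw survey answers (facts).
    data = pd.DataFrame({
        "age_group":      ["18-30", "18-30", "31-50", "31-50", "51+"],
        "likes_product":  ["Yes", "Yes", "No", "Yes", "No"],
        "monthly_budget": [120, 90, 300, 250, 60],
    })

    # Information: formatted, filtered, and summarized data.
    info = (data
            .assign(likes=lambda d: d["likes_product"].eq("Yes"))
            .groupby("age_group")
            .agg(share_positive=("likes", "mean"),
                 avg_budget=("monthly_budget", "mean")))
    print(info)

    # Knowledge: an actionable rule derived from the information, e.g.,
    # "target the segments where most respondents like the product".
    target_segments = info[info["share_positive"] > 0.5].index.tolist()
    print("Recommend targeting:", target_segments)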

    The presented view of Data–Information–Knowledge relations is also comparable to the view of business analytics. In this case, we have three options in analytics according to our expectations (Evans, 2015):

    •  Descriptive analytics – uses data aggregation and descriptive data mining techniques to see what happened in the system (business), so the question What has happened? is answered. The main idea is to use descriptive analytics if we want to understand at an aggregate level what is going on, summarize such information, and describe different aspects of the system in that way (to understand present and historical data). The methods here lead us to exploration analysis, visualizations, periodic or ad hoc reporting, trend analysis, data warehousing, and creation of dashboards.

    •  Predictive analytics – tasks in this part examine the future of the system. They answer the question What could happen according to historical data? We can see this as a prediction of states according to all historical information, i.e., an estimation of the normal development of the characteristics of our system. This part of analytical tasks is closest to the traditional view of KDPs. The methods here are those of any KDP methodology: statistical analysis and data mining methods.

    •  Prescriptive analytics – here belong all attempts to select some model of the system and optimize its possible outcomes. It means that we analyze what we have to do if we want to get the best efficiency for some output model values. The name comes from the word prescribe, so it is a prescription or advice for actions to be taken. The set of methods applied here is large, including methods from data mining, machine learning (whenever output models are also applicable as actions), operations research, optimization, computational modeling, or expert (knowledge-based) systems.

    A nice feature of business analytics is that every option can be applied separately, or we can combine them in the chain as a step-by-step process. In this case, we can see descriptive analytics mainly responsible for transformation between Data and Information. With the addition of predictive analytics, we can enhance the process of transformation to get Knowledge of our system. Our extracted knowledge is then applicable and actionable simply as is, or we can extend it and make it part of the decision making process using methods from the area of prescriptive analytics. Hence, we can see Data–Information–Knowledge in a narrow view as part of predictive analytics in, let us say, traditional understanding (with KDPs as KDD), or we can see it in broader scope with all analytics involved in transformation.

    We can illustrate the differences with an example. Imagine that a company has several hotels with casinos, and they want to analyze customers and optimize their profit. Within descriptive analytics they use data warehousing techniques to make reports about hotel occupancy over time, activities in the casino and its income, and infographics of profit according to different aspects. These methods will help them understand what is happening in their casinos and hotels. Within predictive analytics, they can create a predictive model that forecasts hotel and casino occupancy in the future, or they can use data about customers and segment them into groups according to their behavior in the casinos. The result is a better understanding of what will happen in the future, what the occupancy of the hotel will be in different months, and what the expected behavior of customers is when they come to the casino. Moreover, within prescriptive analytics, they can identify which decision-based inputs to set up (and how) in order to optimize their profit. It means that according to the prediction of hotel occupancy they can change prices accordingly, set up the allocation of rooms, or provide benefits to some segments of customers. For example, if someone is playing a lot, we can provide him/her with some benefits to support his/her return, like a better room for a lower price or free food.
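
    The following short sketch (in Python, on synthetic occupancy data; all numbers, thresholds, and prices are invented for illustration) contrasts the three levels on a simplified version of this scenario:

    # Descriptive vs. predictive vs. prescriptive analytics on synthetic data.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    months = np.arange(1, 25)                             # two years of history
    occupancy = (0.6 + 0.2 * np.sin(2 * np.pi * months / 12)
                 + rng.normal(0, 0.03, months.size))      # seasonal pattern plus noise
    df = pd.DataFrame({"month": months, "occupancy": occupancy})

    # Descriptive: what has happened? Average occupancy per calendar month.
    print(df.groupby((df["month"] - 1) % 12 + 1)["occupancy"].mean())

    # Predictive: what could happen? Forecast the next 12 months with a
    # simple seasonal regression.
    X = np.column_stack([np.sin(2 * np.pi * df["month"] / 12),
                         np.cos(2 * np.pi * df["month"] / 12)])
    model = LinearRegression().fit(X, df["occupancy"])
    future = np.arange(25, 37)
    Xf = np.column_stack([np.sin(2 * np.pi * future / 12),
                          np.cos(2 * np.pi * future / 12)])
    forecast = model.predict(Xf)

    # Prescriptive: what should we do? A toy pricing rule driven by the
    # forecast (raise prices in high-demand months, discount otherwise).
    prices = np.where(forecast > 0.7, 120, 90)
    print(list(zip(future.tolist(), forecast.round(2), prices.tolist())))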

    As we already mentioned, people often confuse the KDP with data mining, which is only one of its steps. Moreover, other names for knowledge discovery have also been used in the literature, like knowledge extraction, information harvesting, information discovery, data pattern processing, or even data archeology. The most widely used synonym for KDP is then obviously KDD, which is logical because KDP began with the processing of structured data stored in standard databases. The basic properties are even nowadays the same as or similar to the KDD basics from the 1990s. Therefore we can summarize them accordingly (Fayyad et al., 1996):

    •  The main objective of KDP is to seek new knowledge in the selected application domain.

    •  Data are a set of facts. A pattern is an expression in some suitable language (part of the outcome model, e.g., a rule written in some rule-based language) describing a subset of the facts.

    •  KDP is a nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. A process is simply a multistep approach of transformations from data to patterns. The pattern (knowledge) mentioned before is:

    •  valid – pattern should be true on new data with some certainty,

    •  novel – we did not know about this pattern before,

    •  useful – pattern should lead to actions (the pattern is actionable),

    •  comprehensible – the process should produce patterns that lead to a better understanding of the underlying data for humans (or machines).

    The KDP is easily generalized to data sources which are not in databases or not in structured form, which extends methodology aspects of a similar type to the areas of text mining, Big Data analysis, or data stream processing. Knowledge discovery involves the entire process, including storage and access of data, application of efficient and scalable data processing algorithms to analyze large datasets, interpretation and visualization of the results, and support of human–machine or human–computer interaction, as well as support for learning and analyzing the domain. The KDP model, which is then called a methodology, consists of a set of processing steps followed by the data analyst or scientist to run a knowledge discovery project. The KDP methodology usually describes procedures for each step of such a project. The model helps organizations (represented by the data analyst) to understand the process and create a project roadmap. The main advantages are reduced costs of ad hoc analysis, time savings, better understanding, and acceptance of the advice coming from the results of the analysis. While there are still data analysts who apply ad hoc steps in their projects, most of them apply some common framework with the help of (commercial or open source) software tools for particular steps or a unified analytical platform.

    Before we move to a description of selected methodologies in the next section, we summarize the motivation for the use of standardized KDP models (methodologies) (Kurgan and Musilek, 2006):

    •  The output product (knowledge) must be useful for the user, and ad hoc solutions more often fail to yield valid, novel, useful, and understandable results.

    •  Understanding of the process itself is important. Humans often lack the capacity to perceive large amounts of untapped and potentially valuable data. A well-structured and logical process model helps to avoid these issues.

    •  An often underestimated factor is support for management problems (this also includes larger projects in science, which need efficient management). KDP projects often involve large teams requiring careful planning and scheduling, and the management specialists in such projects are often unfamiliar with terms from the data mining area – a KDP methodology can then be helpful in managing the whole project.

    •  Standardization of KDP provides a unified view of current process description and allows an appropriate selection and usage of technology to solve current problems in practice, mostly on an industrial level.

    1.3 Methodologies for Knowledge Discovery Processes

    In this section, we provide more details on selected methodologies. Since the 1990s, several of them have been developed, starting basically from academic research but very quickly moving to the industry level. As we already mentioned, the first more structured approach was proposed as KDD in Fayyad et al. (1996). This approach was later modified and improved by both the research and the industry community. The processes always share a multistep sequential way of processing input data, where each step takes the result of the successful completion of the previous step as its input. It is also common that activities within steps cover understanding of the task and data, preprocessing or preparation of data, analysis, evaluation, understanding of results, and their application. All methodologies also emphasize their iterative nature by introducing feedback loops throughout the process. Moreover, they are strongly driven by human data scientists and therefore acknowledge interactivity. The main differences between the methodologies are in the number and scope of steps, the characteristics of their inputs and outputs, and the usage of various formats.

    Several studies have compared existing methodologies, their advantages and disadvantages, the scope of their application, their relation to software tools and standards, and other aspects. Probably the most extensive comparisons of methodologies can be found in Kurgan and Musilek (2006) and Mariscal et al. (2010). Other papers also bring ideas and advice, including on their applicability in different domains; see, for example, Cios et al. (2007), Ponce (2009), Rogalewicz and Sika (2016).

    Before we describe details of selected methodologies, we provide some information on two aspects, i.e., the evolution of methodologies and their practical usage by data analysts.

    Regarding the history of methodologies, Mariscal et al. (2010) give quite a thorough description of this evolution. As we already mentioned, the first attempts were fulfilled by Fayyad's KDD process between 1993 and 1996, which we also describe in the next subsection. This approach inspired several other methodologies in the following years, like SEMMA (SAS Institute Inc., 2017), Human-Centered (Brachman and Anand, 1996), or the approaches described in Cabena et al. (1998) and Anand and Buchner (1998). Other ideas also evolved into methodologies, including the 5As or Six Sigma. Of course, some issues were identified during those years, and the answer to them was the development of the CRISP-DM standard methodology, which we also describe in one of the following subsections. CRISP-DM became the leading methodology and quite a reasonable starting point for any data mining project, including new projects with Big Data and data stream processing. Any new methodology or standardized description of processes usually follows an approach similar to the one defined by CRISP-DM (some of them are covered in the review papers mentioned before).

    The influential role of CRISP-DM is evident from the polls evaluated on KDnuggets,¹ a well-known and widely accepted community-based web site related to knowledge discovery and data mining. Gregory Piatetsky-Shapiro, one of the authors of the KDD process methodology, showed in his article² that, according to the results of polls from 2007 and 2014, more than 42% of data analysts (the largest share of all votes) use the CRISP-DM methodology in their analytics, data mining, or data science projects, and the usage of the methodology seems to be stable.

    1.3.1 First Attempt to Generalize Steps – Research-Based Methodology

    In the emerging field of knowledge discovery in the 1990s, researchers defined a multistep process to guide users of data mining tools in their knowledge discovery effort. The main idea was to provide a sequence of steps that would help to go through the KDP in an arbitrary domain. As mentioned before, in Fayyad et al. (1996) the authors developed a model known as the KDD process.

    In general, KDD provides a nine-step process, mainly considered as a research-based methodology. It involves both the evaluation and interpretation of the patterns (possibly knowledge) and the selection of preprocessing, sampling, and projections of the data before the data mining step. While some of these nine steps focus on decisions or analysis, other steps are data transitions within the data–information–knowledge chain. As mentioned before, KDD is a nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad et al., 1996). The KDD process description also provides an outline of its steps, which is available in Fig. 1.2.

    Fig. 1.2 The KDD process.

    The model of the KDD process consists of the following steps (the input of each step is the output of the previous one), applied in an iterative (analysts apply feedback loops where necessary) and interactive way; a minimal code sketch of such a pipeline follows the list:

    1.  Developing and understanding the application domain, learning relevant prior knowledge, identifying the goals of the end-user (input: problem to be solved/our goal, output: understanding of the problem/domain/goal).

    2.  Creation of a target dataset – selection (querying) of the dataset, identification of subset variables (data attributes), and the creation of data samples for the KDP (output: target data/dataset).

    3.  Data cleaning and preprocessing – dealing with outliers and noise removal, handling the missing data, collecting data on time sequences, and identifying known changes to data (output: preprocessed data).

    4.  Data reduction and projection – finding useful features that represent the data (according to goal), including dimension reductions and transformations (output: transformed data).

    5.  Selection of data mining task – the decision on which methods to apply for classification, clustering, regression, or another task (output: selected method[s]).

    6.  Selection of data mining algorithm(s) – selection of methods for pattern search, deciding on appropriate models and their parameters, and matching methods with the goal of the process (output: selected algorithms).

    7.  Data mining – searching for patterns of interest in specific form like classification rules, decision trees, regression models, trends, clusters, and associations (output: patterns).

    8.  Interpretation of mined patterns – understanding and visualizations of patterns based on the extracted models (output: interpreted patterns).

    9.  Consolidation of discovered knowledge – incorporating discovered patterns into the system analyzed by the KDD process, documenting and reporting knowledge to end-users, and checking and resolving conflicts if needed (output: knowledge, actions/decisions based on the results).
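
    As a minimal illustration (not part of the original KDD description), the following Python sketch maps the nine steps onto a toy classification pipeline using scikit-learn; the dataset and algorithms are placeholders chosen for brevity:

    # Schematic sketch of the nine KDD steps as a toy scikit-learn pipeline.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Steps 1-2: understand the domain/goal and create the target dataset.
    X, y = load_iris(return_X_y=True)

    # Step 3: cleaning and preprocessing (here only scaling; real data would
    # also need missing-value and outlier handling).
    X = StandardScaler().fit_transform(X)

    # Step 4: data reduction and projection.
    X = PCA(n_components=2).fit_transform(X)

    # Steps 5-6: select the data mining task (classification) and algorithm.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = DecisionTreeClassifier(max_depth=3)

    # Step 7: data mining - search for patterns.
    model.fit(X_train, y_train)

    # Step 8: interpretation of the mined patterns (tree rules, test accuracy).
    print(export_text(model))
    print("accuracy:", model.score(X_test, y_test))

    # Step 9: consolidation - report the knowledge and act on it (omitted here).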

    The authors of this model declared its iterative fashion, but they gave no specific details. The KDD process is a simple methodology and quite a natural model for the discussion of KDPs. There are two significant drawbacks of this model. First, the lower levels are too abstract and are neither explicit nor formalized. This lack of detail was addressed in later methodologies using more formalized step descriptions (in some cases using standards, automation of processes, or specific tools or platforms). The second drawback is its lack of a description of business aspects, which is understandable given the research-based origin of the model.

    1.3.2 Industry-Based Standard – the Success of CRISP-DM

    Shortly after the KDD process definition, industry produced methodologies more suitable for its needs. One of them is CRISP-DM (CRoss-Industry Standard Process for Data Mining) (Chapman et al., 2000), which became the standard for many years and is still widely used in both industry and research. CRISP-DM was originally developed by a project consortium under the ESPRIT EU funding initiative in 1997. The project involved several large companies, which cooperated in its design: SPSS, Teradata, Daimler AG, NCR Corporation, and OHRA. Thanks to the different expertise of these companies, the consortium was able to cover all aspects, like IT technologies, case studies, data sources, and business understanding.

    CRISP-DM is an open standard and is available for anyone to follow. Some software tools (like SPSS Modeler/SPSS Clementine) have CRISP-DM directly incorporated. As we already mentioned, CRISP-DM is the most widely used KDP methodology. While it still has some drawbacks (for example, it does not cover project management activities), it became part of the most successful story in the data mining industry. A central factor behind this success is that CRISP-DM is industry-based and neutral with respect to tools and applications (Mariscal et al., 2010).

    The CRISP-DM model (see Fig. 1.3) consists of the following six steps, which are described in more detail below and can be applied iteratively, including feedback loops where necessary:

    1.  Business understanding – focuses on the understanding of objectives and requirements from a business perspective, and also converts them into the technical definition and prepares the first version of the project plan to achieve the objectives. Therefore, substeps here are:

    a.  determination of business objectives – here it is important to define what we expect as business goals (costs, profits, better support of customers, and higher quality of the data product),

    b.  assessment of the situation – understanding the actual situation within the objectives, defining the criteria of success for business goals,

    c.  determination of technical (data mining) goals – business goals should be transformed into technical goals, i.e., what data mining models we need to achieve the business goals, what the technical details of these models are, and how we will measure them,

    d.  generation of a project plan – the analyst creates the first version of the plan, where details of the next steps are given. The analyst should address different issues, from business aspects (how to discuss and transform data mining results, deployment issues from a management point of view) to technical aspects (how to obtain data, data formats, security, anonymization of data, software tools, technical deployment).

    2.  Data understanding – initial collection of data, understanding the data quality issues, exploration analysis, detection of interesting data subsets. If understanding shows a need to reconsider business understanding substeps, we can move back to the previous step. Hence, the substeps of data understanding are:

    a.  collection of initial data – the creation of the first versions of the dataset or its parts,

    b.  description of data – understanding the meaning of attributes in data, summary of the initial dataset(s), extraction of basic characteristics,

    c.  exploration of data – visualizations, descriptions of relations between attributes, correlations, simple statistical analysis on attributes, exploration of the dataset,

    d.  verification of data quality – analysis of missing values, anomalies, or other issues in data.

    3.  Data preparation – after finishing the first steps, the most important step is the preparation of data for data mining (modeling), i.e., the preparation of the final dataset for modeling using the data manipulation methods that can be applied. These can be divided into:

    a.  selection of data – a selection of tables, records, and attributes, according to goal needs and reduction of dimensionality,

    b.  integration of data – identification of the same entities within multiple tables, aggregations from multiple tables, redundancy checks, and detection and processing of conflicts in data,

    c.  cleansing of data – processing of missing values (remove records or imputation of values), processing of anomalies, removing inconsistencies,

    d.  construction (transformation) of data – the creation of new attributes, aggregations of values, transformation of values, normalizations of values, and discretization of attributes,

    e.  formatting of data – preparation of data as input to the algorithm/software tool for the modeling step.

    4.  Modeling – various modeling techniques are applied, and usually more types of algorithms are used, with different setup parameters (often with some metaapproach for optimization of parameters). Because methods have different formats of inputs and other needs, the previous step of data preparation could be repeated in a small feedback loop. In general, this step consists of:

    a.  selection of modeling technique(s) – choose the method(s) for modeling and examining their assumptions,

    b.  generation of test design – plan for training, testing, and evaluating the models,

    c.  creation of models – running the selected methods,

    d.  assessment of generated models – analysis of models and their qualities, revision of parameters, and rebuild.

    5.  Evaluation – once some high-quality models (according to the data analysis goal) are available, they are evaluated from a business perspective. The analyst reviews the process of model construction (to find insufficiently covered business issues) and also decides on the next usage of the data mining results. Therefore, we have:

    a.  evaluation of the results – assessment of results and identification of approved models,

    b.  process review – summarize the process, identify activities which need another iteration,

    c.  determination of the next step – a list of further actions is provided, including their advantages and disadvantages,

    d.  decision – describe the decision as to how to proceed.

    6.  Deployment – discovered knowledge is organized and presented in the form of reports, or a more complex deployment is performed. This can also be the step that finishes one of the cycles if we have an iterative application of the KDP (lifecycle applications). This step consists of:

    a.  plan deployment – the deployment strategy is provided, including the necessary steps and how to perform them,

    b.  plan monitoring and maintenance – strategy for the monitoring and maintenance of deployment,

    c.  generation of the final report – preparation of the final report and final presentation (if expected),

    d.  review of the process substeps – summary of experience from the project, unexpected problems, misleading approaches, interesting solutions, and externalization of best practices.

    Fig. 1.3 Methodology CRISP-DM.

    CRISP-DM is relatively easy to understand and has good vocabulary and documentation. Thanks to its generalized nature, this methodology is a very successful and extensively used model. In practice, many advanced analytic platforms are based on this methodology, even if they do not call it by the same name.

    In order to help in understanding the process, we can provide a simple example. One possible application of the CRISP-DM methodology is to provide tools supporting clinical diagnosis in medicine. For example, our goal is to improve breast cancer diagnostics using data about patients. In terms of the CRISP-DM methodology we can describe the KDP in the following way (a short code sketch of the data preparation, modeling, and evaluation phases is given after the list):

    1.  Business understanding – from a business perspective, our business objective is to improve the effectiveness of breast cancer diagnostics. Here we can provide some expectations in numbers related to diagnostic effectiveness and the costs of additional medical tests, in order to set up business goals – for example, if our diagnosis using some basic setup becomes more effective, it reduces the costs by 20%. Then data mining goals are defined. In terms of data mining, it is a classification task with a binary target attribute, which will be tested using a confusion matrix, and according to the business goals we want to achieve at least 95% accuracy of the classifier. According to the project plan, we know that data are available in CSV format, and data and models are processed in R using RStudio, with an Rshiny web application (on available server infrastructure) providing the interface for doctors in their diagnostic process.

    2.  Data understanding – in this example, let us say we have data collected from the Wisconsin Diagnosis Breast Cancer (WDBC) database. We need to understand the data themselves, what their attributes are, and what they mean. In this case, we have 569 records with 32 attributes, which mostly describe original images with/without breast cancer. The first attribute is an ID and the second attribute is the target class (binary – the result of diagnosis). The other 30 real-valued attributes describe different aspects of cells in the image (shape, texture, radius). We also find no missing values, and we do not need any procedure to clean or transform the data. We also explore the data, visualize them, and describe relations between attributes and correlations, in order to have enough information for the next steps.

    3.  Data preparation – any integration, cleaning, and transformation issues are solved here. In our example, there are no missing values or other issues in WDBC. There is only one data table, we will select all records, and we will not remove or add attributes. The data format is CSV, suitable as input to RStudio for the modeling step. We can also select subsets of data according to the expected modeling and evaluation, in this case, let us say a simple hold-out method with different ratios for the sizes of the training and test samples (80:20, 70:30, 60:40).

    4.  Modeling – data mining models are created. In our case, we want classification models (algorithms), e.g., C4.5, Random Forests, neural networks, k-NN, SVM, and naive Bayes. We create models for different hold-out selections and algorithm parameters to achieve the best models. Then we evaluate the models on the test subsets and select the best of them for further deployment, i.e., the SVM-based model with more than 97% accuracy with the 70:30 hold-out.

    5.  Evaluation – the best models are analyzed from a business point of view, i.e., whether we can achieve the business goal using such a model and whether it is sufficient for application in the deployment phase. We decide on how to proceed with the best model and what the advantages and disadvantages are. For example, in this case, the application of the selected model can support doctors and remove one invasive and expensive test from the diagnostic process in some of the new cases.

    6.  Deployment – a web-based application (based on Rshiny) is created and deployed on the server, which contains an extracted model (SVM classifier) and a user interface for the doctor in order to input results of image characteristics from new patients (records) and provide him/her with a diagnosis of such new samples.
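
    The chapter's example assumes an R/RStudio workflow; purely as an illustration of the data preparation, modeling, and evaluation phases, the following Python/scikit-learn sketch runs the same kind of experiment on scikit-learn's copy of the WDBC data (the accuracy figures mentioned above are not reproduced here and will differ from run to run):

    # Sketch of the modeling/evaluation phases of the WDBC example (Python).
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.metrics import confusion_matrix, accuracy_score

    # Data understanding/preparation: 569 records, 30 real-valued attributes,
    # binary target (malignant/benign), no missing values.
    X, y = load_breast_cancer(return_X_y=True)

    # Hold-out split (70:30); 80:20 or 60:40 would only change test_size.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42)

    # Modeling: an SVM classifier with feature scaling.
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    model.fit(X_train, y_train)

    # Evaluation: confusion matrix and accuracy, compared against the
    # business goal (e.g., a required minimum accuracy).
    y_pred = model.predict(X_test)
    print(confusion_matrix(y_test, y_pred))
    print("accuracy:", accuracy_score(y_test, y_pred))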

    1.3.3 Proprietary Methodologies – Usage of Specific Tools

    While the research or open standard methodologies are more general and tool-free, some of the leaders in the area of data analysis also provide their customers with proprietary solutions, usually based on the usage of their software tools.

    One such example is the SEMMA methodology from the SAS Institute, which provides a process description of how to use its data mining tools. SEMMA is a list of steps that guide users in the implementation of a data mining project. While SEMMA still provides quite a general overview of the KDP, its authors claim that it is the most logical organization of their tools (known as SAS Enterprise Miner) for covering core data mining tasks. The main difference between SEMMA and the traditional KDD overview is that the first steps of application domain understanding (or business understanding in CRISP-DM) are skipped. SEMMA also does not include the knowledge application step, so the business aspect is out of scope for this methodology (Azevedo and Santos, 2008). Both of these steps are considered crucial by the knowledge discovery community for the success of projects. Moreover, applying this methodology outside SAS software tools is not easy. The phases of SEMMA and the related tasks are the following:

    1.  Sample – the first step is data sampling – a selection of the dataset and data partitioning for modeling; the dataset should be large enough to contain representative information and content, but still small enough to be processed efficiently.

    2.  Explore – understanding the data, performing exploration analysis, examining relations between the variables, and checking anomalies, all using simple statistics and mostly visualizations.

    3.  Modify – methods to select, create, and transform variables (attributes) in preparation for data modeling.

    4.  Model – the application of data mining techniques on the prepared variables, the creation of models with (possibly) the desired outcome.

    5.  Assess – the evaluation of the modeling results, and analysis of reliability and usefulness of the created models.

    IBM Analytics Services have designed a new methodology for data mining/predictive analytics named Analytics Solutions Unified Method for Data Mining/Predictive Analytics (also known as ASUM-DM),³ which is a refined and extended CRISP-DM. While the strong points of CRISP-DM are on the analytical side, due to its open standard nature CRISP-DM does not cover the infrastructure or operations side of implementing data mining projects, i.e., it has only a few project management activities and no templates or guidelines for such tasks.

    The primary goal of ASUM-DM was to address the disadvantages mentioned above. The methodology retains CRISP-DM and augments some of its substeps with the missing activities, tasks, guidelines, and templates. Therefore, ASUM-DM is an extension or refinement of CRISP-DM, mainly through a more detailed formalization of steps and the application of (IBM-based) analytics tools. ASUM-DM is available in two versions – an internal IBM version and an external version. The internal version is a full-scale version with attached assets, and the external version is a scaled-down version without attached assets. Some of these ASUM-DM assets, or modified versions of them, are available through a service engagement with IBM Analytics Services. Like SEMMA, it is a proprietary methodology, but a more detailed one with a broader scope of covered steps within the analytical project.

    At the end of this section, we also mention that KDPs can easily be extended using agile methods, initially developed for software development. The main application of agile-based aspects is logically in larger teams in the industrial area. Many approaches are adapted explicitly for a particular company and are therefore proprietary. Generally, the KDP is iterative, and the inclusion of more agile aspects is quite natural (Nascimento and de Oliveira, 2012). The AgileKDD method follows the OpenUP lifecycle, which implements the Agile Manifesto. The project consists of sprints with fixed deadlines (usually a few weeks), and each sprint must deliver incremental value. Another example of an agile process description is ASUM-DM from IBM, which combines project management and agility principles.

    1.3.4 Methodologies in Big Data Context

    Traditional methodologies are usually applied in Big Data projects as well. The problem here is that none of the traditional standards supports the description of the execution environment or workflow lifecycle aspects. In the case of Big Data projects, this is an important issue due to the complex cluster of distributed services implemented using various technologies (distributed databases, frameworks for distributed processing, message queues, data provenance tools, coordination and synchronization tools). An interesting paper discussing these aspects is Ponsard et al. (2017). One of the methodologies related to Big Data mentioned in this paper is Architecture-centric Agile Big data Analytics (AABA) (Chen et al., 2016), which addresses the technical and organizational challenges of Big Data with the application of agile delivery. It integrates Big Data system Design (BDD) and Architecture-centric Agile Analytics (AAA) with the architecture-supported DevOps model for effective value discovery and continuous delivery of value. The authors validated the method on case studies from different domains and summarized several recommendations for Big Data analytics:

    •  Data analysts should already be involved in the business analysis phase.

    •  There should be continuous architecture support.

    •  Agile steps are important and helpful due to fast technology and requirements changes in this area.

    •  Whenever possible, it is better to follow the reference architecture to make development and evolution of data processing much easier.

    •  Feedback loops need to be open and should include both technical and business aspects.

    As we already mentioned, the processing of data and their lifecycle is quite an important aspect in this area. Moreover, the setup of the processing architecture and technology stack is probably of the same importance in the Big Data context. One approach to solving such issues is related to the Big Data Integrator (BDI) Platform (Ermilov et al., 2017), developed within the Big Data Europe H2020 flagship project, which provides a distribution of Big Data components as one platform with easy installation and setup. While there are several other similar distributions, the authors of this platform also provided potential users with a methodology for developing Big Data stack applications and several use cases from different domains. One of their inspirations was to use the CRISP-DM structure and terminology and apply them to a Big Data context, as in Grady (2016), where the author extends CRISP-DM to process scientific Big Data. In the scope of the BDI Platform, the authors proposed the BDI Stack Lifecycle methodology, which supports the creation, deployment, and maintenance of complex Big Data applications. The BDI Stack Lifecycle consists of the following steps (they developed documentation and tools for each of the steps):

    1.  Development – templates for technological frameworks, most common programming languages, different IDEs applied, distribution formalized for the needs of users (data processing task).

    2.  Packaging – dockerization and publishing of the developed or existing components, including best practices that can help the user to decide.

    3.  Composition – assembly of a BDI stack, integration of several components to address the defined data processing
