Targeting Uplift: An Introduction to Net Scores

About this ebook

This book explores all relevant aspects of net scoring, also known as uplift modeling: a data mining approach used to analyze and predict the effects of a given treatment on a desired target variable for an individual observation. After discussing modern net score modeling methods, data preparation, and the assessment of uplift models, the book investigates software implementations and real-world scenarios. Focusing on the application of theoretical results and on practical issues of uplift modeling, it also includes a dedicated chapter on software solutions in SAS, R, Spectrum Miner, and KNIME, which compares the respective tools. This book also presents the applications of net scoring in various contexts, e.g. medical treatment, with a special emphasis on direct marketing and corresponding business cases. The target audience primarily includes data scientists, especially researchers and practitioners in predictive modeling and scoring, mainly, but not exclusively, in the marketing context. 


Language: English
Publisher: Springer
Release date: Sep 9, 2019
ISBN: 9783030226251

    Targeting Uplift - René Michel

    © Springer Nature Switzerland AG 2019

    R. Michel et al., Targeting Uplift, https://doi.org/10.1007/978-3-030-22625-1_1

    1. Introduction

    René Michel¹, Igor Schnakenburg² and Tobias von Martens¹

    (1) Deutsche Bank AG, Frankfurt am Main, Germany

    (2) DeTeCon International GmbH, Berlin, Germany

    1.1 Problem Statement

    In various areas of application, treatments are commonly used in order to affect behavior. In recent decades, the systematic collection and analysis of behavioral data by means of advanced statistical methods has allowed for the identification of behavioral patterns that may previously have been hidden. Applying analytics to estimate the impact of treatments on behavior (i.e., uplift) is not just a natural extension of this but a challenge worth taking on, since the effective and efficient control of behavior may be crucial for competitive advantage (or a non-monetary equivalent). This book focuses on uplift analytics and shows how they can be applied.

    The following exemplary use cases underline the diversity of application areas in which treatments are used to affect behavior:

    Direct marketing tries to convince customers to purchase a product or service.

    Churn prevention campaigns strengthen or win back customers’ loyalty.

    Medical treatments are applied to help patients to recover from a disease or ease pain.

    Fertilizers are used to increase yields in agriculture.

    Pre-emptive maintenance is used to avoid machine malfunctioning.

    Police forces are used pre-emptively to avoid crimes, especially break-ins.

    Some of these treatments—specifically when trying to influence human behavior—may be characterized as nudges as presented in [14], i.e., treatments that gently steer people towards decisions in their own interest without taking away their freedom to choose. One goal of the methods presented in this book is to make the effect of such nudges both measurable and predictable.

    Most often, the magnitude of the effect that the treatment exerts on behavior is only assumed but not exactly known in advance. However, a subsequent estimation of the effect is possible in most cases:

    If any additional influencing factors can be excluded (e.g., by means of an experimental design), behavior before and after the treatment might be compared.

    If there is a structurally identical group of observations not exposed to the treatment (i.e., control group), the behavior of the group of observations that received the treatment (i.e., target group) can be compared to the behavior of the control group. This is assumed as the standard approach in this book.
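    Under this standard approach, the treatment effect can be estimated as the difference between the response rates of the target and control groups. A minimal sketch of this calculation follows; the group sizes and response counts are purely hypothetical:

    ```python
    # Minimal sketch: estimating the average treatment effect from a target/control comparison.
    # All counts are hypothetical and only illustrate the arithmetic.
    target_size, target_responses = 10_000, 820      # observations exposed to the treatment
    control_size, control_responses = 10_000, 650    # structurally identical, untreated observations

    target_rate = target_responses / target_size     # response rate with treatment
    control_rate = control_responses / control_size  # response rate without treatment
    uplift = target_rate - control_rate              # estimated net effect of the treatment

    print(f"target: {target_rate:.2%}, control: {control_rate:.2%}, uplift: {uplift:+.2%}")
    ```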

    Decisions on the utilization of treatments, however, require an estimation of their assumed effect beforehand. Therefore, the development of statistical models for the forecast of treatment impacts on an observation is a challenge that research has dealt with for several years. This forecast, known as uplift modeling, net scoring, incremental response modeling, causal modeling, average causal effect (ACE) modeling, or personalized medicine (the latter two in the medical context), is based on the characteristics of an observation and its environment. In order to support decisions on whether or not to apply the treatment, behavioral changes due to the treatment have to be forecasted and evaluated:

    Customers may or may not purchase the product or service after having been targeted by a direct marketing campaign.

    Customers may extend or quit their telecommunication contract after they have been addressed by a churn prevention campaign of their provider.

    Patients recover or do not recover after they have been treated with a specific medication.

    Fields provide or do not provide greater yields after they have been fertilized.

    Machines may or may not fail less often after they have been maintained according to a specific process.

    The number of crimes committed in a certain area may or may not reduce after more police forces have been sent out.

    All use cases presume that exerting the treatment on all observations is not possible, mostly because of limited (financial) resources, or even not recommended, since the individual behavior of some observations may not be affected at all or may even be affected negatively. Hence, a decision on which observations should receive the treatment is required. Just as Lo (see [6]) emphasizes, uplift modeling is capable of identifying those observations whose response will be positively influenced by the treatment. Whereas gross scoring, the classical scoring approach, predicts the probability of a certain behavior given (but not necessarily induced by) a treatment, net scoring predicts the difference in behavior given a treatment compared to no treatment (see [9]).
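    Written out, with R denoting the target event and X = x the explanatory variables as formalized in Chap. 2, and with an additional treatment indicator T (T = 1: treated, T = 0: untreated) that is used here purely for illustration:

    $$ \text{gross score: } P(R=1 \mid T=1, {\boldsymbol X}={\boldsymbol x}), \qquad \text{net score: } P(R=1 \mid T=1, {\boldsymbol X}={\boldsymbol x}) - P(R=1 \mid T=0, {\boldsymbol X}={\boldsymbol x}) $$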

    As stated in [2], instead of telling what happens (descriptive analytics) or what will happen (predictive analytics), prescriptive analytics aims at telling what should be done to make things happen. Uplift modeling is a core approach in the field of prescriptive analytics.

    Despite the fact that some methods for the development of appropriate statistical models have been suggested, the problem is often simplified in practice to the forecast of behavior without taking a (previous) treatment into account.

    In direct marketing, for example, a common approach is to address with campaigns only those customers who have an above-average affinity towards a product or service. The underlying assumption is that the relative effect of a direct marketing campaign (i.e., the treatment) is approximately equal for all customers, i.e., customers with a high affinity are uplifted more by a campaign than customers with a low affinity. Clearly, the mathematical challenge of considering the effect of treatments is avoided in this case.

    However, forecasting behavior while ignoring the effect of treatments can lead to misinterpretations and a wrong allocation of (financial) resources:

    Some customers are targeted by a direct marketing campaign although they would have decided in favor of a product or service anyway (Sure Things). Other customers whose behavior could be positively affected by a direct marketing campaign (Persuadables) are not considered, since they have a low basic affinity towards this product or service. Customers without any inclination to purchase, no matter whether the treatment is applied or not, are usually referred to as Lost Causes.

    Some telecommunication or insurance customers (Sleeping Dogs) may quit their contract because a churn prevention campaign made them actively think about ending their contract and searching for alternatives offered by competitors.

    Some patients get a specific medication although they would have recovered without it, too. Other patients do not get that medication although they would have benefited from it.

    Fields are fertilized that would have provided good yields anyway, while other fields remain unfertilized but would provide higher yields if fertilized.

    Maintenance is focused on machines that fail often (failures that cannot be prevented anyway), while other machines are not maintained, although predictive maintenance could have prevented their rare failures.

    Police resources are sent to areas where crimes cannot be prevented. Intelligent usage of police forces could (also) send them to areas where their presence could reduce crimes.
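    The customer segments mentioned above can be summarized by the pair of responses with and without treatment, a pair that can never be observed jointly for the same individual. The following sketch uses the segment names as commonly defined in the uplift literature; the mapping is illustrative, not a formula taken from this book:

    ```python
    # Illustrative mapping of the four customer segments by their hypothetical
    # behavior with and without treatment (the two outcomes can never be observed
    # for the same individual at the same time).
    def segment(responds_if_treated: bool, responds_if_untreated: bool) -> str:
        if responds_if_treated and responds_if_untreated:
            return "Sure Thing"    # responds anyway; the treatment budget is wasted
        if responds_if_treated and not responds_if_untreated:
            return "Persuadable"   # only the treatment triggers the desired response
        if not responds_if_treated and responds_if_untreated:
            return "Sleeping Dog"  # the treatment backfires, e.g., provokes churn
        return "Lost Cause"        # never responds, with or without treatment

    print(segment(True, False))    # -> "Persuadable", the segment uplift models aim to find
    ```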

    From a methodological perspective, considering treatment effects increases complexity, e.g., the problem that an observation cannot be treated and not treated at the same time has to be overcome. The target criterion, e.g., product purchase vs. no product purchase or recovery vs. illness, according to which groups should be discriminated is no longer sufficient on its own. In addition, the treatment has to be regarded as an additional dimension in the modeling algorithm itself. Moreover, historical data rarely allows for the analysis of one group’s behavior after receiving a specific treatment compared to the behavior of another group that received a different (or no) treatment. Therefore, research on this subject not only has to consider the development of modeling algorithms for treatment data but also the availability of such data.

    It is possibly due to these mathematical and practical challenges and the increased demands regarding the available data that the uplift approach has not received appropriate attention from research in the past (at least compared to the vast amount of research on classical scoring methods) and, hence, has not gained the broad application it deserves with respect to its positive impact on effectiveness and efficiency. However, in recent years, the number of publications on uplift modeling has increased. Most of those publications consider a specific issue of uplift modeling, while a general summary of the methods, their comparison, their applications, and practical use cases seems to be missing in the literature. This book intends to fill this gap.

    1.2 State-of-the-Art

    Most of the contributions to uplift modeling can be found in the data mining literature. Typically, they consider the problem in the context of direct marketing or medical treatments, since these can be regarded as primary areas of application. The suggested approaches range from the enhancement of classical methods to the development of new methods, e.g.:

    Lo (see [6]) points out that the classical methodology is not directly designed to solve the appropriate business objective (i.e., identifying the most responsive customers) and suggests a new scoring method based on logistic regression.

    Hansotia and Rukstales (see [4]) describe tree- and regression-based approaches to develop incremental decision rules and justify marketing investments.

    Radcliffe and Simpson (see [8]) illustrate how retention campaigns based on conventional scoring methods may even provoke some customers to leave. They suggest Qini graphs and the Qini coefficient as generalizations of gains charts and the Gini coefficient, respectively, for the measurement of the discriminatory power of uplift models.

    Radcliffe and Surry (see [10]) document the then state-of-the-art in uplift modeling. They propose quality measures and success criteria of uplift modeling and suggest significance-based uplift trees as an appropriate scoring method.

    Austin (see [1]) applies ensemble methods in the uplift context, i.e., unifying several models into one common, superior model, and demonstrates their effectiveness with simulated data.

    Rzepakowski and Jaroszewicz (see [11]) present tree-based classifiers in order to decide which action out of a set of potential treatments should be used in order to maximize (incremental) profit.

    Jaskowski and Jaroszewicz (see [5]) extend standard probabilistic classification models, such as logistic regression, for uplift modeling on clinical trial data. To that end, they apply either class variable transformation or treatment and control classifiers in logistic regression analysis.

    Guelman et al. (see [3]) introduce a new, statistically advanced way of uplift modeling based on random forests together with an implementation package in the common statistical software R.

    Michel et al. (see [7]) introduce $$\chi ^2_{{\mathrm{net}}}$$ as a modification of the classical χ² statistic for uplift modeling and show detailed net scoring scenarios for marketing.

    Devriendt et al. (see [2]) give an overview of the relevant literature regarding uplift modeling. They also raise a number of open questions, such as the influence of sample sizes and other factors on net scoring performance and model stability as well as suitable business cases to validate the economic impact of net scoring. These aspects will be addressed in this book.

    Initial summaries also exist as chapters of introductory books on data mining and predictive analytics, such as [12] and [13].

    The contributions mentioned above illustrate that the relevance of uplift modeling has been acknowledged in recent years. All authors share the perception that the classical scoring methods currently used in practice are not designed to serve the primary objective, i.e., identifying the most responsive customers (or patients), in order to support decisions on the utilization of treatments. Furthermore, the research contributions at hand demonstrate, by means of simulated or real-world data, that uplift modeling outperforms classical scoring methods with regard to treatment effectiveness. Hence, this book comprises the recent state of the art of research and follows the tradition of [10] and [2].

    1.3 Structure of the Book

    This book aims at examining uplift modeling in all of its facets. It contributes to research, since the state-of-the-art of uplift modeling is summarized and enhanced where research gaps have been identified. The book also contributes to practical experience by addressing the application of uplift modeling and corresponding challenges comprehensively. The scoring methods found in the literature and the methods proposed by the authors are compared to each other both conceptually and by means of simulation studies with current software implementations. Furthermore, topics that have received minor attention so far, such as suitable sample sizes, a closed-loop approach to uplift modeling in practice as well as a systematic identification of potential areas of application, are described.

    The book is structured in the following way:

    At first, both scoring approaches, i.e., the classical scoring (also referred to as gross scoring in this book) and uplift modeling (also referred to as net scoring), are presented and compared to each other with regard to the problem statement, available methods, and the assessment of modeling results.

    After that, main challenges of uplift modeling, such as the assessment of net scoring models as well as variable preselection for modeling, are explored.

    Next, focusing on the application of uplift modeling in practice, currently available software implementations are presented and compared by their functionality and performance on a given dataset.

    Another important practical aspect, namely the kind of data that has to be available, is investigated and appropriate sample sizes are suggested.

    Finally, potential areas of application for uplift modeling are identified. For the marketing use case, a framework for an alignment with the business strategy is proposed. Moreover, a process model for implementing uplift modeling is suggested.

    References

    1. P. Austin. Using ensemble-based methods for directly estimating causal effects: An investigation of tree-based g-computation. Multivariate Behavioral Research, 47:115–135, 2012.

    2. F. Devriendt, D. Moldovan, and W. Verbeke. A literature survey and experimental evaluation of the state-of-the-art in uplift modeling: A stepping stone toward the development of prescriptive analytics. Big Data, 6(1):13–41, 2018. https://doi.org/10.1089/big.2017.0104.

    3. L. Guelman, M. Guillén, and A.M. Perez-Marin. Optimal personalized treatment rules for marketing interventions: A review of methods, a new proposal, and an insurance case study. UB Riskcenter Working Paper Series, 2014(06), 2014.

    4. B. Hansotia and B. Rukstales. Incremental value modeling. Journal of Interactive Marketing, 16(3):35–46, 2002.

    5. M. Jaskowski and S. Jaroszewicz. Uplift modeling for clinical trial data. ICML 2012 Workshop on Clinical Data Analysis, 2012.

    6. V. Lo. The true lift model: A novel data mining approach to response modeling in database marketing. SIGKDD Explorations, 4(2):78–86, 2002.

    7. R. Michel, I. Schnakenburg, and T. von Martens. Effiziente Ressourcenallokation für Vertriebskampagnen durch Nettoscores. Betriebswirtschaftliche Forschung und Praxis, 67(6):665–677, 2015.

    8. N.J. Radcliffe and R. Simpson. Identifying who can be saved and who will be driven away by retention activity. Journal of Telecommunications Management, 1(2):168–176, 2008.

    9. N.J. Radcliffe and P.D. Surry. Quality measures for uplift models. Working paper, 2011. http://stochasticsolutions.com/pdf/kdd2011late.pdf.

    10. N.J. Radcliffe and P.D. Surry. Real-world uplift modeling with significance-based uplift trees. Technical Report, Stochastic Solutions, 2011.

    11. P. Rzepakowski and S. Jaroszewicz. Decision trees for uplift modeling with single and multiple treatments. Knowledge and Information Systems, 32:303–327, 2012.

    12. E. Siegel. Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die. John Wiley & Sons, 2015.

    13. J. Strickland. Predictive Analytics Using R. Lulu Press, 2015.

    14. R. Thaler and C. Sunstein. Nudge: Improving Decisions About Health, Wealth and Happiness. Penguin, 2009.

    © Springer Nature Switzerland AG 2019

    R. Michel et al., Targeting Uplift, https://doi.org/10.1007/978-3-030-22625-1_2

    2. The Traditional Approach: Gross Scoring

    René Michel¹, Igor Schnakenburg² and Tobias von Martens¹

    (1) Deutsche Bank AG, Frankfurt am Main, Germany

    (2) DeTeCon International GmbH, Berlin, Germany

    Model building and scoring as a statistical methodology have been known for decades, and there is a wide variety of literature available, e.g., [4] or [11]. It is not the intention of this chapter to give a complete overview of model building and scoring. Instead, typical methods of model building and scoring are presented which are required to understand the change in paradigm with the introduction of net scoring and its methods later in Chap. 3. In order to distinguish between both approaches, the classical approach will be referred to as gross scoring, whereas the new approach will be referred to as net scoring or uplift modeling interchangeably. At first, we explain and formalize the problem to be solved. Section 2.2 will introduce common methods for scoring like decision trees or (logistic) regression, always with the generalization to net scoring in mind. Section 2.3 contains an introduction to well-known quality measures for scoring models. This introduction also serves as a preparation to generalize those quality indicators to the net scoring setup in Chap. 4.

    Although the facts presented in this chapter may be known to many readers, it is nevertheless recommended to study this chapter in order to get familiar with the way scoring methods are presented and described in this book. This will help to understand the net approaches that will be described later on.

    2.1 Problem Statement

    To put it simply: The problem in the classical prediction case is to calculate the probability of an event occurring in the future. In reality, the event either does happen or it does not, but this is not known at the moment of calculation. The precise context of this general setup can take different forms, which are, however, not important for the mathematical considerations. Some examples where calculated probabilities may trigger an action are the following:

    A company aims at predicting product purchases for all registered customers (i.e., customers and their corresponding data are known to this company). The customer may or may not purchase a specific product.

    An enterprise aims at predicting the failure of parts of a machine it produces (or uses) in order to have the relevant spare parts available in due time. The respective part may or may not fail.

    The police aim at predicting crimes in order to be present and prevent them. The crimes may or may not be committed.

    A bank aims at predicting credit default rates on a customer-individual level in order to take appropriate precautions. The credits may or may not default.

    A doctor aims at predicting whether a patient can recover from a current disease. The patient may or may not recover.

    All of the examples above have the following structural elements in common:

    a set of observations, such as customers, patients, or machines

    information on the observations in the form of explanatory variables, such as age, blood pressure, or type of machine

    a target event, such as a product purchase, malfunction, sickness, or recovery

    The target variable in the simplest case (which will be the focus in this chapter and most parts of the book) is a binary variable, where 1 describes the occurrence of an event and 0 describes its non-occurrence. The goal is to predict the occurrence of the desired event for every individual observation based on its attributes. This information is then, for example, used to implement some business strategy like approving or rejecting a credit request, repairing a machine, or targeting a customer. Non-binary variables are also possible as targets. A target which can (at least theoretically) assume any numerical value will be referred to as interval-scaled, metric, or continuous. These designations will be used as synonyms throughout the text.

    In order to generalize scoring to the net case, a formalization of the setup of the classical gross scoring is useful.

    Let X be a random vector of explanatory variables and x a realization of that random vector. In order to ease notations, it is assumed without loss of generality that $${\boldsymbol {x}}\in \mathbb R^s$$ , i.e., any categorical variables in the data at hand are modeled as numbers. Further on, let R be a binary random variable describing the target (occurrence of the desired event) for each observation.

    Then, P(R = 1|X = x) denotes the probability of a target event for an observation with the explanatory vector x. This is the conditional probability of the event R = 1 given X = x. For didactical reasons, it is neglected that P(X = x) may be 0, in which case the conditional probability is not well defined and has to be defined instead with the help of suitable limit considerations. Thus, in order to ease notation, assume that P(X = x) > 0, although this may not always be the case in the probability model.

    The central goal of gross scoring is to find estimators $$\hat p_{\boldsymbol {x}}$$ that give reasonable empirical approximations for P(R = 1|X = x), based on the explanatory variables. These estimators will usually be based on n independent and identically distributed copies of the random tuple (R_i, X_i), i = 1, …, n, as observable, for example, from n customers, n patients, or n machine parts.
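    A minimal sketch of such a gross-score estimator, assuming a flat table with numeric explanatory variables and a binary target; the file name, the column names, and the use of logistic regression from scikit-learn are illustrative assumptions, not the book's implementation:

    ```python
    # Minimal sketch: estimating p_x = P(R = 1 | X = x) with logistic regression.
    # File name and column names are hypothetical.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    data = pd.read_csv("observations.csv")            # flat file: one row per observation
    X = data[["age", "income", "prior_purchases"]]    # explanatory variables (already numeric)
    r = data["target"]                                # binary target: 1 = event occurred, 0 = not

    X_train, X_test, r_train, r_test = train_test_split(X, r, test_size=0.3, random_state=42)

    model = LogisticRegression(max_iter=1000).fit(X_train, r_train)
    p_hat = model.predict_proba(X_test)[:, 1]         # estimated gross scores for unseen observations
    ```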

    When implementing the classical approach to scoring, one usually puts all (potentially explanatory) data about the observations under consideration into one flat file. Additionally, information about the desired event has to be included. This means that historical data is used that contains cases where the event occurred and where it did not. It is important that the explanatory variables are recorded at a point in time before the event occurred. Such a flat file usually starts with an ID for each observation. The observation’s properties can be distinguished into several classes:

    Explanatory variables —these may be segmented into the following common categories:

    General observation data —e.g., name, date of birth, gender, or ID. If the observation is a customer, then general customer data includes address or contact details. In the context of asset maintenance, general information may include material, texture, or production line. In other contexts, general information is meant to distinguish the specific observation from other observations (if not from all other observations); however, general information is also meant to be stable.

    Development data —this information, too, is more or less specific for the observation, but it may vary with time and is not fixed from the beginning. For customers, this may be transactional data, product usage, or (online) behavioral data,¹ while for assets, this may be usage, position, velocity, accelerations, etc. Development data contains historical data as well as recent data from several points in time.

    Context data —rather than specifying the observation itself, contextual data describes the environment or the neighbors of the observation. In marketing, this may be the network of people the customer has contact with, or the specific customer behavior in the geographical vicinity (communication impacts); for assets, it may be crucial at which part in a framework they are located (central or boundary) or what happened to surrounding assets (root cause analysis).

    Treatment data—information about how the observation has been treated in the past; for assets, this may be the number of repairs in the past, while for customers, it could be a treatment or campaigning history that those individuals have been exposed to (rather than initiating it by themselves).

    Finally, there is derived information like all kinds of statistical measures: maximum, minimum, average, regression slope and intercept, deviations, but also certain ratios (e.g., wallet shares, amount of credit volume per postal code, average maintenance cost in a region, above- or below-average pressure levels, etc.). There is no limit to bringing in additional data and deriving variables from existing data as long as it can be assigned to a specific observation. Clearly, this calls for subject matter expertise, as not every piece of new information will be correlated with the target variable.

    Target variable—in this chapter, a binary target variable is considered, e.g., a customer has purchased or churned, an asset has failed or recovered, etc.
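    A minimal sketch of how such a flat file might look, with one small block of columns per variable class; all column names and values are illustrative assumptions:

    ```python
    # Illustrative flat-file layout: one row per observation, columns grouped by variable class.
    import pandas as pd

    flat = pd.DataFrame({
        "id":              [1, 2, 3],                  # observation ID
        "year_of_birth":   [1980, 1992, 1975],         # general observation data
        "purchases_12m":   [4, 0, 7],                  # development data (behavior over time)
        "balance_avg_3m":  [1200.0, 80.0, 5300.0],     # development data
        "region_response": [0.05, 0.02, 0.08],         # context data (environment of the observation)
        "contacts_12m":    [2, 0, 5],                  # treatment data (past campaign contacts)
        "target":          [1, 0, 1],                  # binary target variable
    })

    # derived information, e.g., a simple ratio per observation
    flat["purchases_per_contact"] = flat["purchases_12m"] / flat["contacts_12m"].clip(lower=1)
    ```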

    After data collection, a suitable number of observations where the event occurred and a suitable number of observations where it could have occurred,² but did not, are used for the modeling process . Modeling process means that a statistical algorithm is deployed in order to identify connections between the predictor variables and the target variable. Which connections among data are useful, though? Following [4], there are several criteria:

    easily understandable

    valid on new data

    potentially useful

    novel

    Many different classical algorithms are available to gain understandable, valid, useful, and novel information. Some of the most common ones will be presented in this chapter. We restrict ourselves to methods which will be important for the generalization to net scoring in the next chapter, so this is not thought to be a complete overview. The presented methods are

    Decision trees

    (Logistic) regression

    Neural networks

    Nearest neighbor

    Bayesian classifiers

    With the exception of neural networks, a direct uplift generalization of the methods above will be given in Chap. 3. Neural networks are presented because some net scoring methods use a combination of arbitrary gross scoring methods, and because neural networks have been very popular among data scientists in recent years.
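    For orientation, gross-scoring implementations of these method families are available in standard libraries; the following sketch shows how they might be instantiated in scikit-learn, which is an illustrative choice and not a tool prescribed by this book:

    ```python
    # Gross-scoring counterparts of the listed method families in scikit-learn.
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.naive_bayes import GaussianNB

    models = {
        "decision tree":       DecisionTreeClassifier(max_depth=5),
        "logistic regression": LogisticRegression(max_iter=1000),
        "neural network":      MLPClassifier(hidden_layer_sizes=(16,)),
        "nearest neighbor":    KNeighborsClassifier(n_neighbors=25),
        "Bayesian classifier": GaussianNB(),
    }
    # each model exposes fit(X, r) and predict_proba(X), i.e., gross-score estimation
    ```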

    Using these or other algorithms, rules or formulae can be found and afterwards deployed on new observations in order to estimate the requested target. The prediction for each observation can then be taken as a basis for decision-making or further calculations.

    2.2 Methods

    The standard process of data mining is, for example, described by CRISP DM (CRoss Industry Standard Process for Data Mining, see [7]). This is an iterative or circular-shaped process which indicates that a task in data mining may never be considered as accomplished in total.

    Data mining generally starts with problem understanding. It is helpful to write down explicitly what needs to be mined: firstly, because the nomenclature of precisely stated problems may differ from real-world vocabulary, and secondly, in order to align expectations about what is doable, feasible, and in scope or out of scope. Understanding the business problem also includes agreeing on what the results will be used for, how often they need to be produced, and what the ideal structure of the resulting data looks like. In most real-world scenarios, the data mining methods are already decided upon, as is the tool to be used. However, in principle, the data miner should be free to choose the deployed algorithm and the analytical software.

    Once the goals have been defined, the next step is typically to understand the data. What kind of data is available and can be used for the question at hand? Is the data readily accessible from a data warehouse (DWH), from the cloud, or from a data lake? What kind of transformations or preprocessing steps are required, or—in the extreme case—is a study required that generates the data to be used for analysis?

    When this understanding is accomplished, the analyst starts with data preparation. This includes, in particular, the actual retrieval of data. This step is usually very tedious in practice and takes a lot of time.

    Once the data is present (usually in some electronic form), the analyst starts examining the data. Data checking comprises several dimensions like the following:

    Metadata: The variables that are available, their formats with respect to dimensions (date, currency, alphanumeric, numeric), the formats they come in (csv-files, database tables, txt-files, stream data, video data, voice data), their latency, their aggregation, the filter they come through, history of data

    Quality of data: Missing values, corrupted values, precision, logical structure (e.g., fact tables, dimension tables), consistency, outliers

    Simple statistics: Frequency counts, minima, maxima, averages, standard deviations of each variable. This is important as certain values may hardly occur which may have a direct impact on the methods and algorithms to be deployed.

    Visual explorations: Simple data explorations by means of graphics, such as bar charts or line plots, help to get a feeling for the data.

    Simple connections: Correlations, frequency tables, distributions, and various two- or multidimensional plots
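    A few of the checks listed above, sketched with pandas; the file name and column references are illustrative assumptions:

    ```python
    # Illustrative data checks during data understanding.
    import pandas as pd

    data = pd.read_csv("observations.csv")     # hypothetical flat file

    print(data.dtypes)                         # metadata: variables and their formats
    print(data.isna().mean())                  # data quality: share of missing values per variable
    print(data.describe())                     # simple statistics: min, max, mean, std per variable
    print(data["target"].value_counts())       # frequency counts of the target variable
    print(data.corr(numeric_only=True))        # simple connections: pairwise correlations
    ```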

    The connection between problem understanding and data understanding will not only make it possible to estimate the effort of cleansing the data and putting it into the right shape for data mining (typically a flat table), but it will also allow the construction of new derived variables, e.g., trends, densities, ratios. At this stage, it may even turn out that without further information, the task at hand may not be (sufficiently) solvable.

    The next task for the analyst is model building, which is the focus of this section, i.e., the application of statistical methods like decision trees or regressions, usually in order to make prognoses about the future behavior of the observations. The results then need to be validated in order to see whether they answer the question of the business problem sufficiently. At this point, the results typically show that some of the earlier steps need to be improved, and the next iterations begin. This is continued until the results are satisfactory or until there is convincing evidence that, given current resources, a better result is not possible. The final step then is the deployment of the results in some productive system in order to solve the original problem. However, even after deployment, a regular validation of the models is important.

    The idea behind CRISP DM is an iterative approach not only as a whole but also for certain parts. It is often necessary to get back to the previous step when new information comes up. For example, when unexpectedly a data field is not available during the data preparation step, a return to data understanding (or even problem understanding) is required to integrate this new knowledge. Furthermore, results from modeling might give indications on how to improve data preparation. A graphical representation of the CRISP DM process is shown in Fig. 2.1.

    Fig. 2.1 Structured overview of the CRISP DM process. The iterative or back-and-forth nature is indicated by the corresponding arrows

    Another way of organizing the workflows typical of data science is known as SEMMA (an abbreviation for Sample, Explore, Modify, Model, Assess). It has been introduced by the statistical software company SAS, mainly for the functional organization of its main data mining software SAS Enterprise Miner, and is, for example, described in Chapter 1 of [10]. The SAS Enterprise Miner is also capable of net scoring, and this will be shown in Sect. 7.2. SEMMA assumes problem understanding and deployment as prerequisites but does not mention them. Just like CRISP DM, it emphasizes the importance of examining the available data.

    During data exploration, it may turn out that too few relevant data or observations are available which may require additional effort to correct these shortcomings. If only very few observations are available, more observations can be produced artificially by sampling with replacement. In times of Big Data , this method may not seem to be required very often, but it is still used for good reasons. If, on the contrary, more data is available than processable or meaningful, then sampling seems a promising solution, i.e., taking only a random part of the data. If sampling is stratified with respect to the target variable, it is called over- or undersampling depending on whether more or fewer observations than the original fraction of the corresponding target value are selected for the sample.
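    A minimal sketch of sampling with replacement and of stratified over-/undersampling with respect to a binary target; the column name, sample size, and target share are illustrative assumptions:

    ```python
    # Illustrative stratified resampling with respect to a binary target column.
    import pandas as pd

    def resample_by_target(df: pd.DataFrame, target: str, positive_share: float,
                           n: int, random_state: int = 42) -> pd.DataFrame:
        """Draw a sample of size n in which the share of target == 1 equals positive_share."""
        n_pos = int(round(n * positive_share))
        pos = df[df[target] == 1].sample(n_pos, replace=True, random_state=random_state)
        neg = df[df[target] == 0].sample(n - n_pos, replace=True, random_state=random_state)
        return pd.concat([pos, neg]).sample(frac=1, random_state=random_state)  # shuffle rows

    # oversampling: choose positive_share above the original fraction of target == 1;
    # undersampling: choose positive_share below it. replace=True also covers the case of
    # artificially producing more observations than are available (sampling with replacement).
    ```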

    Finally, the combination of data preparation and model evaluation typically includes the separation of the prepared data into several hold-out samples, for
