Practical Predictive Analytics

About This Book
  • A unique book centered on the six key practical skills needed to develop and implement predictive analytics
  • Apply the principles and techniques of predictive analytics to effectively interpret big data
  • Solve real-world analytical problems with the help of practical case studies and real-world scenarios taken from the world of healthcare, marketing, and other business domains
Who This Book Is For

This book is for those with a mathematical/statistics background who wish to understand the concepts, techniques, and implementation of predictive analytics to resolve complex analytical issues. Basic familiarity with the R programming language is expected.

Language: English
Release date: June 30, 2017
ISBN: 9781785880469


    Practical Predictive Analytics

    Back to the future with R, Spark, and more!

    Ralph Winters

    BIRMINGHAM - MUMBAI


    Practical Predictive Analytics

    Copyright © 2017 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    First published: June 2017

    Production reference: 1300617

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham

    B3 2PB, UK.

    ISBN 978-1-78588-618-8

    www.packtpub.com

    Credits

    About the Author

    Ralph Winters started his career as a database researcher for a music performing rights organization (he composed as well!), then branched out into healthcare survey research, finally landing in the analytics and information technology world. He has provided his statistical and analytics expertise to many large Fortune 500 companies in the financial, direct marketing, insurance, healthcare, and pharmaceutical industries. He has worked on many diverse types of predictive analytics projects involving customer retention, anti-money laundering, voice-of-the-customer text mining analytics, and healthcare risk and customer choice models.

    He is currently a data architect for a healthcare services company, working in the data and advanced analytics group. He enjoys working collaboratively with a smart team of business analysts, technologists, and actuaries, as well as with other data scientists.

    Ralph considers himself a practical person. In addition to authoring Practical Predictive Analytics for Packt Publishing, he has also contributed two tutorials illustrating the use of predictive analytics in medicine and healthcare to Practical Predictive Analytics and Decisioning Systems for Medicine (Miner et al., Elsevier, September 2014), and presented Practical Text Mining with SQL using Relational Databases at the 2013 11th Annual Text and Social Analytics Summit in Cambridge, MA.

    Ralph resides in New Jersey with his loving wife Katherine, amazing daughters Claire and Anna, and his four-legged friends, Bubba and Phoebe, who can be unpredictable.

    Ralph's website can be found at ralphwinters.com.

    About the Reviewers

    Armando Fandango serves as chief technology officer of REAL Inc., building AI-based products and platforms for making smart connections between brands, agencies, publishers, and audiences. He founded NeuraSights with the goal of creating insights from small and big data using neural networks and machine learning. Previously, as chief data scientist and chief technology officer (CTO) for Epic Engineering and Consulting Group LLC, he worked with government agencies and large private organizations to build smart products incorporating machine learning, big data engineering, enterprise data repositories, and enterprise dashboards. He has led data science and engineering teams as head of data for Sonobi Inc., driving big data and predictive analytics technology and strategy for JetStream, Sonobi's AdTech platform, and has managed high-performance computing (HPC) consulting and infrastructure for the Advanced Research Computing Centre at UCF. He has also advised the high-tech startups Quantfarm, Cortxia Foundation, and Studyrite as an advisory board member and AI expert. He is the author of Python Data Analysis - Second Edition and has published research in international journals and conferences.

    Alberto Boschetti is a data scientist with strong expertise in signal processing and statistics. He holds a Ph.D. in telecommunication engineering and currently lives and works in London. In his work projects, he daily faces challenges spanning natural language processing (NLP), machine learning, and distributed processing. He is very passionate about his job and always tries to stay up to date on the latest developments in data science, attending meetups, conferences, and other events. He is the author of Python Data Science Essentials, Regression Analysis with Python, and Large Scale Machine Learning with Python, all published by Packt.

    www.PacktPub.com

    For support files and downloads related to your book, please visit www.PacktPub.com.

    Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details.

    At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

    https://www.packtpub.com/mapt

    Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

    Why subscribe?

    Fully searchable across every book published by Packt

    Copy and paste, print, and bookmark content

    On demand and accessible via a web browser

    Customer Feedback

    Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1785886185.

    If you'd like to join our team of regular reviewers, you can e-mail us at customerreviews@packtpub.com. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

    Table of Contents

    Preface

    What this book covers

    What you need for this book

    Who this book is for

    Conventions

    Reader feedback

    Customer support

    Downloading the example code

    Downloading the color images of this book

    Errata

    Piracy

    Questions

    Getting Started with Predictive Analytics

    Predictive analytics are in so many industries

    Predictive Analytics in marketing

    Predictive Analytics in healthcare

    Predictive Analytics in other industries

    Skills and roles that are important in Predictive Analytics

    Related job skills and terms

    Predictive analytics software

    Open source software

    Closed source software

    Peaceful coexistence

    Other helpful tools

    Past the basics

    Data analytics/research

    Data engineering

    Management

    Team data science

    Two different ways to look at predictive analytics

    R

    CRAN

    R installation

    Alternate ways of exploring R

    How is a predictive analytics project organized?

    Setting up your project and subfolders

    GUIs

    Getting started with RStudio

    Rearranging the layout to correspond with the examples

    Brief description of some important panes

    Creating a new project

    The R console

    The source window

    Creating a new script

    Our first predictive model

    Code description

    Saving the script

    Your second script

    Code description

    The predict function

    Examining the prediction errors

    R packages

    The stargazer package

    Installing stargazer package

    Code description

    Saving your work

    References

    Summary

    The Modeling Process

    Advantages of a structured approach

    Ways in which structured methodologies can help

    Analytic process methodologies

    CRISP-DM and SEMMA

    CRISP-DM and SEMMA chart

    Agile processes

    Six sigma and root cause

    To sample or not to sample?

    Using all of the data

    Comparing a sample to the population

    An analytics methodology outline – specific steps

    Step 1 business understanding

    Communicating business goals – the feedback loop

    Internal data

    External data

    Tools of the trade

    Process understanding

    Data lineage

    Data dictionaries

    SQL

    Example – Using SQL to get sales by region

    Charts and plots

    Spreadsheets

    Simulation

    Example – simulating if a customer contact will yield a sale

    Example – simulating customer service calls

    Step 2 data understanding

    Levels of measurement

    Nominal data

    Ordinal data

    Interval data

    Ratio data

    Converting from the different levels of measurement

    Dependent and independent variables

    Transformed variables

    Single variable analysis

    Summary statistics

    Bivariate analysis

    Types of questions that bivariate analysis can answer

    Quantitative with quantitative variables

    Code example

    Nominal with nominal variables

    Cross-tabulations

    Mosaic plots

    Nominal with quantitative variables

    Point biserial correlation

    Step 3 data preparation

    Step 4 modeling

    Description of specific models

    Poisson (counts)

    Logistic regression

    Support vector machines (SVM)

    Decision trees

    Random forests

    Example - comparing single decision trees to a random forest

    An age decision tree

    An alternative decision tree

    The random forest model

    Random forest versus decision trees

    Variable importance plots

    Dimension reduction techniques

    Principal components

    Clustering

    Time series models

    Naive Bayes classifier

    Text mining techniques

    Step 5 evaluation

    Model validation

    Area under the curve

    Computing an ROC curve using the titanic dataset

    In sample/out of sample tests, walk forward tests

    Training/test/validation datasets

    Time series validation

    Benchmark against best champion model

    Expert opinions: man against machine

    Meta-analysis

    Dart board method

    Step 6 deployment

    Model scoring

    References

    Notes

    Summary

    Inputting and Exploring Data

    Data input

    Text file input

    The read.table function

    Database tables

    Spreadsheet files

    XML and JSON data

    Generating your own data

    Tips for dealing with large files

    Data munging and wrangling

    Joining data

    Using the sqldf function

    Housekeeping and loading of necessary packages

    Generating the data

    Examining the metadata

    Merging data using Inner and Outer joins

    Identifying members with multiple purchases

    Eliminating duplicate records

    Exploring the hospital dataset

    Output from the str(df) function

    Output from the View function

    The colnames function

    The summary function

    Sending the output to an HTML file

    Open the file in the browser

    Plotting the distributions

    Visual plotting of the variables

    Breaking out summaries by groups

    Standardizing data

    Changing a variable to another type

    Appending the variables to the existing dataframe

    Extracting a subset

    Transposing a dataframe

    Dummy variable coding

    Binning – numeric and character

    Binning character data

    Missing values

    Setting up the missing values test dataset

    The various types of missing data

    Missing Completely at Random (MCAR)

    Testing for MCAR

    Missing at Random (MAR)

    Not Missing at Random (NMAR)

    Correcting for missing values

    Listwise deletion

    Imputation methods

    Imputing missing values using the 'mice' package

    Running a regression with imputed values

    Imputing categorical variables

    Outliers

    Why outliers are important

    Detecting outliers

    Transforming the data

    Tracking down the cause of the outliers

    Ways to deal with outliers

    Example – setting the outliers to NA

    Multivariate outliers

    Data transformations

    Generating the test data

    The Box-Cox Transform

    Variable reduction/variable importance

    Principal Components Analysis (PCA)

    Where is PCA used?

    A PCA example – US Arrests

    All subsets regression

    An example – airquality

    Adjusted R-square plot

    Variable importance

    Variable influence plot

    References

    Summary

    Introduction to Regression Algorithms

    Supervised versus unsupervised learning models

    Supervised learning models

    Unsupervised learning models

    Regression techniques

    Advantages of regression

    Generalized linear models

    Linear regression using GLM

    Logistic regression

    The odds ratio

    The logistic regression coefficients

    Example - using logistic regression in health care to predict pain thresholds

    Reading the data

    Obtaining some basic counts

    Saving your data

    Fitting a GLM model

    Examining the residuals

    Residual plots

    Added variable plots

    Outliers in the regression

    P-values and effect size

    P-values and effect sizes

    Variable selection

    Interactions

    Goodness of fit statistics

    McFadden statistic

    Confidence intervals and Wald statistics

    Basic regression diagnostic plots

    Description of the plots

    An interactive game – guessing if the residuals are random

    Goodness of fit – Hosmer-Lemeshow test

    Goodness of fit example on the PainGLM data

    Regularization

    An example – ElasticNet

    Choosing a correct lambda

    Printing out the possible coefficients based on Lambda

    Summary

    Introduction to Decision Trees, Clustering, and SVM

    Decision tree algorithms

    Advantages of decision trees

    Disadvantages of decision trees

    Basic decision tree concepts

    Growing the tree

    Impurity

    Controlling the growth of the tree

    Types of decision tree algorithms

    Examining the target variable

    Using formula notation in an rpart model

    Interpretation of the plot

    Printing a text version of the decision tree

    The ctree algorithm

    Pruning

    Other options to render decision trees

    Cluster analysis

    Clustering is used in diverse industries

    What is a cluster?

    Types of clustering

    Partitional clustering

    K-means clustering

    The k-means algorithm

    Measuring distance between clusters

    Clustering example using k-means

    Cluster elbow plot

    Extracting the cluster assignments

    Graphically displaying the clusters

    Cluster plots

    Generating the cluster plot

    Hierarchical clustering

    Examining some examples from cluster 1

    Examining some examples from cluster 2

    Examining some examples from cluster 3

    Support vector machines

    Simple illustration of a mapping function

    Analyzing consumer complaints data using SVM

    Converting unstructured to structured data

    References

    Summary

    Using Survival Analysis to Predict and Analyze Customer Churn

    What is survival analysis?

    Time-dependent data

    Censoring

    Left censoring

    Right censoring

    Our customer satisfaction dataset

    Generating the data using probability functions

    Creating the churn and no churn dataframes

    Creating and verifying the new simulated variables

    Recombining the churner and non-churners

    Creating matrix plots

    Partitioning into training and test data

    Setting the stage by creating survival objects

    Examining survival curves

    Better plots

    Contrasting survival curves

    Testing for the gender difference between survival curves

    Testing for the educational differences between survival curves

    Plotting the customer satisfaction and number of service call curves

    Improving the education survival curve by adding gender

    Transforming service calls to a binary variable

    Testing the difference between customers who called and those who did not

    Cox regression modeling

    Our first model

    Examining the cox regression output

    Proportional hazards test

    Proportional hazard plots

    Obtaining the cox survival curves

    Plotting the curve

    Partial regression plots

    Examining subset survival curves

    Comparing gender differences

    Comparing customer satisfaction differences

    Validating the model

    Computing baseline estimates

    Running the predict() function

    Predicting the outcome at time 6

    Determining concordance

    Time-based variables

    Changing the data to reflect the second survey

    How survSplit works

    Adjusting records to simulate an intervention

    Running the time-based model

    Comparing the models

    Variable selection

    Incorporating interaction terms

    Displaying the formulas sublist

    Comparing AIC among the candidate models

    Summary

    Using Market Basket Analysis as a Recommender Engine

    What is market basket analysis?

    Examining the groceries transaction file

    Format of the groceries transaction files

    The sample market basket

    Association rule algorithms

    Antecedents and descendants

    Evaluating the accuracy of a rule

    Support

    Calculating support

    Examples

    Confidence

    Lift

    Evaluating lift

    Preparing the raw data file for analysis

    Reading the transaction file

    capture.output function

    Analyzing the input file

    Analyzing the invoice dates

    Plotting the dates

    Scrubbing and cleaning the data

    Removing unneeded character spaces

    Simplifying the descriptions

    Removing colors automatically

    The colors() function

    Cleaning up the colors

    Filtering out single item transactions

    Looking at the distributions

    Merging the results back into the original data

    Compressing descriptions using camelcase

    Custom function to map to camelcase

    Extracting the last word

    Creating the test and training datasets

    Saving the results

    Loading the analytics file

    Determining the consequent rules

    Replacing missing values

    Making the final subset

    Creating the market basket transaction file

    Method one – Coercing a dataframe to a transaction file

    Inspecting the transaction file

    Obtaining the topN purchased items

    Finding the association rules

    Examining the rules summary

    Examining the rules quality and observing the highest support

    Confidence and lift measures

    Filtering a large number of rules

    Generating many rules

    Plotting many rules

    Method two – Creating a physical transactions file

    Reading the transaction file back in

    Plotting the rules

    Creating subsets of the rules

    Text clustering

    Converting to a document term matrix

    Removing sparse terms

    Finding frequent terms

    K-means clustering of terms

    Examining cluster 1

    Examining cluster 2

    Examining cluster 3

    Examining cluster 4

    Examining cluster 5

    Predicting cluster assignments

    Using flexclust to predict cluster assignment

    Running k-means to generate the clusters

    Creating the test DTM

    Running the apriori algorithm on the clusters

    Summarizing the metrics

    References

    Summary

    Exploring Health Care Enrollment Data as a Time Series

    Time series data

    Exploring time series data

    Health insurance coverage dataset

    Housekeeping

    Read the data in

    Subsetting the columns

    Description of the data

    Target time series variable

    Saving the data

    Determining all of the subset groups

    Merging the aggregate data back into the original data

    Checking the time intervals

    Picking out the top groups in terms of average population size

    Plotting the data using lattice

    Plotting the data using ggplot

    Sending output to an external file

    Examining the output

    Detecting linear trends

    Automating the regressions

    Ranking the coefficients

    Merging scores back into the original dataframe

    Plotting the data with the trend lines

    Plotting all the categories on one graph

    Adding labels

    Performing some automated forecasting using the ets function

    Converting the dataframe to a time series object

    Smoothing the data using moving averages

    Simple moving average

    Computing the SMA using a function

    Verifying the SMA calculation

    Exponential moving average

    Computing the EMA using a function

    Selecting a smoothing factor

    Using the ets function

    Forecasting using ALL AGES

    Plotting the predicted and actual values

    The forecast (fit) method

    Plotting future values with confidence bands

    Modifying the model to include a trend component

    Running the ets function iteratively over all of the categories

    Accuracy measures produced by onestep

    Comparing the Test and Training for the UNDER 18 YEARS group

    Accuracy measures

    References

    Summary

    Introduction to Spark Using R

    About Spark

    Spark environments

    Cluster computing

    Parallel computing

    SparkR

    Dataframes

    Building our first Spark dataframe

    Simulation

    Importing the sample notebook

    Notebook format

    Creating a new notebook

    Becoming large by starting small

    The Pima Indians diabetes dataset

    Running the code

    Running the initialization code

    Extracting the Pima Indians diabetes dataset

    Examining the output

    Output from the str() function

    Output from the summary() function

    Comparing outcomes

    Checking for missing values

    Imputing the missing values

    Checking the imputations (reader exercise)

    Missing values complete!

    Calculating the correlation matrices

    Calculating the column means

    Simulating the data

    Which correlations to use?

    Checking the object type

    Simulating the negative cases

    Concatenating the positive and negative cases into a single Spark dataframe

    Running summary statistics

    Saving your work

    Summary

    Exploring Large Datasets Using Spark

    Performing some exploratory analysis on positives

    Displaying the contents of a Spark dataframe

    Graphing using native graph features

    Running pairwise correlations directly on a Spark dataframe

    Cleaning up and caching the table in memory

    Some useful Spark functions to explore your data

    Count and groupby

    Covariance and correlation functions

    Creating new columns

    Constructing a cross-tab

    Contrasting histograms

    Plotting using ggplot

    Spark SQL

    Registering tables

    Issuing SQL through the R interface

    Using SQL to examine potential outliers

    Creating some aggregates

    Picking out some potential outliers using a third query

    Changing to the SQL API

    SQL – computing a new column using the Case statement

    Evaluating outcomes based upon the Age segment

    Computing mean values for all of the variables

    Exporting data from Spark back into R

    Running local R packages

    Using the pairs function (available in the base package)

    Generating a correlation plot

    Some tips for using Spark

    Summary

    Spark Machine Learning - Regression and Cluster Models

    About this chapter/what you will learn

    Reading the data

    Running a summary of the dataframe and saving the object

    Splitting the data into train and test datasets

    Generating the training datasets

    Generating the test dataset

    A note on parallel processing

    Introducing errors into the test data set

    Generating a histogram of the distribution

    Generating the new test data with errors

    Spark machine learning using logistic regression

    Examining the output

    Regularization Models

    Predicting outcomes

    Plotting the results

    Running predictions for the test data

    Combining the training and test dataset

    Exposing the three tables to SQL

    Validating the regression results

    Calculating goodness of fit measures

    Confusion matrix

    Confusion matrix for test group

    Distribution of average errors by group

    Plotting the data

    Pseudo R-square

    Root-mean-square error (RMSE)

    Plotting outside of Spark

    Collecting a sample of the results

    Examining the distributions by outcome

    Registering some additional tables

    Creating some global views

    User exercise

    Cluster analysis

    Preparing the data for analysis

    Reading the data from the global views

    Inputting the previously computed means and standard deviations

    Joining the means and standard deviations with the training data

    Joining the means and standard deviations with the test data

    Normalizing the data

    Displaying the output

    Running the k-means model

    Fitting the model to the training data

    Fitting the model to the test data

    Graphically display cluster assignment

    Plotting via the Pairs function

    Characterizing the clusters by their mean values

    Calculating mean values for the test data

    Summary

    Spark Models – Rule-Based Learning

    Loading the stop and frisk dataset

    Importing the CSV file to databricks

    Reading the table

    Running the first cell

    Reading the entire file into memory

    Transforming some variables to integers

    Discovering the important features

    Eliminating some factors with a large number of levels

    Test and train datasets

    Examining the binned data

    Running the OneR model

    Interpreting the output

    Constructing new variables

    Running the prediction on the test sample

    Another OneR example

    The rules section

    Constructing a decision tree using Rpart

    First collect the sample

    Decision tree using Rpart

    Plot the tree

    Running an alternative model in Python

    Running a Python Decision Tree

    Reading the Stop and Frisk table

    Indexing the classification features

    Mapping to an RDD

    Specifying the decision tree model

    Producing a larger tree

    Visual trees

    Comparing train and test decision trees

    Summary

    Preface

    This is a different kind of predictive analytics book. My original intention was to introduce predictive analytics techniques targeted towards legacy analytics folks, using open source tools.

    However, I soon realized that there were certain aspects of legacy analytics tools that could benefit the new generation of data scientists. Having worked a large part of my career in enterprise data solutions, I was interested in writing about some different kinds of topics, such as analytics methodologies, agile, metadata, SQL analytics, and reproducible research, which are often neglected in data science/predictive analytics books but are still critical to the success of an analytics project.

    I also wanted to write about some underrepresented analytics techniques that extend beyond standard regression and classification tasks, such as using survival analysis to predict customer churn, and using market basket analysis as a recommendation engine.

    Since there is a lot of movement towards cloud-based solutions, I thought it was important to cover cloud-based analytics (big data) as well, so I included several chapters on developing predictive analytics solutions within a Spark environment.

    Whatever your orientation is, a key point of this book is collaboration, and I hope that regardless of your definition of data science, predictive analytics, big data, or even a benign term such as forecasting, you will find something here that suits your needs.

    Furthermore, I wanted to pay homage to the domain expert as part of the data science team. Often these analysts are given no fancier title than business analyst, but they can make the difference between a successful analytics project and one that falls flat on its face. Hopefully, some of the topics I discuss will strike a chord with them and get them more interested in some of the technical concepts of predictive analytics.

    When I was asked by Packt to write a book about predictive analytics, I first wondered what would be a good open source language to bridge the gap between legacy analytics and today's data scientist world. I thought about this considerably, since each language brings its own nuances in terms of how solutions to problems are expressed. However, I decided ultimately not to sweat the details, since predictive analytics concepts are not language-dependent, and the choice of language often is determined by personal preference as well as what is in use within the company in which you work.

    I chose the R language because my background is in statistics, and I felt that R had good statistical rigor. It now has reasonable integration with proprietary software such as SAS, good integration with relational database systems, and good support for web protocols. It also has an excellent plotting and visualization system and, along with its many good user-contributed packages, covers most statistical and predictive analytics tasks.

    Regarding statistics, I suggest that you learn as much statistics as you can. Knowing statistics can help you separate good models from bad, and help you identify many problems in bad data just by understanding basic concepts such as measures of central tendency (mean, median, mode), hypothesis testing, p-values, and effect sizes. It will also help you shy away from merely running a package in an automated way, and help you look a little at what is under the hood.

    One downside to R is that it processes data in memory, so available memory can limit the size of the datasets you can analyze on a single PC. For the datasets we use in this book, there should be no problems running R on a single PC. If you are interested in analyzing big data, I do spend several chapters discussing R and Spark within a cloud environment, in which you can process very large datasets that are distributed across many different computers.

    Speaking of the datasets used in this book, I did not want to use the same datasets that you see analyzed repeatedly. Some of these datasets are excellent for demonstrating techniques, but I wanted some alternatives. However, I did not see a whole lot of alternatives that I thought would be useful for this book. Some were from unknown sources, some needed formal permission to use, and some lacked a good data dictionary. So, for many chapters, I ended up generating my own data using simulation techniques in R. I believe that was a good choice, since it enabled me to introduce some data generating techniques that you can use in your own work.
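
    To give a flavor of what that looks like, here is a minimal sketch of the kind of simulation I mean, using only base R sampling functions; the variable names and distributions are illustrative, not taken from any chapter:

        # A minimal data-simulation sketch using base R; the columns and
        # distributions here are illustrative only.
        set.seed(123)                                          # reproducible results
        n <- 1000
        customers <- data.frame(
          age     = round(rnorm(n, mean = 45, sd = 12)),       # roughly normal ages
          gender  = sample(c("M", "F"), n, replace = TRUE),    # random gender labels
          churned = rbinom(n, size = 1, prob = 0.25)           # ~25% churn rate
        )
        summary(customers)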

    The data I used covers a good spectrum of marketing, retail, and healthcare applications. I would also have liked to include some financial predictive analytics use cases, but ran out of time. Maybe I will leave that for another book!

    What this book covers

    Chapter 1, Getting Started with Predictive Analytics, begins with a little bit of history about how predictive analytics developed. We then discuss some different roles of predictive analytics practitioners and describe the industries in which they work. Ways to organize predictive analytics projects on a PC are discussed next, the R language is introduced, and we end the chapter with a short example of a predictive model.

    Chapter 2, The Modeling Process, discusses how the development of predictive models can be organized into a series of stages, each with different goals, such as exploration and problem definition, leading to the actual development of a predictive model. We discuss two important analytics methodologies, CRISP-DM and SEMMA. Code examples are sprinkled throughout the chapter to demonstrate some of the ideas central to the methodologies, so you will, hopefully, never be bored.

    Chapter 3, Inputting and Exploring Data, introduces various ways that you can bring your own input data into R. We also discuss various data preparation techniques using standard SQL functions as well as analogous methods using the R dplyr package. Have no data to input? No problem. We will show you how to generate your own human-like data using the R package wakefield.
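
    As a taste of the wakefield approach mentioned above, a call along the following lines generates a reproducible, human-like dataframe; the columns chosen here are illustrative, not necessarily the ones used in the chapter:

        # Illustrative use of the wakefield package to simulate human-like data;
        # the column choices are examples only.
        library(wakefield)
        set.seed(10)
        df <- r_data_frame(
          n = 500,   # number of simulated people
          id,        # unique identifier
          age,       # plausible ages
          sex,       # male/female factor
          income     # simulated incomes
        )
        head(df)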

    Chapter 4, Introduction to Regression Algorithms, begins with a discussion of supervised versus unsupervised algorithms. The rest of the chapter concentrates on regression algorithms, which represent the supervised algorithm category. You will learn about interpreting regression output such as model coefficients and residual plots. There is even an interactive game that tests whether you can determine if a series of residuals is random or not.
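
    In base R, the fit-and-inspect pattern the chapter develops looks roughly like this, shown here on the built-in mtcars data rather than the chapter's own dataset (glm() with its default gaussian family fits a linear regression):

        # Illustrative regression fit and residual check on built-in data.
        fit <- glm(mpg ~ wt + hp, data = mtcars)  # linear model via glm()
        summary(fit)                              # coefficients and p-values
        plot(residuals(fit))                      # look for patterns in the residuals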

    Chapter 5, Introduction to Decision Trees, Clustering, and SVM, concentrates on three other core predictive algorithms that have widespread use and, along with regression, can be used to solve many, if not most, of your predictive analytics problems. The last algorithm discussed, the Support Vector Machine (SVM), is often used with high-dimensional data, such as unstructured text, so we accompany this example with some text mining techniques applied to customer complaint comments.
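
    The basic shape of an SVM fit in R, via the e1071 package, is sketched below on the built-in iris data; the chapter applies the same idea to a matrix built from complaint text:

        # Illustrative SVM classification with the e1071 package on built-in data.
        library(e1071)
        model <- svm(Species ~ ., data = iris)          # fit an SVM classifier
        pred  <- predict(model, iris)                   # predict on the same data
        table(predicted = pred, actual = iris$Species)  # confusion table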

    Chapter 6, Using Survival Analysis to Predict and Analyze Customer Churn, discusses a specific modeling technique known as survival analysis and follows a hypothetical customer marketing satisfaction and retention example. We will also delve more deeply into simulating customer choice using some sampling functions available in R.
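
    The core objects of that chapter come from the R survival package; a minimal sketch using its built-in lung dataset (the chapter builds equivalent objects from simulated churn data) looks like this:

        # Minimal survival analysis sketch on the survival package's lung data.
        library(survival)
        km <- survfit(Surv(time, status) ~ sex, data = lung)   # Kaplan-Meier curves by sex
        plot(km, col = c("blue", "red"),
             xlab = "Days", ylab = "Survival probability")
        cox <- coxph(Surv(time, status) ~ sex + age, data = lung)  # Cox regression
        summary(cox)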

    Chapter 7, Using Market Basket Analysis as a Recommender Engine, introduces the concept of association rules and market basket analysis, and steps you through some techniques that can predict future purchases based upon various combinations of previous purchases from an online retail store. It also introduces some text analytics techniques coupled with some cluster analysis that places various customers into different segments. You will learn some additional data cleaning techniques, and learn how to generate some interesting association plots.
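
    In R, association rules are typically mined with the arules package; here is a sketch using its bundled Groceries transactions, with arbitrary example thresholds for support and confidence:

        # Illustrative market basket analysis with the arules package.
        library(arules)
        data(Groceries)
        rules <- apriori(Groceries,
                         parameter = list(supp = 0.01, conf = 0.5))  # mine rules
        inspect(head(sort(rules, by = "lift"), 5))  # five rules with the highest lift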

    Chapter 8, Exploring Health Care Enrollment Data as a Time Series, introduces time series analytics. Healthcare enrollment data from the CMS website is first explored. Then we move on to defining some basic time series concepts such as simple and exponential moving averages. Finally, we work with the R forecast package which, as its name implies, helps you to perform some time series forecasting.
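
    In miniature, the forecast package workflow the chapter follows looks like this, with the built-in AirPassengers series standing in for the enrollment data:

        # Miniature ets() forecasting workflow on a built-in time series.
        library(forecast)
        fit <- ets(AirPassengers)     # automatically selected exponential smoothing model
        fc  <- forecast(fit, h = 12)  # forecast 12 periods ahead
        plot(fc)                      # forecast plot with confidence bands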

    Chapter 9, Introduction to Spark Using R, introduces SparkR, which is an environment for accessing large Spark clusters using R. No local version of R needs to be installed. It also introduces Databricks, which is a cloud-based environment for running R (as well as Python, SQL, and other languages) against Spark-based big data. This chapter also demonstrates techniques for transforming small datasets into larger Spark clusters, using the Pima Indians Diabetes dataset as a reference.
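
    For orientation, a SparkR session (assuming a Spark 2.x installation) begins roughly as follows; inside a Databricks notebook the session already exists, so the session call can be skipped:

        # Rough shape of a SparkR session; assumes a local Spark 2.x installation.
        library(SparkR)
        sparkR.session()              # start, or connect to, a Spark session
        df <- as.DataFrame(faithful)  # promote a local R dataframe to Spark
        head(df)                      # peek at the distributed dataframe
        printSchema(df)               # inspect its column types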

    Chapter 10, Exploring Large Datasets Using Spark, shows how to perform exploratory data analysis using a combination of SparkR and Spark SQL on the Pima Indians Diabetes data loaded into Spark. We will learn the basics of exploring Spark data using some Spark-specific commands that allow us to filter, group, summarize, and visualize our Spark data.
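
    Mixing SparkR commands with Spark SQL follows the pattern sketched below; mtcars stands in for the chapter's dataset, and an active Spark session is assumed:

        # Illustrative SparkR plus Spark SQL exploration on stand-in data.
        library(SparkR)
        df <- as.DataFrame(mtcars)           # stand-in for the chapter's dataframe
        createOrReplaceTempView(df, "cars")  # register the dataframe for SQL access
        res <- sql("SELECT cyl, COUNT(*) AS n, AVG(mpg) AS avg_mpg
                    FROM cars GROUP BY cyl") # group and summarize via SQL
        head(res)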

    Chapter 11, Spark Machine Learning – Regression and Cluster Models, covers machine learning by first illustrating a logistic regression model built using a Spark cluster. We will learn how to split Spark data into training and test datasets, run a logistic regression model, and then evaluate its performance.
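
    In SparkR terms, those steps reduce to a sketch like the following, where mtcars and its binary am column stand in for the chapter's data and an active Spark session is assumed:

        # Sketch of a train/test split plus logistic regression in SparkR;
        # mtcars and am are stand-ins, not names from the chapter.
        library(SparkR)
        df <- as.DataFrame(mtcars)                         # stand-in Spark dataframe
        splits <- randomSplit(df, c(0.8, 0.2), seed = 42)  # train/test partition
        train  <- splits[[1]]
        test   <- splits[[2]]
        model  <- spark.glm(train, am ~ wt + hp, family = "binomial")  # logistic model
        preds  <- predict(model, test)                     # score the held-out rows
        head(select(preds, "am", "prediction"))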

    Chapter 12, Spark Models – Rule-Based Learning, teaches you how to run decision tree models in Spark using the Stop and Frisk dataset. You will learn how to overcome some of the algorithmic limitations of the Spark MLlib environment by extracting some cluster samples to your local machine and then running some non-Spark algorithms that you are already familiar with. This chapter also introduces a new rule-based algorithm, OneR, and demonstrates how you can mix different languages in Spark, combining R, SQL, and even Python code in the same notebook using the %magic directive.
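
    The OneR algorithm itself is available in the OneR package on CRAN; its basic usage, shown here on the built-in iris data rather than the Stop and Frisk sample, is brief:

        # Illustrative OneR rule-based model on built-in data; the chapter applies
        # the same calls to a sample collected from the Spark cluster.
        library(OneR)
        binned <- optbin(iris)                     # optimally bin the numeric predictors
        model  <- OneR(Species ~ ., data = binned) # fit the one-rule model
        summary(model)                             # show the chosen rule and accuracy
        pred <- predict(model, binned)             # predictions from the single rule
        eval_model(pred, binned)                   # confusion matrix and accuracy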

    What you need for this book

    This is neither an introductory predictive analytics book, nor an introductory book for learning R or Spark. Some knowledge of base R data manipulation techniques is expected. Some prior knowledge of predictive analytics is useful. As mentioned earlier, knowledge of basic statistical concepts such as hypothesis testing, correlation, means, standard deviations, and p-values will also help you navigate this book.

    Who this book is for

    This book is for those who have already had an introduction to R, and are looking to learn how to develop enterprise predictive analytics solutions. Additionally, traditional business analysts and managers who wish to extend their skills into predictive analytics using open source R may find the book useful. Existing predictive analytics practitioners who know another language, or those who wish to learn about analytics using Spark, will also find the chapters on Spark and R beneficial.

    Conventions

    In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

    Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows:

    Save all output to the /PracticalPredictiveAnalytics/Outputs directory.

    A block of code is set as follows:
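
    The snippet below is an illustrative placeholder in that style:

        # An illustrative placeholder snippet in the book's code style.
        library(stargazer)
        fit <- lm(mpg ~ wt, data = mtcars)
        stargazer(fit, type = "text")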

    When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold.

    Any command-line input or output (including commands at the R console) is written as follows:
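
    An illustrative placeholder in that style:

        > install.packages("stargazer")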

    New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: Clicking the Next button moves you to the next screen.

    Warnings or important notes appear in a box like this.

    Tips and tricks appear like this.

    Reader feedback

    Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail feedback@packtpub.com, and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

    Customer support

    Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

    Downloading the example code

    You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

    You can download the code files by following these steps:

    Log in or register to our website using your e-mail address and password.

    Hover the mouse pointer on the SUPPORT tab at the top.

    Click on Code Downloads & Errata.

    Enter the name of the book in the Search box.

    Select the book for which you're looking to download the code files.

    Choose from the drop-down menu where you purchased this book from.

    Click on Code Download.

    Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

    WinRAR / 7-Zip for Windows

    Zipeg / iZip / UnRarX for Mac

    7-Zip / PeaZip for Linux

    The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Practical-Predictive-Analytics. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

    Downloading the color images of this book

    We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/PracticalPredictiveAnalytics_ColorImages.pdf.

    Errata

    Although we have taken every care to ensure the accuracy of our content, mistakes do happen.
