Introduction to Statistical and Machine Learning Methods for Data Science
About this ebook
Data science is an exciting, interdisciplinary field that extracts insights from data to solve business problems. This book introduces common data science techniques and methods and shows you how to apply them in real-world case studies. From data preparation and exploration to model assessment and deployment, this book describes every stage of the analytics life cycle, including a comprehensive overview of unsupervised and supervised machine learning techniques. The book guides you through the necessary steps to pick the best techniques and models and then implement those models to successfully address the original business need.
No software is shown in the book, and mathematical details are kept to a minimum. This allows you to develop an understanding of the fundamentals of data science, no matter what background or experience level you have.
Carlos Andre Reis Pinheiro
Dr. Carlos Andre Reis Pinheiro is a Principal Data Scientist at SAS and a Visiting Professor at Data ScienceTech Institute in France. He has been working in analytics since 1996 for some of the largest telecommunications providers in Brazil in multiple roles from technical to executive. He worked as a Senior Data Scientist for EMC in Brazil on network analytics, optimization, and text analytics projects, and as a Lead Data Scientist for Teradata on machine learning projects. Dr. Pinheiro has a BSc in Applied Mathematics and Computer Science, an MSc in Computing, and a DSc in Engineering from the Federal University of Rio de Janeiro. Carlos has completed a series of postdoctoral research terms in different fields, including Dynamic Systems at IMPA, Brazil; Social Network Analysis at Dublin City University, Ireland; Transportation Systems at Université de Savoie, France; Dynamic Social Networks and Human Mobility at Katholieke Universiteit Leuven, Belgium; and Urban Mobility and Multi-modal Traffic at Fundação Getúlio Vargas, Brazil. He has published several papers in international journals and conferences, and he is the author of Social Network Analysis in Telecommunications and Heuristics in Analytics: A Practical Perspective of What Influences Our Analytical World, both published by John Wiley & Sons, Inc.
Introduction to Statistical and Machine Learning Methods for Data Science
Carlos Andre Reis Pinheiro
Mike Patetta
sas.com/books
The correct bibliographic citation for this manual is as follows: Pinheiro, Carlos Andre Reis and Mike Patetta. 2021. Introduction to Statistical and Machine Learning Methods for Data Science. Cary, NC: SAS Institute Inc.
Introduction to Statistical and Machine Learning Methods for Data Science
Copyright © 2021, SAS Institute Inc., Cary, NC, USA
ISBN 978-1-953329-64-6 (Hardcover)
ISBN 978-1-953329-60-8 (Paperback)
ISBN 978-1-953329-61-5 (Web PDF)
ISBN 978-1-953329-62-2 (EPUB)
ISBN 978-1-953329-63-9 (Kindle)
All Rights Reserved. Produced in the United States of America.
For a hard copy book: No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.
For a web download or e-book: Your use of this publication shall be governed by the terms established by the vendor at the time you acquire this publication.
The scanning, uploading, and distribution of this book via the Internet or any other means without the permission of the publisher is illegal and punishable by law. Please purchase only authorized electronic editions and do not participate in or encourage electronic piracy of copyrighted materials. Your support of others’ rights is appreciated.
U.S. Government License Rights; Restricted Rights: The Software and its documentation is commercial computer software developed at private expense and is provided with RESTRICTED RIGHTS to the United States Government. Use, duplication, or disclosure of the Software by the United States Government is subject to the license terms of this Agreement pursuant to, as applicable, FAR 12.212, DFAR 227.7202-1(a), DFAR 227.7202-3(a), and DFAR 227.7202-4, and, to the extent required under U.S. federal law, the minimum restricted rights as set out in FAR 52.227-19 (DEC 2007). If FAR 52.227-19 is applicable, this provision serves as notice under clause (c) thereof and no other notice is required to be affixed to the Software or documentation. The Government’s rights in Software and documentation shall be only those set forth in this Agreement.
SAS Institute Inc., SAS Campus Drive, Cary, NC 27513-2414
August 2021
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.
SAS software may be provided with certain third-party software, including but not limited to open-source software, which is licensed under its applicable third-party software license agreement. For license information about third-party software distributed with SAS software, refer to http://support.sas.com/thirdpartylicenses.
Contents
About This Book
About These Authors
Acknowledgments
Foreword
Chapter 1: Introduction to Data Science
Chapter Overview
Data Science
Mathematics and Statistics
Computer Science
Domain Knowledge
Communication and Visualization
Hard and Soft Skills
Data Science Applications
Data Science Lifecycle and the Maturity Framework
Understand the Question
Collect the Data
Explore the Data
Model the Data
Provide an Answer
Advanced Analytics in Data Science
Data Science Practical Examples
Customer Experience
Revenue Optimization
Network Analytics
Data Monetization
Summary
Additional Reading
Chapter 2: Data Exploration and Preparation
Chapter Overview
Introduction to Data Exploration
Nonlinearity
High Cardinality
Unstructured Data
Sparse Data
Outliers
Mis-scaled Input Variables
Introduction to Data Preparation
Representative Sampling
Event-based Sampling
Partitioning
Imputation
Replacement
Transformation
Feature Extraction
Feature Selection
Model Selection
Model Generalization
Bias–Variance Tradeoff
Summary
Chapter 3: Supervised Models – Statistical Approach
Chapter Overview
Classification and Estimation
Linear Regression
Use Case: Customer Value
Logistic Regression
Use Case: Collecting Predictive Model
Decision Tree
Use Case: Subscription Fraud
Summary
Chapter 4: Supervised Models – Machine Learning Approach
Chapter Overview
Supervised Machine Learning Models
Ensemble of Trees
Random Forest
Gradient Boosting
Use Case: Usage Fraud
Neural Network
Use Case: Bad Debt
Summary
Chapter 5: Advanced Topics in Supervised Models
Chapter Overview
Advanced Machine Learning Models and Methods
Support Vector Machines
Use Case: Fraud in Prepaid Subscribers
Factorization Machines
Use Case: Recommender Systems Based on Customer Ratings in Retail
Ensemble Models
Use Case Study: Churn Model for Telecommunications
Two-stage Models
Use Case: Anti-attrition
Summary
Additional Reading
Chapter 6: Unsupervised Models—Structured Data
Chapter Overview
Clustering
Hierarchical Clustering
Use Case: Product Segmentation
Centroid-based Clustering (k-means Clustering)
Use Case: Customer Segmentation
Self-organizing Maps
Use Case Study: Insolvent Behavior
Cluster Evaluation
Cluster Profiling
Additional Topics
Summary
Additional Reading
Chapter 7: Unsupervised Models—Semi Structured Data
Chapter Overview
Association Rules Analysis
Market Basket Analysis
Confidence and Support Measures
Use Case: Product Bundle Example
Expected Confidence and Lift Measures
Association Rules Analysis Evaluation
Use Case: Product Acquisition
Sequence Analysis
Use Case: Next Best Offer
Link Analysis
Use Case: Product Relationships
Path Analysis
Use Case Study: Online Experience
Text Analytics
Use Case Study: Call Center Categorization
Summary
Additional Reading
Chapter 8: Advanced Topics in Unsupervised Models
Chapter Overview
Network Analysis
Network Subgraphs
Network Metrics
Use Case: Social Network Analysis to Reduce Churn in Telecommunications
Network Optimization
Network Algorithms
Use Case: Smart Cities – Improving Commuting Routes
Summary
Chapter 9: Model Assessment and Model Deployment
Chapter Overview
Methods to Evaluate Model Performance
Speed of Training
Speed of Scoring
Business Knowledge
Fit Statistics
Data Splitting
K-fold Cross-validation
Goodness-of-fit Statistics
Confusion Matrix
ROC Curve
Model Evaluation
Model Deployment
Challenger Models
Monitoring
Model Operationalization
Summary
About This Book
What Does This Book Cover?
This book gives an overview of the statistical and machine learning methods used in data science projects, with an emphasis on their applicability to business problem solving. No software is shown, and the mathematical details are kept to a minimum. The book describes the tasks associated with all stages of the analytical life cycle, including data preparation and data exploration, feature engineering and selection, analytical modeling with supervised and unsupervised techniques, and model assessment and deployment. It describes the techniques and provides real-world case studies to illustrate them. Readers will learn the most important techniques and methods related to data science and when to apply them to different business problems. The book provides a comprehensive overview of the statistical and machine learning techniques associated with data science initiatives and guides readers through the necessary steps to successfully deploy data science projects.
This book covers the most important data science skills, the different types of data science applications, the phases of the data science lifecycle, the techniques used in the data preparation steps, some of the most common techniques associated with supervised machine learning models (linear and logistic regression, decision trees, forests, gradient boosting, neural networks, support vector machines, and factorization machines), advanced supervised modeling methods such as ensemble models and two-stage models, the most important techniques associated with unsupervised machine learning models (clustering, association rules, sequence analysis, link analysis, path analysis, network analysis, and network optimization), the methods and fit statistics used to assess model results, different approaches to deploying analytical models in production, and the main topics related to the model operationalization process.
This book does not cover the techniques for data engineering in depth. It also does not provide any programming code for the supervised and unsupervised models, nor does it show in practice how to deploy models in production.
Is This Book for You?
This book is intended for data scientists, data analysts, data engineers, business analysts, market analysts, and computer scientists. However, anyone who wants to learn more about data science skills could benefit from reading it.
What Are the Prerequisites for This Book?
There are no prerequisites for this book.
We Want to Hear from You
SAS Press books are written by SAS Users for SAS Users. We welcome your participation in their development and your feedback on SAS Press books that you are using. Please visit sas.com/books to do the following:
Sign up to review a book
Recommend a topic
Request information about how to become a SAS Press author
Provide feedback on a book
About These Authors
Dr. Carlos Pinheiro is a Principal Data Scientist at SAS and a Visiting Professor at Data ScienceTech Institute in France. He has been working in analytics since 1996 for some of the largest telecommunications providers in Brazil in multiple roles from technical to executive. He worked as a Senior Data Scientist for EMC in Brazil on network analytics, optimization, and text analytics projects, and as a Lead Data Scientist for Teradata on machine learning projects. Dr. Pinheiro has a BSc in Applied Mathematics and Computer Science, an MSc in Computing, and a DSc in Engineering from the Federal University of Rio de Janeiro. Carlos has completed a series of postdoctoral research terms in different fields, including Dynamic Systems at IMPA, Brazil; Social Network Analysis at Dublin City University, Ireland; Transportation Systems at Université de Savoie, France; Dynamic Social Networks and Human Mobility at Katholieke Universiteit Leuven, Belgium; and Urban Mobility and Multi-modal Traffic at Fundação Getúlio Vargas, Brazil. He has published several papers in international journals and conferences, and he is the author of Social Network Analysis in Telecommunications and Heuristics in Analytics: A Practical Perspective of What Influences Our Analytical World, both published by John Wiley & Sons, Inc.
Michael Patetta has been a statistical instructor for SAS since 1994. He teaches a variety of courses including Supervised Machine Learning Procedures Using SAS® Viya® in SAS® Studio, Predictive Modeling Using Logistic Regression, Introduction to Data Science Statistical Methods, and Regression Methods Using SAS Viya. Before coming to SAS, Michael worked in the North Carolina State Health Department for 10 years as a health statistician and program manager. He has authored or co-authored 10 published papers since 1983. Michael has a BA from the University of Notre Dame and an MA from the University of North Carolina at Chapel Hill. In his spare time, he loves to hike in national parks.
Learn more about these authors by visiting their author pages, where you can download free book excerpts, access example code and data, read the latest reviews, get updates, and more:
http://support.sas.com/pinheiro
http://support.sas.com/patetta
To Daniele, Lucas and Maitê.
Acknowledgments
I joined SAS on December 7th, 2015, but many people believed I had worked for SAS before. Not officially. But indeed, I have a long history with SAS.
I started using SAS in 2002 when I was working for Brasil Telecom, where I created a very active data mining group, developing supervised and unsupervised models across the entire corporation. In 2008, I moved to Dublin, Ireland, to perform a postdoc at Dublin City University. For two years I used SAS for social network analysis. I deployed SNA models at Eircom as a result of my research. After that, I spent six months at SAS Ireland using the brand new OPTGRAPH procedure. I developed some models to detect fraud in auto insurance and taxpayers.
In 2010, I returned to Brazil, and I had the opportunity to create an Analytics Lab at Oi. The Lab focused on developing innovative analytics for marketing, fraud, finance, collecting and engineering. SAS was a big sponsor/partner of it.
At the beginning of 2012, I worked for a few months with SAS Turkey creating some network analysis projects for communications companies, and thereafter I moved to Annecy, France, to perform a postdoc at Université de Savoie. The research was focused on transportation systems, and I used SAS to develop network models. In 2013, I moved to Leuven, Belgium, to perform a postdoc at KU Leuven. The research was focused on dynamic network analysis, and I also used SAS for the model development. Back in Brazil in 2014, I worked as a data scientist for EMC² and Teradata, but most of the time I was still using SAS, sometimes with open-source packages. In 2014/2015, I performed a postdoc at Fundação Getúlio Vargas. The research was focused on human mobility and, guess what, I used SAS.
Finally, thanks to Cat Truxillo, I found my place at SAS. I joined the Advanced Analytics group in Education. I have learned so much working in this group. It was a big challenge to keep up with such brilliant minds. I would like to thank each and every person in the Education group who has taught me over those years, but I would like to name a few of them specifically: Chris Daman, Robert Blanchard, Jeff Thompson, Terry Woodfield, and Chip Wells. To all of you, many thanks!
A special thanks to Jeff Thompson and Tarek Elnaccash for a relentless review. Both were instrumental in getting this book done.
Thanks to Suzanne Morgen for being an amazing editor and walking us through this process so smoothly.
Carlos Andre Reis Pinheiro
The idea for this book originated with Carlos Pinheiro. His experience as a data scientist has always impressed me, and this book highlights many of Carlos’s success stories. Therefore, I would like to thank Carlos for the inspiration for this book. I would like to thank the reviewers, Jeff Thompson, Tarek Elnaccash, and Cat Truxillo, for their diligent work to make the book technically accurate. Finally, I would like to thank Suzanne Morgen, whose edits made the book flow as smoothly as possible.
Michael James Patetta
Foreword
The book you have open in front of you provides a taste of many data science techniques, interspersed with tales of real-world implementations and discoveries. The idea for this book originated when my team and I were designing the SAS Academy for Data Science. We designed a fairly ambitious training and certification program, assuming that people who enroll in the academy would have several years’ experience working with data and analytics before they get started.
In 2015, the SAS Academy for Data Science was launched as a self-paced e-learning program. Designing the academy’s curriculum required research into the state of data science, discussions with faculty training the next generation of data scientists, and shadowing consultants who bring the data to life for their clients. Those topics shift and evolve over time. Today, the academy is one of the top data science training programs in the world. The curriculum has been adopted by university graduate programs on every continent except Antarctica.
What we have found in practice, however, is that there is a considerably broader audience who want to enroll in the academy, including smart people who have experience in a different area, but do not have the benefit of several years’ data analysis to guide their thinking of how they can apply analytics in their own fields.
For learners like these, where to begin? Carlos Pinheiro and Mike Patetta had the idea to create a short course that provides an overview of data science methods and lots of first-hand experiences as working data scientists.
Carlos Andre Reis de Pinheiro has written extensively in data science, including a Business Knowledge Series course (and later, his book) on Social Network Analysis. It was through this course that Carlos and I started working together. The first thing you notice about Carlos is that he is a born storyteller. The second thing you notice is that he loves soccer—I mean he really, really loves soccer. Over time I got to know more about this soccer-crazy professor who can keep everyone’s attention with amazing stories from his data science research. Carlos has lived and worked in (at least) six different countries, and he is fluent in (at least) four languages. Here is a person with unstoppable curiosity and drive for growth. In 2016, he joined my colleagues and me in the Advanced Analytics Education department at SAS, where he has contributed his relentless hard work and ingenuity to solve business problems with data and analytics. Today he takes a direct, hands-on approach to showing companies what is possible with some data management elbow-grease, some well-trained models, and curiosity.
Mike Patetta has been a friend and colleague for over 20 years. In fact, he was the first person who interviewed me, in 1999, when I applied to work at SAS. Mike has a natural gift for educating others. He is someone who can dive into an unfamiliar topic in statistics and distill a shelf-full of books and journal articles down to a few learner-friendly hour-long lectures. The partnership between these two authors resulted in a course—and now, a book—that is rich with detailed information, written in an easy, comfortable style, with ample use cases from the authors’ own experiences.
Data science is fun, or that’s what recruiters would have you believe. Data science entails coaxing patterns, meanings, and insights from large and diverse volumes of messy data. In practice, that means spending more time than you might like on getting access to data, determining what is in a record, how records are represented in files, how the file is structured, and how to combine the information in a meaningful way with other files. That is, for many of us, most of the work a data scientist does. So where is the fun?
The reward of data science work comes when the data are organized, cleaned, and arranged for analysis. That first batch of visualizations, the feature engineering, the modeling—that is what makes data science such rewarding work. More than almost any other career, data scientists get to ask question after question, the answers leading to subsequent questions. From one day to another, your work can be completely different. You don’t get to tell the data what to say—the data will speak to you, if you have the tools and curiosity to listen.
This book (and its accompanying course) provides a framework for doing project work, the analytics lifecycle. The analytics lifecycle acknowledges and addresses all members of the data science team (IT, computer engineers, statisticians, and executive stakeholders) and makes clear how the work and responsibilities are divided through the entire lifecycle of a data science project. The emphasis of this book is on making sense: of data, of models, and of results from deployed models. You might say that the ideal audience for this book is a Citizen Data Scientist (to use Gartner’s term) or a statistical business analyst. This is not a book that teaches about writing scripts to pull