Introduction to Statistical and Machine Learning Methods for Data Science

Ebook, 345 pages, 3 hours

Name: Introduction to Statistical and Machine Learning Methods for Data Science
Author: Carlos Andre Reis Pinheiro
ISBN: 9781953329622

By Carlos Andre Reis Pinheiro and Mike Patetta

About this ebook

Boost your understanding of data science techniques to solve real-world problems

Data science is an exciting, interdisciplinary field that extracts insights from data to solve business problems. This book introduces common data science techniques and methods and shows you how to apply them in real-world case studies. From data preparation and exploration to model assessment and deployment, this book describes every stage of the analytics life cycle, including a comprehensive overview of unsupervised and supervised machine learning techniques. The book guides you through the necessary steps to pick the best techniques and models and then implement those models to successfully address the original business need.

No software is shown in the book, and mathematical details are kept to a minimum. This allows you to develop an understanding of the fundamentals of data science, no matter what background or experience level you have.

Language: English

Publisher: SAS Institute

Release date: Aug 6, 2021

ISBN: 9781953329622

Author

Carlos Andre Reis Pinheiro

Dr. Carlos Andre Reis Pinheiro is a Principal Data Scientist at SAS and a Visiting Professor at Data ScienceTech Institute in France. He has worked in analytics since 1996, holding roles from technical to executive at some of the largest telecommunications providers in Brazil. He worked as a Senior Data Scientist for EMC in Brazil on network analytics, optimization, and text analytics projects, and as a Lead Data Scientist for Teradata on machine learning projects. Dr. Pinheiro has a BSc in Applied Mathematics and Computer Science, an MSc in Computing, and a DSc in Engineering from the Federal University of Rio de Janeiro. He has completed a series of postdoctoral research terms in different fields, including Dynamic Systems at IMPA, Brazil; Social Network Analysis at Dublin City University, Ireland; Transportation Systems at Université de Savoie, France; Dynamic Social Networks and Human Mobility at Katholieke Universiteit Leuven, Belgium; and Urban Mobility and Multi-modal Traffic at Fundação Getúlio Vargas, Brazil. He has published several papers in international journals and conferences, and he is the author of Social Network Analysis in Telecommunications and Heuristics in Analytics: A Practical Perspective of What Influences Our Analytical World, both published by John Wiley and Sons, Inc.

byData Engineering Podcast
0 ratings
0% found this document useful
Quantifying The Return On Investment For Your Data Team: As businesses increasingly invest in technology and talent focused on data engineering and analytics, they want to know whether they are benefiting. So how do you calculate the return on investment for data? In this episode Barr Moses and Anna Filippova explore that question and provide useful exercises to start answering that in your company.
Podcast episode
Quantifying The Return On Investment For Your Data Team: As businesses increasingly invest in technology and talent focused on data engineering and analytics, they want to know whether they are benefiting. So how do you calculate the return on investment for data? In this episode Barr Moses and Anna Filippova explore that question and provide useful exercises to start answering that in your company.
byData Engineering Podcast
0 ratings
0% found this document useful
An Overview Of The Sate Of Data Orchestration In An Increasingly Complex Data Ecosystem: Data systems are inherently complex and often require integration of multiple technologies. Orchestrators are centralized utilities that control the execution and sequencing of interdependent operations. This offers a single location for managing visibility and error handling so that data platform engineers can manage complexity. In this episode Nick Schrock, creator of Dagster, shares his perspective on the state of data orchestration technology and its application to help inform its implementation in your environment.
Podcast episode
An Overview Of The Sate Of Data Orchestration In An Increasingly Complex Data Ecosystem: Data systems are inherently complex and often require integration of multiple technologies. Orchestrators are centralized utilities that control the execution and sequencing of interdependent operations. This offers a single location for managing visibility and error handling so that data platform engineers can manage complexity. In this episode Nick Schrock, creator of Dagster, shares his perspective on the state of data orchestration technology and its application to help inform its implementation in your environment.
byData Engineering Podcast
0 ratings
0% found this document useful
Building Applications With Data As Code On The DataOS: The modern data stack has made it more economical to use enterprise grade technologies to power analytics at organizations of every scale. Unfortunately it has also introduced new overhead to manage the full experience as a single workflow. At the Modern Data Company they created the DataOS platform as a means of driving your full analytics lifecycle through code, while providing automatic knowledge graphs and data discovery. In this episode Srujan Akula explains how the system is implemented and how you can start using it today with your existing data systems.
Podcast episode
Building Applications With Data As Code On The DataOS: The modern data stack has made it more economical to use enterprise grade technologies to power analytics at organizations of every scale. Unfortunately it has also introduced new overhead to manage the full experience as a single workflow. At the Modern Data Company they created the DataOS platform as a means of driving your full analytics lifecycle through code, while providing automatic knowledge graphs and data discovery. In this episode Srujan Akula explains how the system is implemented and how you can start using it today with your existing data systems.
byData Engineering Podcast
0 ratings
0% found this document useful

Skip carousel

Introduction to Statistical and Machine Learning Methods for Data Science

Carlos Andre Reis Pinheiro

Mike Patetta

sas.com/books

The correct bibliographic citation for this manual is as follows: Pinheiro, Carlos Andre Reis and Mike Patetta. 2021. Introduction to Statistical and Machine Learning Methods for Data Science. Cary, NC: SAS Institute Inc.

Introduction to Statistical and Machine Learning Methods for Data Science

ISBN 978-1-953329-64-6 (Hardcover)

ISBN 978-1-953329-60-8 (Paperback)

ISBN 978-1-953329-61-5 (Web PDF)

ISBN 978-1-953329-62-2 (EPUB)

ISBN 978-1-953329-63-9 (Kindle)

For a hard copy book: No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.

For a web download or e-book: Your use of this publication shall be governed by the terms established by the vendor at the time you acquire this publication.

The scanning, uploading, and distribution of this book via the Internet or any other means without the permission of the publisher is illegal and punishable by law. Please purchase only authorized electronic editions and do not participate in or encourage electronic piracy of copyrighted materials. Your support of others’ rights is appreciated.

U.S. Government License Rights; Restricted Rights: The Software and its documentation is commercial computer software developed at private expense and is provided with RESTRICTED RIGHTS to the United States Government. Use, duplication, or disclosure of the Software by the United States Government is subject to the license terms of this Agreement pursuant to, as applicable, FAR 12.212, DFAR 227.7202-1(a), DFAR 227.7202-3(a), and DFAR 227.7202-4, and, to the extent required under U.S. federal law, the minimum restricted rights as set out in FAR 52.227-19 (DEC 2007). If FAR 52.227-19 is applicable, this provision serves as notice under clause (c) thereof and no other notice is required to be affixed to the Software or documentation. The Government’s rights in Software and documentation shall be only those set forth in this Agreement.

SAS Institute Inc., SAS Campus Drive, Cary, NC 27513-2414

August 2021

SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.

SAS software may be provided with certain third-party software, including but not limited to open-source software, which is licensed under its applicable third-party software license agreement. For license information about third-party software distributed with SAS software, refer to http://support.sas.com/thirdpartylicenses.

About This Book

About These Authors

Acknowledgments

Foreword

Chapter 1: Introduction to Data Science

Chapter Overview

Data Science

Mathematics and Statistics

Computer Science

Domain Knowledge

Communication and Visualization

Hard and Soft Skills

Data Science Applications

Data Science Lifecycle and the Maturity Framework

Understand the Question

Collect the Data

Explore the Data

Model the Data

Provide an Answer

Advanced Analytics in Data Science

Data Science Practical Examples

Customer Experience

Revenue Optimization

Network Analytics

Data Monetization

Summary

Additional Reading

Chapter 2: Data Exploration and Preparation

Chapter Overview

Introduction to Data Exploration

Nonlinearity

High Cardinality

Unstructured Data

Sparse Data

Outliers

Mis-scaled Input Variables

Introduction to Data Preparation

Representative Sampling

Event-based Sampling

Partitioning

Imputation

Replacement

Transformation

Feature Extraction

Feature Selection

Model Selection

Model Generalization

Bias–Variance Tradeoff

Summary

Chapter 3: Supervised Models – Statistical Approach

Chapter Overview

Classification and Estimation

Linear Regression

Use Case: Customer Value

Logistic Regression

Use Case: Collecting Predictive Model

Decision Tree

Use Case: Subscription Fraud

Summary

Chapter 4: Supervised Models – Machine Learning Approach

Chapter Overview

Supervised Machine Learning Models

Ensemble of Trees

Random Forest

Gradient Boosting

Use Case: Usage Fraud

Neural Network

Use Case: Bad Debt

Summary

Chapter 5: Advanced Topics in Supervised Models

Chapter Overview

Advanced Machine Learning Models and Methods

Support Vector Machines

Use Case: Fraud in Prepaid Subscribers

Factorization Machines

Use Case: Recommender Systems Based on Customer Ratings in Retail

Ensemble Models

Use Case Study: Churn Model for Telecommunications

Two-stage Models

Use Case: Anti-attrition

Summary

Additional Reading

Chapter 6: Unsupervised Models—Structured Data

Chapter Overview

Clustering

Hierarchical Clustering

Use Case: Product Segmentation

Centroid-based Clustering (k-means Clustering)

Use Case: Customer Segmentation

Self-organizing Maps

Use Case Study: Insolvent Behavior

Cluster Evaluation

Cluster Profiling

Additional Topics

Summary

Additional Reading

Chapter 7: Unsupervised Models—Semi Structured Data

Chapter Overview

Association Rules Analysis

Market Basket Analysis

Confidence and Support Measures

Use Case: Product Bundle Example

Expected Confidence and Lift Measures

Association Rules Analysis Evaluation

Use Case: Product Acquisition

Sequence Analysis

Use Case: Next Best Offer

Link Analysis

Use Case: Product Relationships

Path Analysis

Use Case Study: Online Experience

Text Analytics

Use Case Study: Call Center Categorization

Summary

Additional Reading

Chapter 8: Advanced Topics in Unsupervised Models

Chapter Overview

Network Analysis

Network Subgraphs

Network Metrics

Use Case: Social Network Analysis to Reduce Churn in Telecommunications

Network Optimization

Network Algorithms

Use Case: Smart Cities – Improving Commuting Routes

Summary

Chapter 9: Model Assessment and Model Deployment

Chapter Overview

Methods to Evaluate Model Performance

Speed of Training

Speed of Scoring

Business Knowledge

Fit Statistics

Data Splitting

K-fold Cross-validation

Goodness-of-fit Statistics

Confusion Matrix

ROC Curve

Model Evaluation

Model Deployment

Challenger Models

Monitoring

Model Operationalization

Summary

About This Book

What Does This Book Cover?

This book gives an overview of the statistical and machine learning methods used in data science projects, with an emphasis on their applicability to business problem solving. No software is shown, and the mathematical details are kept to a minimum. The book describes the tasks associated with all stages of the analytical life cycle, including data preparation and data exploration, feature engineering and selection, analytical modeling with supervised and unsupervised techniques, and model assessment and deployment. It describes the techniques and illustrates them with real-world case studies. Readers will learn the most important techniques and methods related to data science and when to apply them to different business problems. The book provides a comprehensive overview of the statistical and machine learning techniques associated with data science initiatives and guides readers through the necessary steps to successfully deploy data science projects.

This book covers the most important data science skills, the different types of data science applications, the phases in the data science lifecycle, the techniques assigned to the data preparation steps for data science, some of the most common techniques associated with supervised machine learning models (linear and logistic regression, decision tree, forest, gradient boosting, neural networks, support vector machines, and factorization machines), advanced supervised modeling methods like ensemble models and two-stage models, the most important techniques associated with unsupervised machine learning models (clustering, association rules, sequence analysis, link analysis, path analysis, network analysis, and network optimization), the methods and fit statistics used to assess model results, different approaches to deploying analytical models in production, and the main topics related to the model operationalization process.

This book does not cover the techniques for data engineering in depth. It also does not provide any programming code for the supervised and unsupervised models, nor does it show in practice how to deploy models in production.

Is This Book for You?

The audience for this book includes data scientists, data analysts, data engineers, business analysts, market analysts, and computer scientists. However, anyone who wants to learn more about data science skills could benefit from reading this book.

What Are the Prerequisites for This Book?

There are no prerequisites for this book.

We Want to Hear from You

SAS Press books are written by SAS Users for SAS Users. We welcome your participation in their development and your feedback on SAS Press books that you are using. Please visit sas.com/books to do the following:

Recommend a topic

Request information about how to become a SAS Press author

Provide feedback on a book

About These Authors

Dr. Carlos Pinheiro is a Principal Data Scientist at SAS and a Visiting Professor at Data ScienceTech Institute in France. He has been working in analytics since 1996 for some of the largest telecommunications providers in Brazil in multiple roles from technical to executive. He worked as a Senior Data Scientist for EMC in Brazil on network analytics, optimization, and text analytics projects, and as a Lead Data Scientist for Teradata on machine learning projects. Dr. Pinheiro has a BSc in Applied Mathematics and Computer Science, an MSc in Computing, and a DSc in Engineering from the Federal University of Rio de Janeiro. Carlos has completed a series of postdoctoral research terms in different fields, including Dynamic Systems at IMPA, Brazil; Social Network Analysis at Dublin City University, Ireland; Transportation Systems at Université de Savoie, France; Dynamic Social Networks and Human Mobility at Katholieke Universiteit Leuven, Belgium; and Urban Mobility and Multi-modal Traffic at Fundação Getúlio Vargas, Brazil. He has published several papers in international journals and conferences, and he is the author of Social Network Analysis in Telecommunications and Heuristics in Analytics: A Practical Perspective of What Influence Our Analytical World, both published by John Wiley & Sons, Inc.

Michael Patetta has been a statistical instructor for SAS since 1994. He teaches a variety of courses including Supervised Machine Learning Procedures Using SAS® Viya® in SAS® Studio, Predictive Modeling Using Logistic Regression, Introduction to Data Science Statistical Methods, and Regression Methods Using SAS Viya. Before coming to SAS, Michael worked in the North Carolina State Health Department for 10 years as a health statistician and program manager. He has authored or co-authored 10 published papers since 1983. Michael has a BA from the University of Notre Dame and an MA from the University of North Carolina at Chapel Hill. In his spare time, he loves to hike in National Parks.

Learn more about these authors by visiting their author pages, where you can download free book excerpts, access example code and data, read the latest reviews, get updates, and more:

http://support.sas.com/pinheiro

http://support.sas.com/patetta

To Daniele, Lucas and Maitê.

Acknowledgments

I joined SAS on December 7th, 2015, but many people believed I had worked for SAS before. Not officially. But I do have a long history with SAS.

I started using SAS in 2002 when I was working for Brasil Telecom, where I created a very active data mining group, developing supervised and unsupervised models across the entire corporation. In 2008, I moved to Dublin, Ireland, to perform a postdoc at Dublin City University. For two years I used SAS for social network analysis. I deployed SNA models at Eircom as a result of my research. After that, I spent six months at SAS Ireland using the brand new OPTGRAPH procedure. I developed some models to detect fraud in auto insurance and taxpayers.

In 2010, I returned to Brazil, and I had the opportunity to create an Analytics Lab at Oi. The Lab focused on developing innovative analytics for marketing, fraud, finance, collecting and engineering. SAS was a big sponsor/partner of it.

At the beginning of 2012, I worked for a few months with SAS Turkey creating some network analysis projects for communications companies, and thereafter I moved to Annecy, France, to perform a postdoc at Université de Savoie. The research was focused on transportation systems, and I used SAS to develop network models. In 2013, I moved to Leuven, Belgium, to perform a postdoc at KU Leuven. The research was focused on dynamic network analysis, and I also used SAS for the model development. Back in Brazil in 2014, I worked as a data scientist for EMC² and Teradata, but most of the time I was still using SAS, sometimes with open-source packages. In 2014/2015, I performed a postdoc at Fundação Getúlio Vargas. The research was focused on human mobility and, guess what, I used SAS.

Finally, thanks to Cat Truxillo, I found my place at SAS. I joined the Advanced Analytics group in Education. I have learned so much working in this group. It was a big challenge to keep up with such brilliant minds. I would like to thank each and every person in the Education group who has taught me over these years, but I would like to name a few of them specifically: Chris Daman, Robert Blanchard, Jeff Thompson, Terry Woodfield, and Chip Wells. To all of you, many thanks!

A special thanks to Jeff Thompson and Tarek Elnaccash for a relentless review. Both were instrumental in getting this book done.

Thanks to Suzanne Morgen for being an amazing editor and walking us through this process so smoothly.

Carlos Andre Reis Pinheiro

The idea for this book originated with Carlos Pinheiro. His experience as a data scientist has always impressed me, and this book highlights many of Carlos’s success stories. Therefore, I would like to thank Carlos for the inspiration for this book. I would like to thank the reviewers, Jeff Thompson, Tarek Elnaccash, and Cat Truxillo, for their diligent work to make the book technically accurate. Finally, I would like to thank Suzanne Morgen, whose edits made the book flow as smoothly as possible.

Michael James Patetta

Foreword

The book you have open in front of you provides a taste of many data science techniques, interspersed with tales of real-world implementations and discoveries. The idea for this book originated when my team and I were designing the SAS Academy for Data Science. We designed a fairly ambitious training and certification program, assuming that people who enroll in the academy would have several years’ experience working with data and analytics before they get started.

In 2015, the SAS Academy for Data Science was launched as a self-paced e-learning program. Designing the academy’s curriculum required research into the state of data science, discussions with faculty training the next generation of data scientists, and shadowing consultants who bring the data to life for their clients. Those topics shift and evolve over time, and the curriculum has evolved with them. Today, the academy is one of the top data science training programs in the world. The curriculum has been adopted by university graduate programs on every continent except Antarctica.

What we have found in practice, however, is that there is a considerably broader audience who want to enroll in the academy, including smart people who have experience in a different area, but do not have the benefit of several years’ data analysis to guide their thinking of how they can apply analytics in their own fields.

For learners like these, where to begin? Carlos Pinheiro and Mike Patetta had the idea to create a short course that provides an overview of data science methods and lots of first-hand experiences as working data scientists.

Carlos Andre Reis de Pinheiro has written extensively in data science, including a Business Knowledge Series course (and later, his book) on Social Network Analysis. It was through this course that Carlos and I started working together. The first thing you notice about Carlos is that he is a born storyteller. The second thing you notice is that he loves soccer—I mean he really, really loves soccer. Over time I got to know more about this soccer-crazy professor who can keep everyone’s attention with amazing stories from his data science research. Carlos has lived and worked in (at least) six different countries, and he is fluent in (at least) four languages. Here is a person with unstoppable curiosity and drive for growth. In 2016, he joined my colleagues and me in the Advanced Analytics Education department at SAS, where he has contributed his relentless hard work and ingenuity to solve business problems with data and analytics. Today he takes a direct, hands-on approach to showing companies what is possible with some data management elbow-grease, some well-trained models, and curiosity.

Mike Patetta has been a friend and colleague for over 20 years. In fact, he was the first person who interviewed me, in 1999, when I applied to work at SAS. Mike has a natural gift for educating others. He is someone who can dive into an unfamiliar topic in statistics and distill a shelf-full of books and journal articles down to a few learner-friendly hour-long lectures. The partnership between these two authors resulted in a course—and now, a book—that is rich with detailed information, written in an easy, comfortable style, with ample use cases from the authors’ own experiences.

Data science is fun, or that’s what recruiters would have you believe. Data science entails coaxing patterns, meanings, and insights from large and diverse volumes of messy data. In practice, that means spending more time than you might like on getting access to data, determining what is in a record, how records are represented in files, how the file is structured, and how to combine the information in a meaningful way with other files. That is, for many of us, most of the work a data scientist does. So where is the fun?

The reward of data science work comes when the data are organized, cleaned, and arranged for analysis. That first batch of visualizations, the feature engineering, the modeling—that is what makes data science such rewarding work. More than almost any other career, data scientists get to ask question after question, the answers leading to subsequent questions. From one day to another, your work can be completely different. You don’t get to tell the data what to say—the data will speak to you, if you have the tools and curiosity to listen.

This book (and its accompanying course) provides a framework for doing project work, the analytics lifecycle. The analytics lifecycle acknowledges and addresses all members of the data science team (IT, computer engineers, statisticians, and executive stakeholders) and makes clear how the work and responsibilities are divided through the entire lifecycle of a data science project. The emphasis of this book is on making sense of data, of models, and of results from deployed models. You might say that the ideal audience for this book is a Citizen Data Scientist (to use Gartner’s term) or a statistical business analyst. This is not a book that teaches about writing scripts to pull
