Machine Learning with Spark - Second Edition
About this ebook
- Get to grips with the latest version of Apache Spark
- Utilize Spark's machine learning library to implement predictive analytics
- Leverage Spark’s powerful tools to load, analyze, clean, and transform your data
If you have a basic knowledge of machine learning and want to implement various machine-learning concepts in the context of Spark ML, this book is for you. You should be well versed in the Scala and Python languages.
Related to Machine Learning with Spark - Second Edition
Related ebooks
- Real-Time Big Data Analytics
- Advanced Machine Learning with Python
- Apache Spark Graph Processing
- Hands-On Machine Learning Recommender Systems with Apache Spark
- Advanced Deep Learning with Python: Design and implement advanced next-generation AI solutions using TensorFlow and PyTorch
- Mastering Machine Learning on AWS: Advanced machine learning in Python using SageMaker, Apache Spark, and TensorFlow
- Python High Performance - Second Edition
- Apache Cassandra Essentials
- Mastering Java for Data Science
- Reinforcement Learning Algorithms with Python: Learn, understand, and develop smart algorithms for addressing AI challenges
- Apache Hive Essentials
- Distributed Computing in Java 9
- Introducing Data Science: Big data, machine learning, and more, using Python tools
- Mastering Spark for Data Science
- Operationalizing Machine Learning Pipelines: Building Reusable and Reproducible Machine Learning Pipelines Using MLOps
- Hadoop Real-World Solutions Cookbook - Second Edition
- Machine Learning Systems: Designs that scale
- Data Science with Python and Dask
- Learning Apache Spark 2
- Mastering Large Datasets with Python: Parallelize and Distribute Your Python Code
- Apache Spark for Data Science Cookbook
- Apache Hive Cookbook
- MLOps Engineering at Scale
- Python Text Processing with NLTK 2.0 Cookbook: LITE
- Practical Machine Learning for Data Analysis Using Python
- Real-World Machine Learning
- Learning Data Mining with Python
- Graph-Powered Machine Learning
Data Modeling & Design For You
- Python Data Analysis - Second Edition
- The Secrets of ChatGPT Prompt Engineering for Non-Developers
- Minding the Machines: Building and Leading Data Science and Analytics Teams
- AI and UX: Why Artificial Intelligence Needs User Experience
- Data Analytics for Beginners: Introduction to Data Analytics
- Supercharge Power BI: Power BI is Better When You Learn To Write DAX
- Programmable Logic Controllers
- DAX Patterns: Second Edition
- Living in Data: A Citizen's Guide to a Better Information Future
- Python: Master the Art of Design Patterns
- Python Data Analysis
- Learn T-SQL Querying: A guide to developing efficient and elegant T-SQL code
- Raspberry Pi: Raspberry Pi Guide On Python & Projects Programming In Easy Steps
- Learning Cypher
- Data Analytics with Python: Data Analytics in Python Using Pandas
- Graph Databases in Action: Examples in Gremlin
- Quality metrics for semantic interoperability in Health Informatics
- The Esri Guide to GIS Analysis, Volume 3: Modeling Suitability, Movement, and Interaction
- Mastering VB.NET: A Comprehensive Guide to Visual Basic .NET Programming
- Principles of Data Science
- Think Like a Data Scientist: Tackle the data science process step-by-step
- Kafka in Action
- What Makes Us Smart: The Computational Logic of Human Cognition
- Data Visualization: a successful design process
- Thinking in Algorithms: Strategic Thinking Skills, #2
- Spreadsheets To Cubes (Advanced Data Analytics for Small Medium Business): Data Science
Title Page
Machine Learning with Spark
Second Edition
Develop intelligent machine learning systems with Spark 2.x
Rajdeep Dua
Manpreet Singh Ghotra
Nick Pentreath
BIRMINGHAM - MUMBAI
Copyright
Machine Learning with Spark
Second Edition
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: February 2015
Second edition: April 2017
Production reference: 1270417
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78588-993-6
www.packtpub.com
Credits
About the Authors
Rajdeep Dua has over 16 years of experience in the cloud and big data space. He worked on the advocacy team for Google's big data tool, BigQuery, and on the Greenplum big data platform at VMware as part of the developer evangelist team. He also worked closely with a team porting Spark to run on VMware's public and private clouds. He has taught Spark and big data at some of the most prestigious tech schools in India: IIIT Hyderabad, ISB, IIIT Delhi, and College of Engineering Pune.
Currently, he leads the developer relations team at Salesforce India. He also works with the data pipeline team at Salesforce, which uses Hadoop and Spark to expose big data processing tools for developers.
He has published Big Data and Spark tutorials at http://www.clouddatalab.com. He has also presented BigQuery and Google App Engine at the W3C conference in Hyderabad (http://wwwconference.org/proceedings/www2011/schedule/www2011_Program.pdf). He led the developer relations teams at Google, VMware, and Microsoft, and he has spoken at hundreds of other conferences on the cloud. Some of the other references to his work can be seen at http://yourstory.com/2012/06/vmware-hires-rajdeep-dua-to-lead-the-developer-relations-in-india/ and http://dl.acm.org/citation.cfm?id=2624641.
His contributions to the open source community include work on Docker, Kubernetes, Android, OpenStack, and Cloud Foundry.
You can connect with him on LinkedIn at https://www.linkedin.com/in/rajdeepd.
Manpreet Singh Ghotra has more than 12 years of experience in software development for both enterprise and big data software. He is currently working on developing a machine learning platform using Apache Spark at Salesforce. He has worked on a sentiment analyzer using the Apache stack and machine learning.
He was part of the machine learning group at one of the largest online retailers in the world, where he worked on transit time calculations and an R-based recommendation system, both using Apache Mahout.
With a master's and a postgraduate degree in machine learning, he has contributed to, and worked with, the machine learning community.
His GitHub profile is https://github.com/badlogicmanpreet and you can find him on LinkedIn at https://in.linkedin.com/in/msghotra.
Nick Pentreath has a background in financial markets, machine learning, and software development. He has worked at Goldman Sachs Group, Inc.; as a research scientist at Cognitive Match Limited, an online ad-targeting start-up in London; and as the lead of the data science and analytics team at Mxit, Africa's largest social network.
He is a cofounder of Graphflow, a big data and machine learning company focused on user-centric recommendations and customer intelligence. He is passionate about combining commercial focus with machine learning and cutting-edge technology to build intelligent systems that learn from data to add value to the bottom line.
Nick is a member of the Apache Spark Project Management Committee.
About the Reviewer
Brian O'Neill is the principal architect at Monetate, Inc. Monetate's personalization platform leverages Spark and machine learning algorithms to process millions of events per second, using real-time context and analytics to create personalized brand experiences at scale. Brian is a perennial DataStax Cassandra MVP and has also won InfoWorld's Technology Leadership award. Previously, he was CTO of Health Market Science (HMS), now a LexisNexis company. He is a graduate of Brown University and holds patents in artificial intelligence and data management.
Prior to this publication, Brian authored a book on distributed computing, Storm Blueprints: Patterns for Distributed Real-time Computation, and contributed to Learning Cassandra for Administrators.
All the thanks in the world to my wife, Lisa, and my sons, Collin and Owen, for their understanding, patience, and support. They know all my shortcomings and love me anyway. Together always and forever, I love you more than you know and more than I will ever be able to express.
www.PacktPub.com
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Customer Feedback
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://goo.gl/5LgUpI.
If you'd like to join our team of regular reviewers, you can e-mail us at customerreviews@packtpub.com. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Table of Contents
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
Getting Up and Running with Spark
Installing and setting up Spark locally
Spark clusters
The Spark programming model
SparkContext and SparkConf
SparkSession
The Spark shell
Resilient Distributed Datasets
Creating RDDs
Spark operations
Caching RDDs
Broadcast variables and accumulators
SchemaRDD
Spark data frame
The first step to a Spark program in Scala
The first step to a Spark program in Java
The first step to a Spark program in Python
The first step to a Spark program in R
SparkR DataFrames
Getting Spark running on Amazon EC2
Launching an EC2 Spark cluster
Configuring and running Spark on Amazon Elastic MapReduce
UI in Spark
Supported machine learning algorithms by Spark
Benefits of using Spark ML as compared to existing libraries
Spark Cluster on Google Compute Engine - DataProc
Hadoop and Spark Versions
Creating a Cluster
Submitting a Job
Summary
Math for Machine Learning
Linear algebra
Setting up the Scala environment in IntelliJ
Setting up the Scala environment on the Command Line
Fields
Real numbers
Complex numbers
Vectors
Vector spaces
Vector types
Vectors in Breeze
Vectors in Spark
Vector operations
Hyperplanes
Vectors in machine learning
Matrix
Types of matrices
Matrix in Spark
Distributed matrix in Spark
Matrix operations
Determinant
Eigenvalues and eigenvectors
Singular value decomposition
Matrices in machine learning
Functions
Function types
Functional composition
Hypothesis
Gradient descent
Prior, likelihood, and posterior
Calculus
Differential calculus
Integral calculus
Lagrange multipliers
Plotting
Summary
Designing a Machine Learning System
What is Machine Learning?
Introducing MovieStream
Business use cases for a machine learning system
Personalization
Targeted marketing and customer segmentation
Predictive modeling and analytics
Types of machine learning models
The components of a data-driven machine learning system
Data ingestion and storage
Data cleansing and transformation
Model training and testing loop
Model deployment and integration
Model monitoring and feedback
Batch versus real time
Data Pipeline in Apache Spark
An architecture for a machine learning system
Spark MLlib
Performance improvements in Spark ML over Spark MLlib
Comparing algorithms supported by MLlib
Classification
Clustering
Regression
MLlib supported methods and developer APIs
Spark Integration
MLlib vision
MLlib versions compared
Spark 1.6 to 2.0
Summary
Obtaining, Processing, and Preparing Data with Spark
Accessing publicly available datasets
The MovieLens 100k dataset
Exploring and visualizing your data
Exploring the user dataset
Count by occupation
Movie dataset
Exploring the rating dataset
Rating count bar chart
Distribution of number ratings
Processing and transforming your data
Filling in bad or missing data
Extracting useful features from your data
Numerical features
Categorical features
Derived features
Transforming timestamps into categorical features
Extract time of day
Text features
Simple text feature extraction
Sparse Vectors from Titles
Normalizing features
Using ML for feature normalization
Using packages for feature extraction
TF-IDF
IDF
Word2Vector
Skip-gram model
Standard scaler
Summary
Building a Recommendation Engine with Spark
Types of recommendation models
Content-based filtering
Collaborative filtering
Matrix factorization
Explicit matrix factorization
Implicit Matrix Factorization
Basic model for Matrix Factorization
Alternating least squares
Extracting the right features from your data
Extracting features from the MovieLens 100k dataset
Training the recommendation model
Training a model on the MovieLens 100k dataset
Training a model using Implicit feedback data
Using the recommendation model
ALS Model recommendations
User recommendations
Generating movie recommendations from the MovieLens 100k dataset
Inspecting the recommendations
Item recommendations
Generating similar movies for the MovieLens 100k dataset
Inspecting the similar items
Evaluating the performance of recommendation models
ALS Model Evaluation
Mean Squared Error
Mean Average Precision at K
Using MLlib's built-in evaluation functions
RMSE and MSE
MAP
FP-Growth algorithm
FP-Growth Basic Sample
FP-Growth Applied to MovieLens Data
Summary
Building a Classification Model with Spark
Types of classification models
Linear models
Logistic regression
Multinomial logistic regression
Visualizing the StumbleUpon dataset
Extracting features from the Kaggle/StumbleUpon evergreen classification dataset
StumbleUponExecutor
Linear support vector machines
The naive Bayes model
Decision trees
Ensembles of trees
Random Forests
Gradient-Boosted Trees
Multilayer perceptron classifier
Extracting the right features from your data
Training classification models
Training a classification model on the Kaggle/StumbleUpon evergreen classification dataset
Using classification models
Generating predictions for the Kaggle/StumbleUpon evergreen classification dataset
Evaluating the performance of classification models
Accuracy and prediction error
Precision and recall
ROC curve and AUC
Improving model performance and tuning parameters
Feature standardization
Additional features
Using the correct form of data
Tuning model parameters
Linear models
Iterations
Step size
Regularization
Decision trees
Tuning tree depth and impurity
The naive Bayes model
Cross-validation
Summary
Building a Regression Model with Spark
Types of regression models
Least squares regression
Decision trees for regression
Evaluating the performance of regression models
Mean Squared Error and Root Mean Squared Error
Mean Absolute Error
Root Mean Squared Log Error
The R-squared coefficient
Extracting the right features from your data
Extracting features from the bike sharing dataset
Training and using regression models
BikeSharingExecutor
Training a regression model on the bike sharing dataset
Generalized linear regression
Decision tree regression
Ensembles of trees
Random forest regression
Gradient boosted tree regression
Improving model performance and tuning parameters
Transforming the target variable
Impact of training on log-transformed targets
Tuning model parameters
Creating training and testing sets to evaluate parameters
Splitting data for Decision tree
The impact of parameter settings for linear models
Iterations
Step size
L2 regularization
L1 regularization
Intercept
The impact of parameter settings for the decision tree
Tree depth
Maximum bins
The impact of parameter settings for the Gradient Boosted Trees
Iterations
MaxBins
Summary
Building a Clustering Model with Spark
Types of clustering models
k-means clustering
Initialization methods
Mixture models
Hierarchical clustering
Extracting the right features from your data
Extracting features from the MovieLens dataset
K-means - training a clustering model
Training a clustering model on the MovieLens dataset
K-means - interpreting cluster predictions on the MovieLens dataset
Interpreting the movie clusters
K-means - evaluating the performance of clustering models
Internal evaluation metrics
External evaluation metrics
Computing performance metrics on the MovieLens dataset
Effect of iterations on WSSSE
Bisecting KMeans
Bisecting K-means - training a clustering model
WSSSE and iterations
Gaussian Mixture Model
Clustering using GMM
Plotting the user and item data with GMM clustering
GMM - effect of iterations on cluster boundaries
Summary
Dimensionality Reduction with Spark
Types of dimensionality reduction
Principal components analysis
Singular value decomposition
Relationship with matrix factorization
Clustering as dimensionality reduction
Extracting the right features from your data
Extracting features from the LFW dataset
Exploring the face data
Visualizing the face data
Extracting facial images as vectors
Loading images
Converting to grayscale and resizing the images
Extracting feature vectors
Normalization
Training a dimensionality reduction model
Running PCA on the LFW dataset
Visualizing the Eigenfaces
Interpreting the Eigenfaces
Using a dimensionality reduction model
Projecting data using PCA on the LFW dataset
The relationship between PCA and SVD
Evaluating dimensionality reduction models
Evaluating k for SVD on the LFW dataset
Singular values
Summary
Advanced Text Processing with Spark
What's so special about text data?
Extracting the right features from your data
Term weighting schemes
Feature hashing
Extracting the tf-idf features from the 20 Newsgroups dataset
Exploring the 20 Newsgroups data
Applying basic tokenization
Improving our tokenization
Removing stop words
Excluding terms based on frequency
A note about stemming
Feature Hashing
Building a tf-idf model
Analyzing the tf-idf weightings
Using a tf-idf model
Document similarity with the 20 Newsgroups dataset and tf-idf features
Training a text classifier on the 20 Newsgroups dataset using tf-idf
Evaluating the impact of text processing
Comparing raw features with processed tf-idf features on the 20 Newsgroups dataset
Text classification with Spark 2.0
Word2Vec models
Word2Vec with Spark MLlib on the 20 Newsgroups dataset
Word2Vec with Spark ML on the 20 Newsgroups dataset
Summary
Real-Time Machine Learning with Spark Streaming
Online learning
Stream processing
An introduction to Spark Streaming
Input sources
Transformations
Keeping track of state
General transformations
Actions
Window operators
Caching and fault tolerance with Spark Streaming
Creating a basic streaming application
The producer application
Creating a basic streaming application
Streaming analytics
Stateful streaming
Online learning with Spark Streaming
Streaming regression
A simple streaming regression program
Creating a streaming data producer
Creating a streaming regression model
Streaming K-means
Online model evaluation
Comparing model performance with Spark Streaming
Structured Streaming
Summary
Pipeline APIs for Spark ML
Introduction to pipelines
DataFrames
Pipeline components
Transformers
Estimators
How pipelines work
Machine learning pipeline with an example
StumbleUponExecutor
Summary
Preface
In recent years, the volume of data being collected, stored, and analyzed has exploded, in particular in relation to activity on the Web and on mobile devices, as well as data from the physical world collected via sensor networks. While large-scale data storage, processing, analysis, and modeling were previously the domain of the largest institutions, such as Google, Yahoo!, Facebook, Twitter, and Salesforce, many organizations are increasingly faced with the challenge of handling massive amounts of data.
When faced with this quantity of data and the common requirement to utilize it in real time, human-powered systems quickly become infeasible. This has led to a rise in so-called big data and machine learning systems that learn from this data to make automated decisions.
In answer to the challenge of dealing with ever larger-scale data without prohibitive cost, new open source technologies emerged at companies such as Google, Yahoo!, Amazon, and Facebook, aimed at making it easier to handle massive data volumes by distributing data storage and computation across a cluster of computers.
The most widespread of these is Apache Hadoop, which made it significantly easier and cheaper to both store large amounts of data (via the Hadoop Distributed File System, or HDFS) and run computations on this data (via Hadoop MapReduce, a framework to perform computation tasks in parallel across many nodes in a computer cluster).
However, MapReduce has some important shortcomings, including high overheads to launch each job and reliance on storing intermediate data and results of the computation to disk, both of which make Hadoop relatively ill-suited for use cases of an iterative or low-latency nature. Apache Spark is a new framework for distributed computing that is designed from the ground up to be optimized for low-latency tasks and to store intermediate data and results in memory, thus addressing some of the major drawbacks of the Hadoop framework. Spark provides a clean, functional, and easy-to-understand API to write applications, and is fully compatible with the Hadoop ecosystem.
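To make the programming model concrete, the classic MapReduce word count can be sketched in a few lines of plain Python (an illustration of the three phases only, not Hadoop code; the input lines are invented):

```python
# A minimal, pure-Python sketch of the MapReduce word-count flow
# (illustrative only -- real Hadoop jobs run these same three phases
# in parallel across many nodes).
from collections import defaultdict

lines = ["spark makes iterative jobs fast",
         "hadoop makes batch jobs cheap"]

# Map phase: emit a (word, 1) pair for every word
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group the emitted values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum the counts for each word
counts = {word: sum(values) for word, values in groups.items()}
print(counts["jobs"])  # "jobs" appears once in each input line, so 2
```

In Hadoop, the intermediate results between these phases are written to disk; Spark's ability to keep them in memory is precisely what makes it better suited to iterative and low-latency workloads.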
Furthermore, Spark provides native APIs in Scala, Java, Python, and R. The Scala and Python APIs allow all the benefits of the Scala or Python language, respectively, to be used directly in Spark applications, including using the relevant interpreter for real-time, interactive exploration. Spark itself now provides a toolkit (Spark MLlib in 1.6 and Spark ML in 2.0) of distributed machine learning and data mining models that is under heavy development and already contains high-quality, scalable, and efficient algorithms for many common machine learning tasks, some of which we will delve into in this book.
Applying machine learning techniques to massive datasets is challenging, primarily because most well-known machine learning algorithms are not designed for parallel architectures. In many cases, designing such algorithms is not an easy task. The nature of machine learning models is generally iterative, hence the strong appeal of Spark for this use case. While there are many competing frameworks for parallel computing, Spark is one of the few that combines speed, scalability, in-memory processing, and fault tolerance with ease of programming and a flexible, expressive, and powerful API design.
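To see why the iterative nature of these models matters, consider a toy gradient-descent loop in plain Python (a sketch with invented data; on a cluster, the data set re-scanned on every pass is exactly what Spark would cache in memory):

```python
# Toy gradient descent for a least-squares fit y = w * x (invented data).
# Each iteration re-scans the full data set; keeping that data in memory
# across passes, as Spark does, avoids re-reading it from disk every time.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # (x, y) pairs, roughly y = 2x

w = 0.0    # initial weight
lr = 0.05  # learning rate
for _ in range(100):
    # gradient of the mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad

print(round(w, 2))  # converges to about 2.04
```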
Throughout this book, we will focus on real-world applications of machine learning technology. While we may briefly delve into some theoretical aspects of machine learning algorithms and required maths for machine learning, the book will generally take a practical, applied approach with a focus on using examples and code to illustrate how to effectively use the features of Spark and MLlib, as well as other well-known and freely available packages for machine learning and data analysis, to create a useful machine learning system.
What this book covers
Chapter 1, Getting Up and Running with Spark, shows how to install and set up a local development environment for the Spark framework, as well as how to create a Spark cluster in the cloud using Amazon EC2. The Spark programming model and API will be introduced and a simple Spark application will be created using Scala, Java, and Python.
Chapter 2, Math for Machine Learning, provides a mathematical introduction to machine learning. Understanding math and many of its techniques is important to get a good hold on the inner workings of the algorithms and to get the best results.
Chapter 3, Designing a Machine Learning System, presents an example of a real-world use case for a machine learning system. We will design a high-level architecture for an intelligent system in Spark based on this illustrative use case.
Chapter 4, Obtaining, Processing, and Preparing Data with Spark, details how to go about obtaining data for use in a machine learning system, in particular from various freely and publicly available sources. We will learn how to process, clean, and transform the raw data into features that may be used in machine learning models, using available tools, libraries, and Spark's functionality.
Chapter 5, Building a Recommendation Engine with Spark, deals with creating a recommendation model based on the collaborative filtering approach. This model will be used to recommend items to a given user, as well as create lists of items that are similar to a given item. Standard metrics to evaluate the performance of a recommendation model will be covered here.
Chapter 6, Building a Classification Model with Spark, details how to create a model for binary classification, as well as how to utilize standard performance-evaluation metrics for classification tasks.
Chapter 7, Building a Regression Model with Spark, shows how to create a model for regression, extending the classification model created in Chapter 6, Building a Classification Model with Spark. Evaluation metrics for the performance of regression models will be detailed here.
Chapter 8, Building a Clustering Model with Spark, explores how to create a clustering model and how to use related evaluation methodologies. You will learn how to analyze and visualize the clusters that are generated.
Chapter 9, Dimensionality Reduction with Spark, takes us through methods to extract the underlying structure from, and reduce the dimensionality of, our data. You will learn some common dimensionality-reduction techniques and how to apply and analyze them. You will also see how to use the resulting data representation as an input to another machine learning model.
Chapter 10, Advanced Text Processing with Spark, introduces approaches to deal with large-scale text data, including techniques for feature extraction from text and dealing with the very high-dimensional features typical in text data.
Chapter 11, Real-Time Machine Learning with Spark Streaming, provides an overview of Spark Streaming and how it fits in with the online and incremental learning approaches to apply machine learning on data streams.
Chapter 12, Pipeline APIs for Spark ML, covers the pipeline APIs: a uniform set of APIs built on top of DataFrames that helps you create and tune machine learning pipelines.
What you need for this book
Throughout this book, we assume that you have some basic experience with programming in Scala or Python and have some basic knowledge of machine learning, statistics, and data analysis.
Who this book is for
This book is aimed at entry-level to intermediate data scientists, data analysts, software engineers, and practitioners involved in machine learning or data mining with an interest in large-scale machine learning approaches, but who are not necessarily familiar with Spark. You may have some experience of statistics or machine learning software (perhaps including MATLAB, scikit-learn, Mahout, R, Weka, and so on) or distributed systems (including some exposure to Hadoop).
Conventions
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: Spark places user scripts to run Spark in the bin directory.
A block of code is set as follows:
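For example (an illustrative stand-in with invented values, shown only to demonstrate the formatting):

```python
# Illustrative only: compute the mean of a small list of ratings
ratings = [3.0, 4.5, 5.0, 2.5]
mean_rating = sum(ratings) / len(ratings)
print(mean_rating)  # 3.75
```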
Any command-line input or output is written as follows:
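For example (again an illustrative stand-in; the file and directory names are invented):

```shell
# Illustrative only: package a small project directory and list the archive
mkdir -p sample-project
echo "print('hello, spark')" > sample-project/app.py
tar -czf sample-project.tar.gz sample-project
tar -tzf sample-project.tar.gz
```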
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: These can be obtained from the AWS homepage by clicking Account | Security Credentials | Access Credentials.
Warnings or important notes appear in a box like this.
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail feedback@packtpub.com, and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
Log in or register to our website using your e-mail address and password.
Hover the mouse pointer on the SUPPORT tab at the top.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box.
Select the book for which you're looking to download the code files.
Choose from the drop-down menu where you purchased this book from.
Click on Code Download.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Machine-Learning-with-Spark-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Downloading the color images of this book
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/MachineLearningwithSparkSecondEdition_ColorImages.pdf.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, whether in the text or the code, we would be grateful if you could report it to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at copyright@packtpub.com with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
Questions
If you have a problem with any aspect of this book, you can contact us at questions@packtpub.com, and we will do our best to address the problem.
Getting Up and Running with Spark
Apache Spark is a framework for distributed computing; this framework aims to make it simpler to write programs that run in parallel across many nodes in a cluster of computers or virtual machines. It tries to abstract the tasks of resource scheduling, job submission, execution, tracking, and communication between nodes as well as the low-level operations that are inherent in parallel data processing. It also provides a higher level API to work with distributed data. In this way, it is similar to other distributed processing frameworks such as Apache Hadoop; however, the underlying architecture is somewhat different.
Spark began as a research project at the AMPLab at the University of California, Berkeley (https://amplab.cs.berkeley.edu/projects/spark-lightning-fast-cluster-computing/). The project focused on distributed machine learning algorithms as a core use case. Hence, Spark is designed from the ground up for high performance in applications of an iterative nature, where the same data is accessed multiple times. This performance is achieved primarily by caching datasets in memory, combined with the low latency and overhead of launching parallel computation tasks. Together with other features such as fault tolerance, flexible distributed-memory data structures, and a powerful functional API, Spark has proved to be broadly useful for a wide range of large-scale data processing tasks, over and above machine learning and iterative analytics.
For more information, you can visit:
http://spark.apache.org/community.html
http://spark.apache.org/community.html#history
Performance-wise, Spark is much faster than Hadoop for comparable workloads. Refer to the following graph:
Source: https://amplab.cs.berkeley.edu/wp-content/uploads/2011/11/spark-lr.png
Spark runs in four modes:
The standalone local mode, where all Spark processes are run within the same Java Virtual Machine (JVM) process
The standalone cluster mode, using Spark's own built-in, job-scheduling framework
Using Mesos, a popular open source cluster-computing framework
Using YARN (commonly referred to as NextGen MapReduce), Hadoop's cluster resource manager
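Each of these modes corresponds to a different master URL. As an illustrative sketch (the host names and ports here are placeholders, and in practice the master is usually supplied via the --master option of spark-submit rather than hardcoded), a SparkConf can select the mode programmatically:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local mode: all Spark processes run in a single JVM,
// using as many worker threads as there are cores ("local[*]")
val confLocal = new SparkConf()
  .setAppName("LocalModeExample")
  .setMaster("local[*]")

// Standalone cluster mode: point at the Spark master's URL
val confStandalone = new SparkConf()
  .setAppName("StandaloneExample")
  .setMaster("spark://master-host:7077")

// Mesos uses a mesos:// URL; YARN is selected with simply "yarn"
// e.g. .setMaster("mesos://mesos-host:5050") or .setMaster("yarn")

val sc = new SparkContext(confLocal)
```

The same application code runs unchanged under any of the four modes; only the master URL differs.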
In this chapter, we will do the following:
Download the Spark binaries and set up a development environment that runs in Spark's standalone local mode. This environment will be used throughout the book to run the example code.
Explore Spark's programming model and API using Spark's interactive console.
Write our first Spark program in Scala, Java, R, and Python.
Set up a Spark cluster using Amazon's Elastic Compute Cloud (EC2) platform, which can be used for larger datasets and heavier computational requirements than the local mode can handle.
Set up a Spark cluster using Amazon Elastic MapReduce (EMR)
If you have previous experience in setting up Spark and are familiar with the basics of writing a Spark program, feel free to skip this chapter.
Installing and setting up Spark locally
Spark can be run using the built-in standalone cluster scheduler in the local mode. This means that all the Spark processes are run within the same JVM; effectively, a single, multithreaded instance of Spark. The local mode is very useful for prototyping, development, debugging, and testing. However, this mode can also be useful in real-world scenarios to perform parallel computation across multiple cores on a single computer.
As Spark's local mode is fully compatible with the cluster mode, programs written and tested locally can be run on a cluster with just a few additional steps.
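To illustrate this compatibility, the following is a minimal sketch of a standalone Spark application in Scala (the object name and app name are our own choices). Moving it from local execution to a cluster requires only changing the master URL, or removing the setMaster call and supplying --master to spark-submit instead:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FirstSparkApp {
  def main(args: Array[String]): Unit = {
    // "local[4]" runs Spark locally with 4 worker threads;
    // on a cluster, this would be e.g. "spark://master-host:7077"
    val conf = new SparkConf()
      .setAppName("FirstSparkApp")
      .setMaster("local[4]")
    val sc = new SparkContext(conf)

    // Distribute a small collection across the workers
    // and run a parallel map-reduce computation
    val rdd = sc.parallelize(1 to 100)
    val sum = rdd.map(x => x * 2).reduce(_ + _)
    println(s"Sum of doubled values: $sum") // 10100

    sc.stop()
  }
}
```

Because the RDD operations themselves make no reference to the deployment mode, the body of the program is identical whether it runs on a laptop or on a hundred-node cluster.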
The first step in setting up Spark locally is to download the latest version from http://spark.apache.org/downloads.html, which contains links to download various versions of Spark as well as to obtain the latest source code via GitHub.
The documentation available at http://spark.apache.org/docs/latest/ is a comprehensive resource to learn more about Spark. We highly recommend that you explore it!
Spark needs to be built against a specific version of Hadoop in order to access the Hadoop Distributed File System (HDFS) as well as standard and custom Hadoop input sources; prebuilt packages are available for Cloudera's Hadoop distribution, MapR's Hadoop distribution, and Hadoop 2 (YARN). Unless you wish to build Spark against a specific Hadoop version, we recommend that you download the prebuilt Hadoop 2.7 package from an Apache mirror, at http://d3kbcqa49mib13.cloudfront.net/spark-2.0.2-bin-hadoop2.7.tgz.
Spark requires the Scala programming language (version 2.10.x or 2.11.x at the time of writing this book) in order to run. Fortunately, the prebuilt binary