Beginning Apache Spark 3: With DataFrame, Spark SQL, Structured Streaming, and Spark Machine Learning Library
Ebook · 646 pages · 3 hours


About this ebook

Take a journey toward discovering, learning, and using Apache Spark 3.0. In this book, you will gain expertise on the powerful and efficient distributed data processing engine inside of Apache Spark; its user-friendly, comprehensive, and flexible programming model for processing data in batch and streaming; and the scalable machine learning algorithms and practical utilities to build machine learning applications.

Beginning Apache Spark 3 begins by explaining different ways of interacting with Apache Spark, such as Spark Concepts and Architecture, and Spark Unified Stack. Next, it offers an overview of Spark SQL before moving on to its advanced features. It covers tips and techniques for dealing with performance issues, followed by an overview of the structured streaming processing engine. It concludes with a demonstration of how to develop machine learning applications using Spark MLlib and how to manage the machine learning development lifecycle. This book is packed with practical examples and code snippets to help you master concepts and features immediately after they are covered in each section.

After reading this book, you will have the knowledge required to build your own big data pipelines, applications, and machine learning applications.

What You Will Learn

  • Master the Spark unified data analytics engine and its various components
  • Understand how Spark's components work in tandem to provide a scalable, fault-tolerant, and performant data processing engine
  • Leverage the user-friendly and flexible programming model to perform simple to complex data analytics using DataFrames and Spark SQL
  • Develop machine learning applications using Spark MLlib
  • Manage the machine learning development lifecycle using MLflow

Who This Book Is For

Data scientists, data engineers and software developers.

Language: English
Publisher: Apress
Release date: October 22, 2021
ISBN: 9781484273838

    Book preview

    Beginning Apache Spark 3 - Hien Luu

    © The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2021

    H. Luu, Beginning Apache Spark 3, https://doi.org/10.1007/978-1-4842-7383-8_1

    1. Introduction to Apache Spark

    Hien Luu¹
    (1) San Jose, CA, USA

    There is no better time to learn Apache Spark than now. It has become one of the critical components in the big data stack due to its ease of use, speed, and flexibility. Over the years, it has established itself as the unified engine for multiple workload types, such as big data processing, data analytics, data science, and machine learning. It is widely adopted by companies across many industries, including Facebook, Microsoft, Netflix, and LinkedIn. Moreover, it has steadily improved with each major release.

    The most recent major version of Apache Spark is 3.0, which was released in June 2020, marking Spark's tenth anniversary as an open source project. This release includes enhancements to many areas of Spark. The most notable are the innovative just-in-time performance optimization techniques that speed up Spark applications and reduce the time and effort developers spend tuning them.

    This chapter provides a high-level overview of Spark, including the core concepts, architecture, and the various components inside the Apache Spark stack.

    Overview

    Spark is a general distributed data processing engine built for speed, ease of use, and flexibility. The combination of these three properties is what makes Spark so popular and widely adopted in the industry.

    The Apache Spark website claims that it can run certain data processing jobs up to 100 times faster than Hadoop MapReduce. In fact, in 2014, Spark won the Daytona GraySort contest, which is an industry benchmark to see how fast a system can sort 100TB of data (1 trillion records). The submission from Databricks claimed Spark could sort 100 TB of data three times faster using ten times fewer resources than the previous world record set by Hadoop MapReduce.

    Ease of use has been one of the main focuses of the Spark creators since the inception of the Spark project. It offers more than 80 high-level, commonly needed data processing operators that make it easy for developers, data scientists, and analysts to build all kinds of interesting data applications. In addition, these operators are available in multiple languages: Scala, Java, Python, and R. Software engineers, data scientists, and data analysts can pick their favorite language to solve large-scale data processing problems with Spark.

    In terms of flexibility, Spark offers a single unified data processing stack that can handle multiple types of data processing workloads, including batch applications, interactive queries, machine learning algorithms that require many iterations, and real-time streaming applications that extract actionable insights in near real time. Before Spark existed, each of these workload types required a different solution and technology. Now companies can simply leverage Spark for all their data processing needs, which dramatically reduces operational cost and resource requirements.

    The big data ecosystem consists of many pieces of technology, including the Hadoop Distributed File System (HDFS) for distributed storage, cluster management systems that efficiently manage clusters of machines, and various file formats for storing large amounts of data in binary and columnar form. Spark integrates well with this ecosystem, which is another reason its adoption has been growing at a fast pace.

    Another cool thing about Spark is that it is open source. Anyone can download the source code to examine it, figure out how a certain feature is implemented, and extend its functionality. In some cases, this can dramatically reduce the time it takes to debug problems.

    History

    Spark started as a research project at the University of California, Berkeley, AMPLab in 2009. At that time, the researchers on the project observed the inefficiencies of the Hadoop MapReduce framework in handling interactive and iterative data processing use cases, so they came up with ways to overcome those inefficiencies by introducing ideas like in-memory storage and an efficient way of dealing with fault recovery. Once the research project had proven to be a viable solution that outperformed MapReduce, it was open sourced in 2010 and became an Apache top-level project in 2013.

    Many of the researchers who worked on this project founded a company called Databricks, which raised over $43 million in 2013. Databricks is the primary commercial steward behind Spark. In 2015, IBM announced a major investment in building a Spark technology center to advance Apache Spark by working closely with the open source community and building Spark into the core of the company's analytics and commerce platforms.

    Two popular research papers on Spark are Spark: Cluster Computing with Working Sets (http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf) and Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf). These papers were well received at academic conferences and provide a good foundation for anyone who would like to learn and understand Spark.

    Since its inception, Spark has been a very active open source project with a growing community. The number of contributors has increased to more than 1,000, and Apache Spark meetups count over 200,000 members. The number of Apache Spark contributors has exceeded that of the widely popular Apache Hadoop.

    The creators of Spark picked the Scala programming language for the project due to the combination of Scala's conciseness and static typing. Spark is now considered one of the largest applications written in Scala, and its popularity has certainly helped Scala become a mainstream programming language.

    Spark Core Concepts and Architecture

    Before diving into the details of Spark, it is important to have a high-level understanding of the core concepts and the various core components. This section covers the following.

    Spark clusters

    Resource management system

    Spark applications

    Spark drivers

    Spark executors

    Spark Cluster and Resource Management System

    Spark is essentially a distributed system designed to process large volumes of data efficiently and quickly. This distributed system is typically deployed onto a collection of machines, known as a Spark cluster. A cluster can be as small as a few machines or as large as thousands of machines. According to the Spark FAQ at https://spark.apache.org/faq.html, the world’s largest Spark cluster has more than 8000 machines.

    Companies rely on a resource management system like Apache YARN or Apache Mesos to efficiently and intelligently manage a collection of machines. The two main components in a typical resource management system are the cluster manager and the workers. The cluster manager knows where the workers are located and how much memory and how many CPU cores each one has. One of the cluster manager's main responsibilities is to orchestrate work by assigning it to the workers. Each worker offers resources (memory, CPU, etc.) to the cluster manager and performs the assigned work. An example of this type of work is to launch a particular process and monitor its health. Spark is designed to interoperate easily with these systems. In recent years, most companies adopting big data technologies have had a YARN cluster to run MapReduce jobs or other data processing frameworks like Apache Pig or Apache Hive.

    Startup companies that fully adopt Spark can just use the out-of-the-box Spark cluster manager to manage a set of machines dedicated to performing data processing using Spark.

    Spark Applications

    A Spark application consists of two parts. One is the data processing logic expressed using the Spark APIs, and the other is the driver. The data processing logic can be as simple as a few lines of code performing a few operations to solve a specific data problem, or as complex as training a complicated machine learning model that requires many iterations and runs for many hours to complete. A Spark driver is effectively the central coordinator of a Spark application; it interacts with a cluster manager to figure out which machines to run the data processing logic on. For each of those machines, the driver requests that the cluster manager launch a process known as an executor.

    Another very important job of the Spark driver is managing and distributing Spark tasks to the executors on behalf of the application. If the data processing logic requires the Spark driver to present computed results to a user, the driver coordinates with each Spark executor to collect the computed results and merges them before presenting them to the user. A Spark driver performs these tasks through a component called a SparkSession.
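
    To make this concrete, the following is a minimal sketch (not from the book) of creating a SparkSession in Scala; the application name and the local master URL are illustrative choices.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("MyFirstSparkApp")   // hypothetical application name
      .master("local[*]")           // run locally, using all available CPU cores
      .getOrCreate()

    // The driver's SparkContext is available from the SparkSession.
    val sc = spark.sparkContext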

    Spark Drivers and Executors

    Each Spark executor is a JVM process dedicated to a specific Spark application. The life span of a Spark executor is the duration of its Spark application, which could be minutes or days. There was a conscious design decision not to share a Spark executor between multiple Spark applications. This has the benefit of isolating applications from one another; however, it also means it is not easy to share data between applications without writing it to an external storage system like HDFS.

    In short, Spark employs a master/slave architecture, where the driver is the master and the executors are the slaves. Each of these components runs as an independent process on a Spark cluster. A Spark application consists of one driver and one or more executors. Playing the slave role, a Spark executor does what it is told, which is to execute the data processing logic in the form of tasks. Each task is executed on a separate CPU core, which is how Spark processes data in parallel to speed things up. In addition, each Spark executor is responsible for caching a portion of the data in memory and/or on disk when told to do so by the application logic.

    When launching a Spark application, you can specify the number of executors the application needs, and the amount of memory and the number of CPU cores each executor should have.
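
    As a rough sketch, these resources can also be expressed as configuration properties when building the SparkSession; the property keys below are standard Spark settings, and the values are only illustrative. The same settings are commonly passed to spark-submit as --num-executors, --executor-memory, and --executor-cores.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("ResourceConfigExample")           // hypothetical application name
      .config("spark.executor.instances", "4")    // number of executors
      .config("spark.executor.memory", "4g")      // memory per executor
      .config("spark.executor.cores", "2")        // CPU cores per executor
      .getOrCreate()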

    Figure 1-1 shows interactions between a Spark application and cluster manager.

    Figure 1-1. Interactions between a Spark application and the cluster manager

    Figure 1-2. A Spark cluster that consists of one driver and three executors

    Spark Unified Stack

    Unlike its predecessors, Spark provides a unified data processing engine known as the Spark stack. Like other well-designed systems, this stack is built on a strong foundation called Spark Core, which provides all the necessary functionality to manage and run distributed applications, such as scheduling, coordination, and fault tolerance. In addition, it provides a powerful and generic programming abstraction for data processing called resilient distributed datasets (RDDs). On top of this strong foundation is a collection of libraries, each designed for a specific data processing workload: Spark SQL for interactive data processing, Spark Streaming for real-time data processing, Spark GraphX for graph processing, Spark MLlib for machine learning, and SparkR for running machine learning tasks from the R shell.

    This unified engine brings several important benefits to building the next generation of big data applications. First, applications are simpler to develop and deploy because they use a unified set of APIs and run on a single engine. Second, combining different types of data processing (batch, streaming, etc.) is far more efficient because Spark can run those different sets of APIs over the same data without writing the intermediate data out to storage.

    Finally, the most exciting benefit is that Spark enables brand-new applications made possible due to the ease of composing different sets of data processing types; for example, running interactive queries on the results of machine learning predictions of real-time data streams. An analogy that everyone can relate to is a smartphone, consisting of a powerful camera, cellphone, and GPS device. By combining the functions of these components, smartphones enable innovative applications like Waze, a traffic and navigation application.

    Figure 1-3. Spark unified stack

    Spark Core

    Spark Core is the bedrock of the Spark distributed data processing engine. It consists of two parts: the distributed computing infrastructure and the RDD programming abstraction.

    The distributed computing infrastructure is responsible for distributing, coordinating, and scheduling computing tasks across many machines in the cluster. This makes it possible to perform parallel processing of large volumes of data efficiently and quickly on a large cluster of machines. Two other important responsibilities of the distributed computing infrastructure are handling computing task failures and moving data across machines efficiently, which is known as data shuffling. Advanced Spark users should have intimate knowledge of the Spark distributed computing infrastructure to effectively design high-performance Spark applications.

    The RDD is the key programming abstraction that every Spark user should learn in order to use the various provided APIs effectively. An RDD is a fault-tolerant collection of objects partitioned across a cluster that can be manipulated in parallel. Essentially, it provides a set of APIs for Spark application developers to easily and efficiently perform large-scale data processing without worrying about where the data resides on the cluster or about machine failures. The RDD APIs are exposed in multiple programming languages, including Scala, Java, and Python, and they allow users to pass local functions to run on the cluster, which is very powerful and unique. RDDs are covered in detail in a later chapter.
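
    The following small sketch (not from the book) illustrates the idea of passing a local function to an RDD operation; sc is the SparkContext, and the data is made up for illustration.

    // Distribute a local collection across four partitions of the cluster.
    val numbers = sc.parallelize(1 to 10, numSlices = 4)

    // An ordinary local Scala function...
    def square(x: Int): Int = x * x

    // ...is shipped to the executors and applied to each element in parallel.
    val squares = numbers.map(square)
    println(squares.collect().mkString(", "))   // 1, 4, 9, ..., 100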

    The rest of the components in the Spark stack are designed to run on top of Spark Core. Therefore, any improvement or optimization done in the Spark Core between versions of Spark is automatically available to the other components.

    Spark SQL

    Spark SQL is a module built on top of Spark Core, and it is designed for structured data processing at scale. Its popularity has skyrocketed since its inception because it brings a new level of flexibility, ease of use, and performance.

    Structured Query Language (SQL) has been the lingua franca of data processing because it is easy for users to express their intent, and the execution engine then performs intelligent optimizations. Spark SQL brings that to the world of data processing at the petabyte level. Spark users can now issue SQL queries to perform data processing or use the high-level abstraction exposed through the DataFrame API. A DataFrame is effectively a distributed collection of data organized into named columns. This is not a new idea; it is inspired by data frames in R and Python. An easier way to think about a DataFrame is that it is conceptually equivalent to a table in a relational database.

    Behind the scenes, the Spark SQL Catalyst optimizer performs optimizations commonly done in many analytical database engines.

    Another Spark SQL feature that elevates Spark’s flexibility is the ability to read and write data to and from various structured formats and storage systems, such as JavaScript Object Notation (JSON), comma-separated values (CSV), Parquet or ORC files, relational databases, Hive, and others.
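
    The following is a hedged sketch of how the DataFrame API and SQL queries work together; the file name and column names are assumptions made for illustration.

    // Read a JSON file into a DataFrame; Spark infers the schema.
    val people = spark.read.json("people.json")

    // Register the DataFrame as a temporary view so it can be queried with SQL.
    people.createOrReplaceTempView("people")

    // The same query expressed through SQL and through the DataFrame API.
    val adultsSql = spark.sql("SELECT name, age FROM people WHERE age >= 18")
    val adultsDf  = people.select("name", "age").where("age >= 18")

    // Write the result out in a columnar format such as Parquet.
    adultsDf.write.parquet("adults.parquet")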

    According to the 2021 Spark survey, Spark SQL was the fastest-growing component. This makes sense because Spark SQL enables a wider audience beyond big data engineers, such as data analysts or anyone familiar with SQL, to leverage the power of distributed data processing.

    The motto of Spark SQL is: write less code, read less data, and let the optimizer do the hard work.

    Spark Structured Streaming

    It has been said that data in motion has equal or greater value than historical data. The ability to process data as it arrives has become a competitive advantage for many companies in highly competitive industries. The Spark Structured Streaming module makes it possible to process real-time streaming data from various data sources in a high-throughput and fault-tolerant manner. Data can be ingested from sources like Kafka, Flume, Kinesis, Twitter, HDFS, or a TCP socket.

    Spark's main abstraction for processing streaming data is the discretized stream (DStream), which implements an incremental stream processing model by splitting the input data into small batches (based on a time interval) that are regularly combined with the current processing state to produce new results.

    Stream processing sometimes involves joining with data at rest, and Spark makes it very easy. In other words, combining batch and interactive queries with stream processing can be easily done in Spark due to the unified Spark stack.

    A new scalable and fault-tolerant stream processing engine called Structured Streaming was introduced in Spark 2.1. This engine further simplifies the lives of stream processing application developers by letting them express a streaming computation the same way they would express a batch computation on static data. The engine automatically executes the stream processing logic incrementally and continuously and produces results as new streaming data arrives. Another unique feature of the Structured Streaming engine is its end-to-end, exactly-once guarantee, which makes big data engineers' lives much easier when saving data to a storage system such as a relational database or a NoSQL database.
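
    The following minimal sketch (not from the book) counts words arriving over a TCP socket; the host and port are illustrative. Note how the streaming computation is expressed with the same DataFrame operations used for static data.

    import org.apache.spark.sql.functions._

    // Treat the stream of lines from a socket as an unbounded table.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")   // hypothetical source
      .option("port", "9999")
      .load()

    // Express the computation exactly as you would on a static DataFrame.
    val wordCounts = lines
      .select(explode(split(col("value"), " ")).as("word"))
      .groupBy("word")
      .count()

    // Incrementally and continuously update the counts as new data arrives.
    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()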

    As this new engine matures, it enables a new class of stream processing applications that are easy to develop and maintain.

    According to Reynold Xin, Databricks’ chief architect, the simplest way to perform streaming analytics is not having to reason about streaming.

    Spark MLlib

    MLlib is Spark's machine learning library. It provides more than 50 common machine learning algorithms and abstractions for managing and simplifying many model-building tasks, such as featurization, pipelines for constructing, evaluating, and tuning models, and model persistence to help move models from development to production.

    Starting with Spark 2.0, the MLlib APIs are based on DataFrames to take advantage of the user-friendliness and the many optimizations provided by the Catalyst and Tungsten components in the Spark SQL engine.

    Machine learning algorithms are iterative, meaning they run through many iterations until the desired objective is achieved. Spark makes it extremely easy to implement those algorithms and run them in a scalable manner through a cluster of machines. Commonly used machine learning algorithms such as classification, regression, clustering, and collaborative filtering are available out of the box for data scientists and engineers to use.
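
    As a hedged sketch of the DataFrame-based MLlib API, the following chains a feature transformer and a classifier into a pipeline; the column names and the training DataFrame (trainingDf) are assumptions made for illustration.

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.VectorAssembler

    // Combine raw columns into a single feature vector.
    val assembler = new VectorAssembler()
      .setInputCols(Array("feature1", "feature2"))
      .setOutputCol("features")

    // A classification algorithm available out of the box.
    val lr = new LogisticRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")

    // Chain the stages into a pipeline and fit it to a training DataFrame.
    val pipeline = new Pipeline().setStages(Array(assembler, lr))
    val model = pipeline.fit(trainingDf)   // trainingDf is a hypothetical DataFrame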

    Spark GraphX

    Graph processing operates on a data structure consisting of vertices and the edges connecting them. A graph data structure is often used to represent real-life networks of interconnected entities, such as a professional social network on LinkedIn or a network of connected web pages on the Internet. Spark GraphX is a library that enables graph-parallel computations by providing an abstraction of a directed multigraph with properties attached to each vertex and edge. GraphX includes a collection of common graph processing algorithms, including PageRank, connected components, shortest paths, and others.

    SparkR

    SparkR is an R package that provides a lightweight frontend for using Apache Spark. R is a popular statistical programming language that supports data processing and machine learning tasks. However, R was not designed to handle large datasets that cannot fit on a single machine. SparkR leverages Spark's distributed computing engine to enable large-scale data analysis from the familiar R shell using APIs that many data scientists love.

    Apache Spark 3.0

    The 3.0 release has new features and enhancements to most of the components in the Spark stack. However, about 60% of the enhancements went into Spark SQL and Spark Core components. Query performance optimization was one of the major themes in Spark 3.0, so the bulk of the focus and development was in the Spark SQL component. Based on the TPC-DS 30 TB benchmark done by Databricks, Spark 3.0 is roughly two times faster than Spark 2.4. This section highlights a few notable features that are related to performance optimization.

    Adaptive Query Execution Framework

    As the name suggests, the query execution framework adapts the execution plan at runtime based on the most recent statistics about data size, the number of partitions, and so forth. As a result, Spark can dynamically switch join strategies, automatically optimize skew joins, and adjust the number of partitions. All these intelligent optimizations lead to improving the query performance of Spark applications.
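
    These optimizations are controlled by configuration settings. The following sketch shows the relevant keys (adaptive query execution is not enabled by default in Spark 3.0); the values are illustrative.

    // Enable adaptive query execution and its runtime optimizations.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  // adjust the number of partitions
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            // automatically optimize skew joins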

    Dynamic Partition Pruning (DPP)

    The primary idea behind DPP is simple: avoid reading unnecessary data. It is designed specifically for queries that join fact tables and dimension tables in a star schema. It can dramatically improve join performance by reducing the number of rows in the fact table that need to be joined with the dimension tables, based on the given filtering conditions. Based on a TPC-DS benchmark, this optimization technique speeds up 60% of the queries by 2x to 18x.
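
    The following hedged sketch shows the kind of star schema join DPP targets; the table and column names (sales, dates, store_id, and so on) are assumptions made for illustration. The filter on the small dimension table lets Spark skip reading fact table partitions that cannot match.

    // Only the partitions of the fact table (sales) that match dates in 2021 are read.
    val result = spark.sql("""
      SELECT s.store_id, SUM(s.amount) AS total
      FROM sales s
      JOIN dates d ON s.date_key = d.date_key
      WHERE d.year = 2021
      GROUP BY s.store_id
    """)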

    Accelerator-aware Scheduler

    More and more Spark users are leveraging Spark for both big data processing and machine learning workloads. The latter often need GPUs to speed up model training. This enhancement enables Spark users to describe and request GPU resources for their complex workloads that involve machine learning.
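
    A hedged sketch of requesting GPU resources through these settings follows; the configuration keys are standard Spark 3.0 properties, while the discovery script path is an assumption.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("GpuTrainingJob")                              // hypothetical application name
      .config("spark.executor.resource.gpu.amount", "1")      // GPUs per executor
      .config("spark.task.resource.gpu.amount", "1")          // GPUs per task
      .config("spark.executor.resource.gpu.discoveryScript",
              "/opt/spark/scripts/getGpus.sh")                // hypothetical discovery script
      .getOrCreate()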

    Apache Spark Applications

    Spark is a versatile, fast, and scalable data processing engine. It was designed as a general engine from the beginning and has proven that it can be used to solve many use cases. As a result, many companies in various industries use Spark to solve real-life problems. The following is a small list of applications that have been developed using Spark.

    Customer intelligence application

    Data warehouse solutions

    Real-time streaming solutions

    Recommendation engines

    Log processing

    User-facing services

    Fraud detection

    Spark Example Applications

    In the world of big data processing, the canonical example application is the word count application. This tradition started with the introduction of the MapReduce framework, and since then, every book on big data processing technology has followed this unwritten tradition by including the canonical example. The problem space of the word count application is easy for everyone to understand: all it does is count how many times each word appears in a set of documents, whether that is a chapter of a book or hundreds of terabytes of web pages from the Internet.

    Listing 1-1 is a word count example application in Spark in the Scala language.

    val textFiles = sc.textFile("hdfs://")                    // read the text files in the input folder (path elided)
    val words = textFiles.flatMap(line => line.split(" "))    // tokenize each line into individual words
    val wordTuples = words.map(word => (word, 1))             // pair each word with a count of 1
    val wordCounts = wordTuples.reduceByKey(_ + _)            // sum the counts of each word
    wordCounts.saveAsTextFile("hdfs://")                      // save the result to the output folder (path elided)

    Listing 1-1

    The Word Count Spark Example Application Written in Scala Language

    A lot is going on behind these five lines of code. The first line reads the text files in the specified folder. The second line iterates through each line of each file, tokenizes it into an array of words, and flattens the arrays so there is one word per element. The third line attaches a count of 1 to each word so the occurrences can be summed across all documents. The fourth line sums up the count of each word. Finally, the last line saves the result to the specified folder. Hopefully, this gives you a general sense of how easy it is to use Spark for data processing. Later chapters go into more detail about what each of those lines of code does.

    Apache Spark Ecosystem

    In the realm of big data, innovation doesn’t stand still. As time goes on, the best practices and architectures emerge. The Spark ecosystem is expanding and evolving to address some of the emerging needs in data lakes, helping data scientists be more productive at interacting with the vast amount of data and speeding up the machine learning development life cycle. This section highlights a few of the exciting and recent innovations in the Spark ecosystem.

    Delta Lake

    At this point, most companies recognize the value of data and have some form of strategy to ingest, store, process, and extract insights from it. The idea behind a data lake is to leverage a distributed storage solution to store both structured and unstructured data for various data consumers, such as data scientists, data engineers, and business analysts. To ensure the data in a data lake is usable, there must be oversight of the data catalog, data discovery, data quality, access control, and data consistency semantics. Data consistency semantics presents many challenges, and companies have invented tricks or Band-Aid solutions to deal with them.

    Delta Lake is an open source solution for data consistency semantics that provides an open data storage format with transactional guarantees and schema enforcement and evolution support. Delta Lake is further discussed later.
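
    As a small, hedged taste of the Delta Lake API (it assumes the delta-core library is on the classpath, and the input file and table path are illustrative):

    // Write a DataFrame as a Delta table with transactional guarantees.
    val events = spark.read.json("events.json")
    events.write.format("delta").save("/tmp/delta/events")

    // Read it back; schema enforcement comes from the Delta transaction log.
    val eventsTable = spark.read.format("delta").load("/tmp/delta/events")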

    Koalas

    For years, data scientists have been using the Python pandas library to perform data manipulation in their machine learning–related tasks. The pandas library (https://pandas.pydata.org) is a fast, powerful, flexible, and easy-to-use open source data analysis and manipulation tool built on top of the Python programming language. pandas is widely popular and has become the de facto library for data manipulation due to its powerful and flexible DataFrame abstraction. However, pandas is designed to run on a single machine only. To perform parallel computing in Python, you can explore an open source project called Dask (https://docs.dask.org).

    Koalas marries the best of both worlds, pandas' powerful and flexible DataFrame abstraction and Spark's distributed data processing engine, by implementing the pandas DataFrame API on top of Apache Spark.

    This innovation enables data scientists to leverage their pandas knowledge to interact with much bigger datasets than in the past.

    Koalas version 1.0 was released in June 2020 with 80% coverage of the pandas APIs. Koalas aims to enable data science projects to leverage large datasets instead of being blocked by them.

    MLflow

    The field of machine learning has been around a long time. Recently, it has become more approachable due to advancements in algorithms, ease of access to a large collection of useful datasets such as images and a large corpus of text, and the availability of educational resources. However, applying machine learning to business problems has proven to be a challenge because it is more of a software engineering problem to manage the machine learning life cycle.

    MLflow is an open source project. It was conceived in 2018 to provide a platform to help with managing the machine learning life cycle. It consists of the following components to address the various needs in each step of the life cycle.

    Tracking: records and compares machine learning experiments.

    Projects: provides a consistent format for organizing machine learning projects so that models can be shared and reproduced easily.

    Models: provides a standardized format for packaging machine learning models and a consistent API for working with them, such as loading and deploying them.

    Registry: a model store that hosts machine learning models and tracks their lineage, versions, and deployment state transitions.

    Summary

    Apache Spark has certainly produced many sparks since its inception. It has created much excitement and opportunity in the world of big data. More importantly, it allows you to create many new and innovative big data applications to solve a diverse set of data processing problems.

    The three important properties of Spark to note are ease of use, speed, and flexibility.

    The Spark distributed computing infrastructure employs a master and slave architecture. Each Spark application consists of a driver and one or more executors to process the data in parallel. Parallelism is the key enabler to process massive amounts of data in a short amount of time.

    Spark provides a unified scalable and distributed data processing engine that can be used for batch processing, interactive and exploratory data processing, real-time stream processing, building machine learning models and predictions, and graph processing.

    Spark applications can be written in multiple programming languages, including Scala, Java, Python, or R.

    © The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2021

    H. Luu, Beginning Apache Spark 3, https://doi.org/10.1007/978-1-4842-7383-8_2

    2. Working with Apache Spark

    Hien Luu¹
    (1) San Jose, CA, USA

    When it comes to working with Spark or building Spark applications, there are many options. This chapter describes three common ones: using the Spark shell, submitting a Spark application from the command line, and using a hosted cloud platform called Databricks. The last part of this chapter is geared toward software engineers who want to set up the Apache Spark source code on a local machine to study it and learn how certain features were implemented.

    Downloading and Installation

    To learn or experiment with Spark, it is convenient to have it installed locally on your computer. This way, you can easily try out certain features or test your data processing logic with small datasets. Having Spark installed locally on your laptop lets you learn from anywhere, whether that is your comfortable living room, the beach, or a bar in Mexico.

    Spark is written in Scala.
