Spark: Big Data Cluster Computing in Production
Ebook, 348 pages (3 hours)


About this ebook

Production-targeted Spark guidance with real-world use cases

Spark: Big Data Cluster Computing in Production goes beyond general Spark overviews to provide targeted guidance toward using lightning-fast big-data clustering in production. Written by an expert team well-known in the big data community, this book walks you through the challenges in moving from proof-of-concept or demo Spark applications to live Spark in production. Real use cases provide deep insight into common problems, limitations, challenges, and opportunities, while expert tips and tricks help you get the most out of Spark performance. Coverage includes Spark SQL, Tachyon, Kerberos, MLlib, YARN, and Mesos, with clear, actionable guidance on resource scheduling, database connectors, streaming, security, and much more.

Spark has become the tool of choice for many Big Data problems, with more active contributors than any other Apache Software Foundation project. General introductory books abound, but this book is the first to provide deep insight and real-world advice on using Spark in production. Specific guidance, expert tips, and invaluable foresight make this guide an incredibly useful resource for real production settings.

  • Review Spark hardware requirements and estimate cluster size
  • Gain insight from real-world production use cases
  • Tighten security, schedule resources, and fine-tune performance
  • Overcome common problems encountered using Spark in production

Spark works with other big data tools including MapReduce and Hadoop, and uses languages you already know like Java, Scala, Python, and R. Lightning speed makes Spark too good to pass up, but understanding limitations and challenges in advance goes a long way toward easing actual production implementation. Spark: Big Data Cluster Computing in Production tells you everything you need to know, with real-world production insight and expert guidance, tips, and tricks.

Language: English
Publisher: Wiley
Release date: March 28, 2016
ISBN: 9781119254058

    Book preview

    Spark - Ilya Ganelin

    Introduction

    Apache Spark is a distributed compute framework for easy, at-scale computation. Some refer to it as a compute grid or a compute framework—these terms are also correct, and consistent with the underlying premise that Spark makes it easy for developers to gain access to, and insight into, vast quantities of data.

    Apache Spark was created by Matei Zaharia as a research project at the University of California, Berkeley in 2009. It was donated to the open source community in 2010. In 2013 Spark was accepted into the Apache Software Foundation as an Incubator project, and it graduated to a Top-Level Project (TLP) in 2014, where it remains today.

    Who This Book Is For

    If you’ve picked up this book, we presume that you already have an extended fascination with Apache Spark. The intended audience is a developer, a project lead for a Spark application, or a system administrator (or DevOps engineer) who needs to prepare a developed Spark application for the migration path to a production workflow.

    What This Book Covers

    This book covers various methodologies, components, and best practices for developing and maintaining a production-grade Spark application. That said, we presume that you already have an initial or prospective application scoped for production, as well as a solid foundation in Spark basics.

    How This Book Is Structured

    This book is divided into six chapters, with the aim of imparting the following knowledge:

    • A deep understanding of Spark internals and their implications for the production workflow

    • A set of guidelines and trade-offs for the various configuration parameters that can be used to tune Spark for high availability and fault tolerance

    • A complete picture of a production workflow and the various components necessary to migrate an application into it

    What You Need to Use This Book

    You should understand the basics of development and usage atop Apache Spark. This book will not cover introductory material. Numerous books, forums, and resources already cover that ground and, as such, we assume all readers have basic Spark knowledge or, where lost, will read up on the relevant topics to better understand the material presented in this book.

    The source code for the samples is available for download from the Wiley website at: www.wiley.com/go/sparkbigdataclustercomputing.

    Conventions

    To help you get the most from the text and keep track of what’s happening, we’ve used a number of conventions throughout the book.

    NOTE  Notes indicate notes, tips, hints, tricks, or asides to the current discussion. As for styles in the text:

    We highlight new terms and important words when we introduce them.

    We show code within the text like so: persistence.properties.

    Source Code

    As you work through the examples in this book, you may choose either to type in all the code manually, or to use the source code files that accompany the book. All the source code used in this book is available for download at www.wiley.com. Specifically for this book, the code download is on the Download Code tab at www.wiley.com/go/sparkbigdataclustercomputing.

    You can also search for the book at www.wiley.com by ISBN.

    You can also find the files at https://github.com/backstopmedia/sparkbook.

    NOTE  Because many books have similar titles, you may find it easiest to search by ISBN; this book’s ISBN is 978-1-119-25401-0.

    Once you download the code, just decompress it with your favorite compression tool.

    CHAPTER 1

    Finishing Your Spark Job

    When you scale out a Spark application for the first time, one of the more common problems you will encounter is the application’s inability to simply complete its job. The Apache Spark framework’s ability to scale is tremendous, but it does not come out of the box with those properties. Spark was created, first and foremost, to be a framework that is easy to get started with and use. Once you have developed an initial application, however, you will need to gain deeper knowledge of Spark’s internals and configuration to take the job to the next stage.

    In this chapter we lay the groundwork for getting a Spark application to succeed. We will focus primarily on the hardware and system-level design choices you need to set up and consider before you can work through the various Spark-specific issues to move an application into production.

    We will begin by discussing the various ways you can install a production-grade cluster for Apache Spark, including the scaling considerations for a given workload, the various installation methods, and common setups. Next, we will take a look at the historical origins of Spark in order to better understand its design and to allow you to best judge when it is the right tool for your jobs. Following that, we will take a look at resource management: how memory, CPU, and disk usage come into play when creating and executing Spark applications. Next, we will cover storage capabilities within Spark and its external storage subsystems. Finally, we will conclude with a discussion of how to instrument and monitor a Spark application.

    Installation of the Necessary Components

    Before you can migrate an application written in Apache Spark, you will need an actual cluster to test it on. You can download, compile, and install Spark in a number of different ways (some easier than others), and we’ll cover the primary methods in this chapter.

    Let’s begin by explaining how to configure a native installation, meaning one where only Apache Spark is installed, then we’ll move into the various Hadoop distributions (Cloudera and Hortonworks), and conclude by providing a brief explanation on how to deploy Spark on Amazon Web Services (AWS).

    Before diving too far into the various ways you can install Spark, the obvious question that arises is: What type of hardware should I use for a Spark cluster? We can offer various possible answers to this question, but we’d like to focus on a few resounding truths of the Spark framework rather than prescribing a specific layout.

    It’s important to know that Apache Spark is an in-memory compute grid. Therefore, for maximum efficiency, it is highly recommended that the cluster, as a whole, have enough memory to hold the largest workload (or dataset) it will conceivably consume. We are not saying that you cannot scale a cluster later, but it is always better to plan ahead, especially if you work inside a larger organization where purchase orders might take weeks or months.

    When estimating how much memory you need, understand that the calculation is not one-to-one: for a given 1TB dataset, you will need more than 1TB of memory. This is because when you create Java objects from a dataset, each object is typically much larger than the original data element. Multiply that expansion by the number of objects created for a given dataset and you will have a much more accurate estimate of the amount of memory a system will require to perform a given task.
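
    As a rough illustration, the following sketch turns that reasoning into a back-of-the-envelope estimate. The expansion factor and overhead fraction here are assumptions for illustration only; you would replace them with numbers measured from your own data and object model.

        // Back-of-the-envelope cluster memory estimate (illustrative only).
        object MemoryEstimate {
          // objectExpansionFactor: assumed JVM object overhead relative to raw bytes.
          // overheadFraction: assumed headroom for shuffle, execution, and the OS.
          def estimateClusterMemoryGB(rawDataGB: Double,
                                      objectExpansionFactor: Double = 3.0,
                                      overheadFraction: Double = 0.25): Double =
            rawDataGB * objectExpansionFactor * (1.0 + overheadFraction)

          def main(args: Array[String]): Unit = {
            // A 1TB (1,024GB) dataset may require several terabytes of aggregate memory.
            println(f"Estimated cluster memory: ${estimateClusterMemoryGB(1024.0)}%.0f GB")
          }
        }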

    To better attack this problem, the Spark project is, at the time of this writing, working on Project Tungsten, which will greatly reduce the memory overhead of objects by leveraging off-heap memory. You don’t need to know more about Tungsten as you continue reading this book, but this information may apply to future Spark releases, because Tungsten is poised to become the de facto memory management system.

    The second major consideration we want to highlight in this chapter is the number of CPU cores you will need per physical machine when you are determining hardware for Apache Spark. The answer here is less clear-cut: once the data has been loaded into memory, the application is typically either network or CPU bound. That said, the easiest approach is to test your Spark application on a smaller dataset, measure its bounding resource, be it network or CPU, and then plan accordingly from there.

    Native Installation Using a Spark Standalone Cluster

    The simplest way to install Spark is to deploy a Spark Standalone cluster. In this mode, you deploy a Spark binary to each node in a cluster, update a small set of configuration files, and then start the appropriate processes on the master and slave nodes. In Chapter 2, we discuss this process in detail and present a simple scenario covering installation, deployment, and execution of a basic Spark job.

    Because Spark is not tied to the Hadoop ecosystem, this mode does not have any dependencies aside from the Java JDK. Spark currently recommends the Java 1.7 JDK. If you wish to run alongside an existing Hadoop deployment, you can launch the Spark processes on the same machines as the Hadoop installation and configure the Spark environment variables to include the Hadoop configuration.
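
    As a minimal sketch, the following application points at a Standalone master and runs a trivial job to confirm that the cluster accepts work. The master URL and memory setting are placeholders for your own cluster, not prescribed values.

        import org.apache.spark.{SparkConf, SparkContext}

        object StandaloneSmokeTest {
          def main(args: Array[String]): Unit = {
            val conf = new SparkConf()
              .setAppName("standalone-smoke-test")
              .setMaster("spark://master-host:7077")  // hypothetical Standalone master URL
              .set("spark.executor.memory", "4g")     // assumed per-executor memory
            val sc = new SparkContext(conf)
            // A trivial computation that exercises the executors end to end.
            println(sc.parallelize(1 to 1000).sum())
            sc.stop()
          }
        }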

    NOTE  For more on a Cloudera installation of Spark try http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_spark_installation.html. For more on the Hortonworks installation try http://hortonworks.com/hadoop/spark/#section_6. And for more on an Amazon Web Services installation of Spark try http://aws.amazon.com/articles/4926593393724923.

    The History of Distributed Computing That Led to Spark

    We have introduced Spark as a distributed compute framework; however, we haven’t really discussed what this means. Until recently, most computer systems available to both individuals and enterprises were based around single machines. These single machines came in many shapes and sizes and differed dramatically in terms of their performance, as they do today.

    We’re all familiar with the modern ecosystem of personal machines. At the low end, we have tablets and mobile phones. We can think of these as relatively weak, un-networked computers. At the next level we have laptops and desktop computers. These are more powerful machines, with more storage and computational ability, and potentially, with one or more graphics cards (GPUs) that support certain types of massively parallel computations. Next are the machines that some people have networked in their homes, although generally these machines were networked not to share their computational ability, but rather to provide shared storage—for example, to share movies or music across a home network.

    Within most enterprises, the picture today is still much the same. Although the machines used may be more powerful, most of the software they run, and most of the work they do, is still executed on a single machine. This fact limits the scale and the potential impact of the work they can do. Given this limitation, a few select organizations have driven the evolution of modern parallel computing to allow networked systems of computers to do more than just share data, and to collaboratively utilize their resources to tackle enormous problems.

    In the public domain, you may have heard of the SETI@home program from Berkeley or the Folding@home program from Stanford. Both of these programs were early initiatives that let individuals dedicate their machines to solving parts of a massive distributed task. In the former case, SETI@home has been looking for unusual signals, collected via radio telescope, coming from outer space. In the latter, the Stanford program runs a piece of a program that computes permutations of proteins—essentially building molecules—for medical research.

    Because of the size of the data being processed, no single machine, not even the massive supercomputers available in certain universities or government agencies, has had the capacity to solve these problems within the scope of a project or even a lifetime. By distributing the workload to multiple machines, the problem became potentially tractable—solvable in the allotted time.

    As these systems became more mature, and the computer science behind these systems was further developed, many organizations created clusters of machines—coordinated systems that could distribute the workload of a particular problem across many machines to extend the resources available. These systems first grew in research institutions and government agencies, but quickly moved into the public domain.

    Enter the Cloud

    The most well-known offering in this space is of course the proverbial cloud. Amazon introduced AWS (Amazon Web Services), which was later followed by comparable offerings from Google, Microsoft, and others. The purpose of a cloud is to provide users and organizations with scalable clusters of machines that can be started and expanded on demand.

    At about the same time, universities and certain companies were also building their own clusters in-house and continuing to develop frameworks that focused on the challenging problem of parallelizing arbitrary types of tasks and computations. Google was born out of its PageRank algorithm and went on to develop the MapReduce framework, which allowed a general class of problems to be solved in parallel on clusters built with commodity hardware.

    This notion of building algorithms that, while not the most efficient, could be massively parallelized and scaled to thousands of machines drove the next stage of growth in this area. The idea that you could solve massive problems by building clusters, not of supercomputers, but of relatively weak and inexpensive machines, democratized distributed computing.

    Yahoo, in a bid to compete with Google, developed, and later open-sourced under the Apache Foundation, the Hadoop platform—an ecosystem for distributed computing that includes a file system (HDFS), a computation framework (MapReduce), and a resource manager (YARN). Hadoop made it dramatically easier for any organization to not only create a cluster but to also create software and execute parallelizable programs on these clusters that can process huge amounts of distributed data on multiple machines.

    Spark has subsequently evolved as a replacement for MapReduce by building on the idea of creating a framework to simplify the difficult task of writing parallelizable programs that efficiently solve problems at scale. Spark’s primary contribution to this space is that it provides a powerful and simple API for performing complex, distributed operations on distributed data. Users can write Spark programs as if they were writing code for a single machine, but under the hood this work is distributed across a cluster. Secondly, Spark leverages the memory of a cluster to reduce MapReduce’s dependency on the underlying distributed file system, leading to dramatic performance gains. By virtue of these improvements, Spark has achieved a substantial amount of success and popularity and has brought you here to learn more about how it accomplishes this.
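
    To give a flavor of that simplicity, here is the canonical word count written against the RDD API; it reads like single-machine code, yet each step runs across the cluster. The input path is hypothetical.

        import org.apache.spark.{SparkConf, SparkContext}

        object WordCount {
          def main(args: Array[String]): Unit = {
            val sc = new SparkContext(new SparkConf().setAppName("word-count"))
            val counts = sc.textFile("hdfs:///data/corpus/*.txt")  // hypothetical input path
              .flatMap(_.split("\\s+"))                            // split lines into words
              .map(word => (word, 1))                              // pair each word with a count of 1
              .reduceByKey(_ + _)                                  // sum counts per word across the cluster
            counts.take(10).foreach(println)
            sc.stop()
          }
        }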

    Spark is not the right tool for every job. Because Spark is fundamentally designed around the MapReduce paradigm, its focus is on excelling at Extract, Transform, and Load (ETL) operations. This mode of processing is typically referred to as batch processing—processing large volumes of data efficiently in a distributed manner. The downside of batch processing is that it typically introduces larger latencies for any single piece of data. Although Spark developers have been dedicating a substantial amount of effort to improving the Spark Streaming mode, it remains fundamentally limited to computations on the order of seconds. Thus, for truly low-latency, high-throughput applications, Spark is not necessarily the right tool for the job. For a large set of use cases, Spark nonetheless excels at handling typical ETL workloads and provides substantial performance gains (as much as 100 times improvement) over traditional MapReduce.

    Understanding Resource Management

    In the chapter on cluster management you will learn more about how the operating system handles the allocation and distribution of resources amongst the processes on a single machine. However, in a distributed environment, the cluster manager handles this challenge. In general, we primarily focus on three types of resources within the Spark ecosystem. These are disk storage, CPU cores, and memory. Other resources exist, of course, such as more advanced abstractions like virtual memory, GPUs, and potentially different tiers of storage, but in general we don’t need to focus on those within the context of building Spark applications.

    Disk Storage

    The first type of resource, disk, is vital to any Spark application since it stores persistent data, the results of intermediate computations, and system state. When we refer to disk storage, we are referring to data stored on a hard drive of some kind, either the traditional rotating spindle, or newer SSDs and flash memory. Like any other resource, disk is finite. Disk storage is relatively cheap and most systems tend to have an abundance of physical storage, but in the world of big data, it’s actually quite common to use up even this cheap and abundant storage! We tend to enable replication of data for the sake of durability and to support more efficient parallel computation. Also, you’ll usually want to persist frequently used intermediate dataset(s) to disk to speed up long-running jobs. Thus, it generally pays to be cognizant of disk usage, and treat it as any other finite resource.
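
    As a small sketch of that practice (assuming an existing SparkContext sc and an illustrative input path), an intermediate RDD that feeds several downstream jobs can be persisted with a storage level that spills to disk, and released once it is no longer needed:

        import org.apache.spark.storage.StorageLevel

        // Parse once, reuse many times; MEMORY_AND_DISK spills partitions that
        // do not fit in memory rather than recomputing the lineage each time.
        val parsed = sc.textFile("hdfs:///data/events")      // hypothetical input path
          .map(_.split(','))
          .persist(StorageLevel.MEMORY_AND_DISK)

        val dailyCounts  = parsed.map(fields => (fields(0), 1L)).reduceByKey(_ + _).count()  // first job
        val distinctKeys = parsed.map(_(0)).distinct().count()                               // second job reuses the cached data

        parsed.unpersist()  // release the storage once downstream jobs are done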

    Interaction with physical disk storage on a single machine is abstracted away by the file system—a program that provides an API to read and write files. In a distributed environment, where data may be spread across multiple machines, but still needs to be accessed as a single logical entity, a distributed file system fulfills the same role. Managing the operation of the distributed file system and monitoring its state is typically the role of the cluster administrator, who tracks usage, quotas, and re-assigns resources as necessary. Cluster managers such as YARN or Mesos may also regulate access to the underlying file system to better distribute resources between simultaneously executing applications.

    CPU Cores

    The central processing unit (CPU) on a machine is the processor that actually executes all computations. Modern machines tend to have multiple CPU cores, meaning that they can execute multiple processes in parallel. In a cluster, we have multiple machines, each with multiple cores. On a single machine, the operating system handles communication and resource sharing between processes. In a distributed environment, the cluster manager handles the assignment of CPU resources (cores) to individual tasks and applications. In the chapter on cluster management, you’ll learn specifically how YARN and Mesos ensure that multiple applications running in parallel can have access to this pool of available CPUs and share it fairly.

    When building Spark applications, it’s helpful to relate the number of CPU cores to the parallelism of your program, or how many tasks it can execute simultaneously. Spark is based around the resilient distributed dataset (RDD)—an abstraction that treats a distributed dataset as a single entity consisting of multiple partitions. In Spark, a single task processes a single partition of an RDD on a single CPU core.

    Thus, the degree to which your data is partitioned—and the number of available cores—essentially dictates the parallelism of your program. Consider a hypothetical Spark job consisting of five stages, each needing to run 500 tasks: if we only have five CPU cores available, this may take a long time to complete! In contrast, if we have 100 CPU cores available, and the data is sufficiently partitioned, for example into 200 partitions, Spark will be able to parallelize much more effectively, running 100 tasks simultaneously and completing the job much more quickly. By default, Spark only uses two cores with a single executor—thus, when launching a Spark job for the first time, it may unexpectedly take a very long time. We discuss executor and core configuration in the next chapter.
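
    The sketch below (assuming an existing SparkContext sc; the core count and path are illustrative) shows how you might inspect and adjust partitioning so that the partition count is a small multiple of the cores available to the job:

        val coresAvailable = 100                               // cores granted to the job (assumed)
        val rdd = sc.textFile("hdfs:///data/large-input")      // hypothetical input path

        println(s"Initial partitions: ${rdd.partitions.length}")

        // A common rule of thumb is 2-4 partitions per core so that every core
        // stays busy and uneven task durations are amortized across the job.
        val repartitioned = rdd.repartition(coresAvailable * 2)
        println(s"Partitions after repartition: ${repartitioned.partitions.length}")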

    Memory

    Lastly, memory is absolutely critical to almost all Spark applications. Memory is used for internal Spark mechanisms such as the shuffle, and the JVM heap is used to persist RDDs in memory, minimizing disk I/O and providing dramatic performance gains. Spark acquires memory per executor—a worker abstraction that you’ll learn more about in the next chapter. The amount of memory that Spark requests per executor is a configurable parameter and it is the job of
