Introducing .NET for Apache Spark: Distributed Processing for Massive Datasets

About this ebook

Get started using Apache Spark via C# or F# and the .NET for Apache Spark bindings. This book is an introduction to both Apache Spark and the .NET bindings. Readers new to Apache Spark will get up to speed quickly using Spark for data processing tasks performed against large and very large datasets. You will learn how to combine your knowledge of .NET with Apache Spark to bring massive computing power to bear by distributed processing of extremely large datasets across multiple servers.
This book covers how to get a local instance of Apache Spark running on your developer machine and shows you how to create your first .NET program that uses the Microsoft .NET bindings for Apache Spark. Techniques shown in the book allow you to use Apache Spark to distribute your data processing tasks over multiple compute nodes. You will learn to process data using both batch mode and streaming mode so you can make the right choice depending on whether you are processing an existing dataset or are working against new records in micro-batches as they arrive. The goal of the book is to leave you comfortable bringing the power of Apache Spark to your favorite .NET language.


What You Will Learn
  • Install and configure Spark .NET on Windows, Linux, and macOS 
  • Write Apache Spark programs in C# and F# using the .NET bindings
  • Access and invoke the Apache Spark APIs from .NET with the same high performance as Python, Scala, and R
  • Encapsulate functionality in user-defined functions
  • Transform and aggregate large datasets 
  • Execute SQL queries against files through Apache Hive
  • Distribute processing of large datasets across multiple servers
  • Create your own batch, streaming, and machine learning programs


Who This Book Is For
.NET developers who want to perform big data processing without having to migrate to Python, Scala, or R; and Apache Spark developers who want to run natively on .NET and take advantage of the C# and F# ecosystems
Language: English
Publisher: Apress
Release date: Apr 13, 2021
ISBN: 9781484269923

    Book preview

    Introducing .NET for Apache Spark - Ed Elliott

    Part I: Getting Started

    © Ed Elliott 2021
    E. Elliott, Introducing .NET for Apache Spark, https://doi.org/10.1007/978-1-4842-6992-3_1

    1. Understanding Apache Spark

    Ed Elliott, Sussex, UK

    Apache Spark is a data analytics platform that has made big data accessible and brings large-scale data processing into the reach of every developer. With Apache Spark, it is as easy to read from a single CSV file on your local machine as it is to read from a million CSV files in a data lake.

    An Example

    Let us look at an example. The code in Listings 1-1 (C#) and 1-2 (the F# version) reads from a set of CSV files and counts how many records match a specific condition. The code reads all CSV files in a specific path, so the number of files we read from is practically limitless.

    Although the examples in this chapter are fully functioning samples, they require a working Apache Spark instance, either locally or on a cluster. We cover setting up Apache Spark in Chapter 2 and running .NET for Apache Spark in Chapter 3.

    using System;
    using System.Linq;
    using Microsoft.Spark.Sql;
    using static Microsoft.Spark.Sql.Functions;

    namespace Introduction_CSharp
    {
        class Program
        {
            static void Main(string[] args)
            {
                var path = args.FirstOrDefault();
                var spark = SparkSession
                    .Builder()
                    .GetOrCreate();
                var dataFrame = spark.Read().Option("header", true).Csv(path);
                var count = dataFrame.Filter(Col("name") == "Ed Elliott").Count();
                Console.WriteLine($"There are {count} row(s)");
            }
        }
    }

    Listing 1-1

    Counting how many rows match a filter in one or a million CSV files in C#

    open Microsoft.Spark.Sql

    [<EntryPoint>]
    let main argv =
        let path = argv.[0]
        let spark = SparkSession.Builder().GetOrCreate()
        spark.Read().Option("header", true).Csv(path)
        |> fun dataframe -> dataframe.Filter(Functions.Col("name").EqualTo("Ed Elliott")).Count()
        |> printfn "There are %d row(s)"
        0

    Listing 1-2

    Counting how many rows match a filter in one or a million CSV files in F#

    Executing either of these programs displays the number of rows matching the filter:

    » dotnet run --project ./Listing0-1 /Users/ed/sample-data/1.csv

    There are 1 row(s)

    » dotnet run --project ./Listing0-2 /Users/ed/sample-data/1.csv

    There are 1 row(s)

    Using this for a single file is fine, and the code looks efficient enough; but when the same code can run, as is, across a cluster of many nodes and petabytes of data, and still run efficiently, you can see how powerful Apache Spark can be.
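
    Because the Csv() reader accepts a directory or a glob pattern as well as a single file path, the program in Listing 1-1 can be pointed at a whole folder of CSV files without changing a line of code; the directory below is hypothetical:

    » dotnet run --project ./Listing0-1 /Users/ed/sample-data/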

    The Core Use Cases

    Apache Spark is unique in the world of big data processing in that it allows for data processing and analytics as well as machine learning. Typically, you can use Apache Spark:

    To transform your data as part of your ETL or ELT data pipelines

    To analyze datasets from one small file to petabytes of data across millions of files

    To create machine learning (ML) applications to enable AI

    Transform Your Data

    Apache Spark can read from and write to any file format or database that is supported by the Java Virtual Machine, which means, for example, that we can read from a JDBC connection and write to a file. Apache Spark comes out of the box with the ability to read from a wide range of file formats, such as CSV or Parquet, but you can always reference additional JAR files to add support for further file types; for example, the crealytics spark-excel plugin (https://github.com/crealytics/spark-excel) allows you to read and write XLSX files in Apache Spark.
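
    As a minimal sketch of that idea, the following program reads a table over a JDBC connection and writes it back out as Parquet. The connection string, table name, and driver class are placeholders, and the JDBC driver JAR would need to be made available to Spark (for example, via the --jars option) for this to run:

    using Microsoft.Spark.Sql;

    namespace JdbcToParquet_CSharp
    {
        class Program
        {
            static void Main(string[] args)
            {
                var spark = SparkSession
                    .Builder()
                    .GetOrCreate();

                // The url, dbtable, and driver values are hypothetical;
                // replace them with your own database details
                var customers = spark.Read()
                    .Format("jdbc")
                    .Option("url", "jdbc:postgresql://localhost/sales")
                    .Option("dbtable", "customers")
                    .Option("driver", "org.postgresql.Driver")
                    .Load();

                // Write the rows back out as a Parquet file
                customers.Write().Mode("overwrite").Parquet("customers.parquet");
            }
        }
    }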

    To show how powerful Apache Spark is when processing data, and how it really was built for performance from the ground up, consider a project I worked on where we read a huge Parquet file containing all of the Adobe Clickstream data for a popular international website. The data was a single file holding every user action on the site; for a well-visited website, that file can be multiple gigabytes and contains a whole range of events, including invalid data. My team was tasked with efficiently reading the entire file of millions of rows and retrieving a minimal subset of one specific action. Before Apache Spark, we would likely have loaded the entire file into a database and then filtered out the rows we didn't want, or used a tool such as Microsoft's SSIS, which would also have read the whole file. When we implemented this in Apache Spark, we wrote a filter for the specific row type we wanted, and Apache Spark used predicate pushdown to pass the filter down to the reader of the Parquet file, so invalid rows were filtered out at the earliest opportunity. The project demonstrated a level of performance and ease of use that our team had not witnessed before.

    The code in Listings 1-3 (C#) and 1-4 (F#) demonstrates how to read from a data source, filter the data down to just the rows you require, and write the result out to a new file, all of which is straightforward with Apache Spark.

    using System;
    using Microsoft.Spark.Sql;

    namespace TransformingData_CSharp
    {
        class Program
        {
            static void Main(string[] args)
            {
                var spark = SparkSession
                    .Builder()
                    .GetOrCreate();
                var filtered = spark.Read().Parquet("1.parquet")
                    .Filter(Functions.Col("event_type") == Functions.Lit(999));
                filtered.Write().Mode("overwrite").Parquet("output.parquet");
                Console.WriteLine($"Wrote: {filtered.Count()} rows");
            }
        }
    }

    » dotnet run --project ./Listing0-3

    Wrote: 10 rows

    Listing 1-3

    Reading, filtering, and writing data back out again in C#

    open Microsoft.Spark.Sql
    open System

    [<EntryPoint>]
    let main argv =
        let writeResults (x:DataFrame) =
            x.Write().Mode("overwrite").Parquet("output.parquet")
            printfn "Wrote: %u rows" (x.Count())
        let spark = SparkSession.Builder().GetOrCreate()
        spark.Read().Parquet("1.parquet")
        |> fun p -> p.Filter(Functions.Col("Event_Type").EqualTo(Functions.Lit(999)))
        |> fun filtered -> writeResults filtered
        0 // return an integer exit code

    » dotnet run --project ./Listing0-4

    Wrote: 10 rows

    Listing 1-4

    Reading, filtering, and writing data back out again in F#

    Analyze Your Data

    Apache Spark includes the data analytical abilities you would expect from a database, such as aggregation, windowing, and SQL functions, which you can access through the public API, for example, data.GroupBy(Col("Name")).Count(). Interestingly, you can also write Spark SQL, which means you can use SQL queries to access your data. Spark SQL makes Apache Spark available to a much wider audience, which includes analysts and data scientists as well as developers. The ability to access the power of Apache Spark without needing to learn one of Scala, Python, Java, R, and now C# or F# is a compelling feature.
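
    The listings that follow concentrate on aggregation and on Spark SQL. As a brief sketch of the windowing support mentioned above, the following program numbers the rows within each Name group; it assumes the Window and RowNumber wrappers that the bindings expose in Microsoft.Spark.Sql.Expressions and Microsoft.Spark.Sql.Functions:

    using Microsoft.Spark.Sql;
    using Microsoft.Spark.Sql.Expressions;
    using static Microsoft.Spark.Sql.Functions;

    namespace WindowExample_CSharp
    {
        class Program
        {
            static void Main(string[] args)
            {
                var spark = SparkSession
                    .Builder()
                    .GetOrCreate();

                // Build a small dataset of two names with ids 0..99 each
                var data = spark.Range(100).WithColumn("Name", Lit("Ed"))
                    .Union(spark.Range(100).WithColumn("Name", Lit("Bert")));

                // Number the rows within each Name group, ordered by the id column
                var byName = Window.PartitionBy("Name").OrderBy("id");
                data.WithColumn("rowNumber", RowNumber().Over(byName)).Show();
            }
        }
    }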

    Listings 1-5 and 1-6 show another example in which we generate three datasets, union them together, and then aggregate and display the results in .NET. In Listing 1-7, we produce the same result, but instead of using .NET code, we pass a SQL query to Apache Spark and execute it to create a result set we can use. Note that in some Apache Spark environments, such as Databricks notebooks, we can write just SQL without any application code.

    using System;
    using Microsoft.Spark.Sql;
    using static Microsoft.Spark.Sql.Functions;

    namespace TransformingData_CSharp
    {
        class Program
        {
            static void Main(string[] args)
            {
                var spark = SparkSession
                    .Builder()
                    .GetOrCreate();
                var data = spark.Range(100).WithColumn("Name", Lit("Ed"))
                    .Union(spark.Range(100).WithColumn("Name", Lit("Bert")))
                    .Union(spark.Range(100).WithColumn("Name", Lit("Lillian")));
                var counts = data.GroupBy(Col("Name")).Count();
                counts.Show();
            }
        }
    }

    Listing 1-5

    Create three datasets, union, aggregate, and count in C#

    open Microsoft.Spark.Sql
    open System

    [<EntryPoint>]
    let main argv =
        let spark = SparkSession.Builder().GetOrCreate()
        spark.Range(100L).WithColumn("Name", Functions.Lit("Ed"))
        |> fun d -> d.Union(spark.Range(100L).WithColumn("Name", Functions.Lit("Bert")))
        |> fun d -> d.Union(spark.Range(100L).WithColumn("Name", Functions.Lit("Lillian")))
        |> fun d -> d.GroupBy(Functions.Col("Name")).Count()
        |> fun d -> d.Show()
        0

    Listing 1-6

    Create three datasets, union, aggregate, and count in F#

    Finally, in Listing 1-7, we will use Spark SQL to achieve the same result.

    using System;
    using Microsoft.Spark.Sql;

    namespace TransformingData_SQL
    {
        class Program
        {
            static void Main(string[] args)
            {
                var spark = SparkSession
                    .Builder()
                    .GetOrCreate();
                var data = spark.Sql(@"
                    WITH users
                    AS (
                        SELECT ID, 'Ed' as Name FROM Range(100)
                        UNION ALL
                        SELECT ID, 'Bert' as Name FROM Range(100)
                        UNION ALL
                        SELECT ID, 'Lillian' as Name FROM Range(100)
                    ) SELECT Name, COUNT(*) FROM users GROUP BY Name
                ");
                data.Show();
            }
        }
    }

    Listing 1-7

    Create three datasets, union, aggregate, and count in Spark SQL

    The code that is executed by Apache Spark is the same in all three instances and results in the following output:

    » dotnet run --project ./Listing0-7

    +-------+--------+
    |   Name|count(1)|
    +-------+--------+
    |   Bert|     100|
    |Lillian|     100|
    |     Ed|     100|
    +-------+--------+

    Machine Learning

    The last core use case for Apache Spark is writing machine learning (ML) applications. Today, there are quite a few environments for writing ML applications, such as Scikit-Learn, TensorFlow, and PyTorch. However, the advantage of using Apache Spark for your ML application is that if you already process your data with Apache Spark, you get the same familiar API, and, more importantly, you can reuse your existing infrastructure.

    To see what sort of things you can do in Apache Spark with the ML API, see https://spark.apache.org/docs/latest/ml-guide.html.
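
    As a small taste of what the ML API looks like from .NET, the following sketch turns a couple of text rows into hashed feature vectors. It assumes the Tokenizer and HashingTF feature transformers in Microsoft.Spark.ML.Feature, which are among the ML pieces exposed by the bindings; a real application would feed the resulting features into a training step:

    using Microsoft.Spark.ML.Feature;
    using Microsoft.Spark.Sql;

    namespace MlFeatures_CSharp
    {
        class Program
        {
            static void Main(string[] args)
            {
                var spark = SparkSession
                    .Builder()
                    .GetOrCreate();

                // A tiny DataFrame of free-text documents built with Spark SQL
                var docs = spark.Sql(
                    "SELECT 'the quick brown fox' AS text " +
                    "UNION ALL SELECT 'jumped over the lazy dog'");

                // Split each document into individual words
                var words = new Tokenizer()
                    .SetInputCol("text")
                    .SetOutputCol("words")
                    .Transform(docs);

                // Hash the words into fixed-length feature vectors
                var features = new HashingTF()
                    .SetInputCol("words")
                    .SetOutputCol("features")
                    .SetNumFeatures(1024)
                    .Transform(words);

                features.Show();
            }
        }
    }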

    .NET for Apache Spark

    Apache Spark is written in Scala and runs on the Java Virtual Machine (JVM), but there are a large number of developers whose primary language is C# and, to a lesser extent, F#. The .NET for Apache Spark project aims to bring the full capabilities of Apache Spark to .NET developers. Microsoft started the project as an open source project, developing in the open and accepting pull requests, issues, and feature requests.

    The .NET for Apache Spark project provides an interop layer between your .NET code and the JVM. It works like this: a Java class written in Scala, called the DotnetRunner, creates a TCP socket and then launches your dotnet program, which creates a SparkSession. The SparkSession connects to the TCP socket, forwards requests to the JVM, and returns the responses. You can think of the .NET for Apache Spark library as a proxy between your .NET code and the JVM.
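
    To make that flow a little more concrete, launching an application through the DotnetRunner looks roughly like the following; the jar file name and the application DLL are illustrative and depend on the Apache Spark and worker versions you install (running applications is covered properly in Chapter 3):

    » spark-submit \
        --class org.apache.spark.deploy.dotnet.DotnetRunner \
        --master local \
        microsoft-spark-3-0_2.12-1.0.0.jar \
        dotnet MySparkApp.dll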

    The Microsoft team made an important early decision which affects how we can use Apache Spark from .NET. Apache Spark originally started with what is called the RDD API, which allows users to access the underlying data structure used by Apache Spark. When Apache Spark version 2.0 was released, it included a new DataFrame API. The DataFrame API has several additional benefits, such as the Catalyst query optimizer, which makes it much more efficient to use the DataFrame API than the original RDD API. Letting Apache Spark optimize the query, rather than trying to optimize the calls yourself using the RDD API, is also a lot simpler. The DataFrame API brought performance parity to Python and R, and now .NET: the RDD API was considerably faster from Scala or Java than it was from Python, but with the DataFrame API, Python and R code is, in most cases, just as fast as Scala and Java code.
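
    One way to see the Catalyst optimizer at work from .NET is to ask a DataFrame for its query plan before anything executes. The sketch below reads a hypothetical people.parquet file and calls Explain(true), which prints the logical and physical plans; for a Parquet source, the filter typically shows up in the scan as a pushed-down predicate:

    using Microsoft.Spark.Sql;
    using static Microsoft.Spark.Sql.Functions;

    namespace CatalystPlan_CSharp
    {
        class Program
        {
            static void Main(string[] args)
            {
                var spark = SparkSession
                    .Builder()
                    .GetOrCreate();

                // people.parquet is a hypothetical input file
                var adults = spark.Read().Parquet("people.parquet")
                    .Filter(Col("age") > 18)
                    .Select("name", "age");

                // Print the plans that Catalyst produces without running the query
                adults.Explain(true);
            }
        }
    }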

    The Microsoft team decided to provide support only for the new DataFrame API, which means it isn't possible, today, to use the RDD API from .NET for Apache Spark. I honestly do not see this as a significant issue, and it certainly is not a blocker for the adoption of .NET for Apache Spark. The decision to support only the newer API flows through to the ML library as well: Apache Spark has two ML APIs, MLLib and ML, and because the Apache Spark team deprecated MLLib in favor of the ML library, .NET for Apache Spark also implements only the ML version of the API.

    Feature Parity

    The .NET for Apache Spark project was first released to the public in April 2019 and included a lot of the core functionality available in Apache Spark. However, quite a lot of functionality was missing, even from the DataFrame API, and that is ignoring the APIs that are unlikely ever to be implemented, such as the RDD API. Since the initial release, the Microsoft team and outside contributors have added more functionality. In the meantime, the Apache Spark team has also released new functionality, so in some ways, the Microsoft project is playing catch-up with the Apache team, and not all functionality is currently available in the .NET project. Over the last year or so, the gap has been closing, and I fully expect it to keep shrinking until feature parity is reached.

    If you are trying to use the .NET for Apache Spark project and some functionality is missing that is a blocker for you, there are a couple of options that you can take to implement the missing functionality, and I cover this in Appendix B.

    Summary

    Apache Spark is a compelling data processing project that makes it almost too simple to query large distributed datasets. .NET for Apache Spark brings that power to .NET developers, and I, for one, am excited by the possibility of creating ETL, ELT, ML, and all sorts of data processing applications using C# and F#.

    © Ed Elliott 2021
    E. Elliott, Introducing .NET for Apache Spark, https://doi.org/10.1007/978-1-4842-6992-3_2

    2. Setting Up Spark

    Ed Elliott, Sussex, UK

    So that we can develop a .NET for Apache Spark application, we need to install Apache Spark on our development machines and then configure .NET for Apache Spark so that our application executes correctly. When we run our Apache Spark application in production, we will use a cluster, either something like a YARN cluster or a fully managed environment such as Databricks. When we develop applications, we use the same version of Apache Spark locally as we would when running against a cluster of many machines. Having the same version on our development machines means that when we develop and test the code, we can be confident it will behave the same in production.

    In this chapter, we will go through the various components that we need to have running correctly. Apache Spark is a Java application, so we will need to install and configure the correct version of Java and then download and configure Apache Spark. Only when we have the correct versions of Java and Apache Spark running are we able to write a .NET application, in either C# or F#, that executes on Apache Spark.

    Choosing Your Software Versions

    In this section, we are going to start by helping you choose which version of Apache Spark and which version of Java you should use. Even though it seems like it should be a straightforward choice, there are some specific requirements, and getting this correct is critical to getting off to a smooth start.

    Choosing a Version of Apache Spark

    In this section, we will look at how to choose a version of Apache Spark. Apache Spark is an actively developed open source project, and new releases happen often, sometimes even multiple times a month. However, the .NET for Apache Spark project does not support every version, either because support is not planned or because the development team has not yet added it.

    When we run a .NET for Apache Spark application, several pieces have to line up: our .NET code runs on a specific version of the .NET Framework or .NET Core, the .NET for Apache Spark bindings are compatible with a limited set of versions of Apache Spark, and, depending on which version of Apache Spark you have, you will need either Java 8 or Java 11.

    To help choose the version of the components that you need, go to the home page of the .NET for Apache Spark project, https://github.com/dotnet/spark, where there is a Supported Apache Spark section; the current .NET for Apache Spark version, v1.0.0, supports these versions of Apache Spark:

    2.3.*

    2.4.0

    2.4.1

    2.4.3

    2.4.4

    2.4.5

    3.0.0

    Note that 2.4.2 is not supported and that Apache Spark 3.0.0 was supported when .NET for Apache Spark v1.0.0 was released in October 2020. Where possible, you should aim for the highest version of both projects that you can; today, in November 2020, I would start a new project with .NET for Apache Spark v1.0.0 and Apache Spark 3.0. Unfortunately, any concrete advice we write here will quickly get out of date; between writing this chapter and reviewing it, the advice changed from using .NET for Apache Spark v0.12.1 to v1.0.0.

    Once you have selected a version of Apache Spark to use, visit the documentation for that version, such as https://spark.apache.org/docs/3.0.0/. The release notes include details of which versions of the Java VM are supported. If you try to run on a version of the JVM that is not supported, your application will fail, so you do need to take care here.
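
    A quick way to check which JVM is on your path before going any further is to run the following from a terminal and compare the reported version against the release notes:

    » java -version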

    When you download Apache Spark, you have a few options. You can download the source code and compile it by yourself, which we do not cover here, but you can get instructions on how to build from source from https://spark.apache.org/docs/latest/building-spark.html. You can also choose to either
