Exploring Hadoop Ecosystem (Volume 2): Stream Processing

Ebook, 510 pages
About this ebook

The Hadoop ecosystem consists of many components, which can be overwhelming for anyone trying to learn or understand them. This book helps data engineers and architects understand the internals of big data technologies, starting from the basics of HDFS and MapReduce and moving on to Kafka, Spark, and more. There are currently two volumes: Volume 1 mainly covers batch processing, and Volume 2 mainly covers stream processing.
Language: English
Publisher: Lulu.com
Release date: Apr 1, 2021
ISBN: 9781667184500

    Book preview


    Exploring Hadoop Ecosystem (Volume 2)

    Stream Processing

    (Spark Core, Spark SQL, Spark Streaming, Spark Structured Streaming, Kafka)

    Wei Liu

    Exploring Hadoop Ecosystem (Volume 2) Stream Processing

    by Wei Liu

    Copyright © 2020 by Wei Liu

    All rights reserved. No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of the publisher, except in the case of brief quotations embodied in critical reviews and certain other noncommercial uses permitted by copyright law. For permission requests, contact pollux.liu@gmail.com.

    ISBN: 978-1-6671-8450-0

    First Edition

    For the future

    ABOUT THE AUTHOR

    Wei Liu, an enterprise architect focusing on big data, graduated from college in 2002 and has 18 years of experience in software development and architecture. He now lives in Beijing with his family. His email: pollux.liu@gmail.com.

    TABLE OF CONTENTS

    Batch processing vs Stream processing

    Scala

    Overview

    sbt

    Development Environment

    Values and Variables

    Basic Types

    Inheritance Hierarchy

    Conditional Expressions

    Block Expressions

    Loops

    Collections

    Tuple

    Class and Object

    Trait

    Functional Programming

    Currying

    Type Parameters

    Type Checks and Casts

    Pattern Matching

    Implicits

    try/catch/finally

    Packages

    Chapter 1.    Spark

    Overview

    Spark Architecture

    SparkSession

    Spark on YARN

    Deploying Applications

    Interactive Shells

    Building Applications

    RDD

    Partitioning

    DAG

    RDD Dependency

    Spark Execution Model

    DAGScheduler

    TaskScheduler

    Data Locality

    Task Execution

    Shuffle

    Memory Management

    Blocks

    RDD Persistence

    RDD Checkpointing

    Shared Variables

    Chapter 2.    Spark SQL

    Shark

    Spark SQL

    Spark SQL Thrift Server

    Spark SQL CLI

    Dataset and DataFrame

    Catalyst

    Data Source APIs

    Metastore

    Hive Integration

    Local Development Environment

    Spark UI

    Join Types

    Join Implementation

    Performance Tuning

    Chapter 3.    Spark Streaming

    DStream

    DStreamGraph

    JobScheduler

    Receiver-Based Approach

    Direct Approach

    Checkpointing

    Transformations

    Window Operations

    Fault Tolerance

    Kafka Exactly-once Semantics

    Chapter 4.    Spark Structured Streaming

    Programming Model

    Fault Tolerance Semantics

    Word Count Example

    State Management

    Event Time Processing

    Operations

    Window Operations on Event Time

    Watermarking

    Deduplication

    Join

    Kafka Integration

    Micro-batch vs Continuous

    Chapter 5.    Kafka

    Message

    Schema

    Topic and Partitions

    Log Flush

    Log Cleanup

    Replication

    CLI Tools

    Kafka Cluster

    Producer

    Consumer

    Message Delivery Semantics

    Java Application

    Kafka Connect

    Mirroring

    UI Tools

    Data Lake Architecture

    Lambda Architecture

    Kappa Architecture

    Batch processing vs Stream processing

    Today we analyze terabytes and petabytes of data in the Hadoop ecosystem. There are generally two ways of processing that data: batch processing and stream processing. The distinction between the two is one of the most fundamental principles in the big data world.

    Batch Processing

    Batch processing is the processing of a large volume of data all at once. It is most often used when dealing with very large amounts of data, or when data sources are legacy systems that are not capable of delivering data in streams.

    Big data solutions often use long-running batch jobs to filter, aggregate, and prepare the data for analysis. Usually these jobs involve reading source files from scalable storage (like HDFS), processing them, and writing the output to new files in scalable storage. The key requirement of such batch processing engines is the ability to scale out computations, in order to handle a large volume of data. Unlike stream processing, batch processing is expected to have latencies that measure in minutes to hours.
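
    As a rough illustration of this pattern, the sketch below implements such a batch job with Spark (which later chapters cover in depth); the HDFS paths and the status/date columns are invented for the example and are not taken from any real dataset.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    object DailyEventCount {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("DailyEventCount").getOrCreate()
        // read source files from scalable storage (HDFS)
        val events = spark.read.json("hdfs:///data/raw/events/")
        // filter and aggregate the records
        val daily = events.filter(col("status") === "ok").groupBy("date").count()
        // write the output to new files in scalable storage
        daily.write.mode("overwrite").parquet("hdfs:///data/curated/daily_event_counts/")
        spark.stop()
      }
    }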

    A batch processing architecture has the following logical components:

    Data storage

    Typically, a distributed file store that can serve as a repository for high volumes of large files in various formats. Generically, this kind of store is often referred to as a data lake.

    Batch processing

    The high-volume nature of big data often means that solutions must process data files using long-running batch jobs to filter, aggregate, and otherwise prepare the data for analysis. Usually, these jobs involve reading source files, processing them, and writing the output to new files.

    Analytical data store

    Many big data solutions are designed to prepare data for analysis and then serve the processed data in a structured format that can be queried using analytical tools.

    Analysis and reporting

    The goal of most big data solutions is to provide insights into the data through analysis and reporting.

    Orchestration

    With batch processing, some orchestration is typically required to move or copy data between the data storage, batch processing, analytical data store, and reporting layers.

    Stream Processing

    Stream processing is defined as the processing of an unbounded stream of input data, with very short latency requirements, measured in milliseconds or seconds. The incoming data typically arrives in an unstructured or semi-structured format, such as JSON, and has the same processing requirements as batch processing, but with shorter turnaround times to support real-time consumption. Processed data is often written to an analytical data store, which is optimized for analytics and visualization. The processed data can also be ingested directly into the analysis and reporting layer for analysis, business intelligence, and real-time dashboard visualization.

    While batch processing handles a large batch of data at once, stream processing handles individual records or micro-batches of a few records. By building data streams, we can feed data into analytics tools as soon as it is generated and get near-instant results using platforms like Spark Streaming.
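
    As a small illustration, the sketch below uses Spark Structured Streaming (the subject of Chapter 4) to count words arriving on a TCP socket; the host and port are placeholders, and the example is essentially the canonical streaming word count.

    import org.apache.spark.sql.SparkSession

    object StreamingWordCount {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("StreamingWordCount").getOrCreate()
        import spark.implicits._

        // an unbounded stream of lines arriving on a TCP socket
        val lines = spark.readStream
          .format("socket")
          .option("host", "localhost")
          .option("port", 9999)
          .load()

        // the counting logic looks like a batch query, but runs continuously
        val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

        // print the updated counts as new data arrives
        val query = counts.writeStream.outputMode("complete").format("console").start()
        query.awaitTermination()
      }
    }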

    A stream processing architecture has the following logical components:

    Streaming message ingestion

    The architecture must include a way to capture and store streaming messages to be consumed by a stream processing consumer. In simple cases, this could be implemented as a simple data store in which new messages are deposited in a folder. But often the solution requires a message broker, such as Kafka, that acts as a buffer for the messages. The message broker should support scale-out processing and reliable delivery.

    Stream processing

    After capturing streaming messages, the solution must process them by filtering, aggregating, and preparing the data for analysis.

    Analytical data store

    Many big data solutions are designed to prepare data for analysis and then serve the processed data in a structured format that can be queried using analytical tools.

    Analysis and reporting

    The goal of most big data solutions is to provide insights into the data through analysis and reporting.

    Stream processing refers to a method of continuous computation that happens as data is flowing through the system. There are no compulsory time limitations in stream processing. The terms stream processing and real-time processing are often used interchangeably. But stream processing is not a synonym for real-time processing. The term real-time has no precise industry definition. The meaning of real-time varies significantly between businesses. Some use cases, such as online fraud detection, may require processing to complete within milliseconds, but for others multiple seconds or even minutes might be sufficiently fast. Usually, a system is called a real-time system if it has tight deadlines within which a result is guaranteed. In practice real-time systems are extremely hard to implement using common software systems. When we use the term real-time, we mean systems that can respond fast. Here fast can be milliseconds to seconds.

    Scala

    Overview

    Scala, short for scalable language, was created by Martin Odersky and was first released in 2003. The language is so named because it was designed to grow with the demands of its users. We can apply Scala to a wide range of programming tasks, from writing small scripts to building large systems.

    Scala is compiled to class files that we package as JAR files, which are executed by the Java Virtual Machine (JVM). The JVM is a cross-platform runtime engine that executes instructions compiled into Java bytecode. Because Scala compiles to bytecode and runs on the JVM, Scala and Java share a common runtime platform, and Scala interoperates seamlessly with all Java libraries. Scala lets us use all the classes of the Java SDK, as well as our own custom Java classes and Java open source projects.
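
    As a small sketch of this interoperability, the following program calls Java SDK classes (java.time.LocalDate and java.util.ArrayList) directly from Scala; it assumes Scala 2.12 or later so that a Scala lambda can be passed where Java expects a functional interface.

    import java.time.LocalDate
    import java.util.{ArrayList => JArrayList}

    object JavaInterop extends App {
      val today = LocalDate.now()          // a class from the Java SDK
      val lines = new JArrayList[String]() // a plain Java collection
      lines.add("Scala runs on the JVM")
      lines.add(s"and can call Java classes directly, e.g. $today")
      lines.forEach(line => println(line)) // Java 8 forEach driven by a Scala lambda
    }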

    But Scala doesn't just target the Java Virtual Machine. Scala.js is a compiler that compiles Scala source code to equivalent JavaScript code. That lets us write Scala code that we can run in a web browser, or other environments (Chrome plugins, Node.js, etc.) where JavaScript is supported. This enables us to write both the server-side and client-side code of web applications in only one language.

    Scala is a multi-paradigm programming language that supports both functional and object-oriented programming. Scala is a pure object-oriented language in the sense that every value is an object. Scala is also a functional language in the sense that every function is a value.
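
    A small sketch of both ideas at once: even a literal such as 1 is an object with methods, and a function can be stored in a val and passed around like any other value.

    object Paradigms extends App {
      println(1.toString)                       // a literal is an object with methods

      val double: Int => Int = x => x * 2       // a function is a value
      def applyTwice(f: Int => Int, x: Int) = f(f(x))

      println(applyTwice(double, 3))            // 12
      println(List(1, 2, 3).map(double))        // List(2, 4, 6)
    }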

    Scala is strongly statically typed.

    Statically vs Dynamically

    Type checking is the process of verifying and enforcing the constraints of types. It may occur either at compile time or at run time. Many programming languages raise type errors that halt compilation or execution of the program, depending on whether the language is statically or dynamically typed.

    A language is statically-typed if the type of a variable is known at compile-time instead of at run-time. A language is dynamically-typed if the type of a variable is checked during run-time.
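
    For instance, in the following Scala sketch the commented-out line would be rejected at compile time, before the program ever runs, because types are checked statically.

    object StaticTypingDemo {
      val count: Int = 42
      // val wrong: Int = "42"       // does not compile: type mismatch (String found, Int required)
      val parsed: Int = "42".toInt   // an explicit, type-checked conversion is required instead
    }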

    Strongly vs Weakly

    A strongly-typed language is one in which variables are bound to specific data types, and will result in type errors if types do not match up as expected in the expression, regardless of when type checking occurs.

    A weakly-typed language is a language in which variables are not bound to a specific data type; they still have a type, but type safety constraints are lower compared to strongly-typed languages.

    Figure: classification of some programming languages.

    sbt

    We can use several different tools to build our Scala projects, including Ant, Maven, Gradle, and more. But sbt was the first build tool created specifically for Scala and is the most commonly used build tool in the Scala community. It plays a role similar to Maven in the Java world or NuGet in .NET.

    sbt (Scala Build Tool) is a highly interactive build tool that can be used for Scala, Java, and more. It requires Java 8 or later. It provides a parallel execution engine and a configuration system that allow us to design efficient and robust build scripts.

    Build Definition

    A build definition defines a set of subprojects. Each subproject is configured by a sequence of key-value pairs called setting expressions, written in the build.sbt DSL. The build.sbt DSL is a domain-specific language used to construct a DAG of settings.

    A setting expression consists of three parts: a key, an operator, and a body. For example, in name := "hello", name is the key, := is the operator, and "hello" is the body.

    A key is an instance of SettingKey[T], TaskKey[T], or InputKey[T] where T is the expected value type.

    A SettingKey represents a setting, a TaskKey represents a task, and an InputKey represents an input task. A given key always refers to either a task or a setting. A setting in sbt is just a value: it could be the name of the project, or the version of Scala to use. A task is an operation such as compile or package. It may return Unit (Unit is Scala's equivalent of void), or it may return a value related to the task; for example, package is a TaskKey[File] and its value is the jar file it creates. An input task parses user input and produces a task to run.

    Built-in Keys vs Custom Keys

    The built-in keys are just fields in an object called Keys. A build.sbt implicitly has an import sbt.Keys._, so sbt.Keys.name can be referred to as name.

    Each type of key can be defined with its respective creation method: settingKey, taskKey, or inputKey. Each method expects the type of the value associated with the key as well as a description. For example, to define a key for a new task called hello:

    lazy val hello = taskKey[Unit]("An example task")

    Defining settings and tasks

    An sbt build consists of two kinds of entries: settings and tasks. Both produce values, but there are two major differences between them: a setting's value is computed once, when the project is loaded, whereas a task's value is recomputed every time the task is executed; and a task may depend on settings, but a setting cannot depend on a task.

    An example of setting

    lazy val root = (project in file("."))
      .settings(
        name := "hello"
      )

    An example of task

    By default, sbt runs all of the tasks in parallel, but using the dependency tree it can work out what must be sequential and what can be parallel. A task in sbt is Scala code. There isn't any intermediate XML; we just write the code directly in our build configuration.

    // first define a task key
    lazy val hello = taskKey[Unit]("An example task")

    // then implement the task key
    lazy val root = (project in file("."))
      .settings(
        hello := { println("Hello!") }
      )

    An example of input task

    An input task is any task that can accept additional user input before execution.

    // spaceDelimited comes from sbt's built-in parsers
    import complete.DefaultParsers._

    val demo = inputKey[Unit]("A demo input task.")

    lazy val root = (project in file("."))
      .settings(
        demo := {
          // get the result of parsing
          val args: Seq[String] = spaceDelimited("<arg>").parsed
          // Here, we also use the value of the `scalaVersion` setting
          println("The current Scala version is " + scalaVersion.value)
          println("The arguments to demo were:")
          args foreach println
        }
      )

    Multi-project Builds

    A subproject is defined by declaring a lazy val of type Project. For example,

    lazy val util = (project in file("util"))

    lazy val core = (project in file("core"))

    We can keep multiple related subprojects in a single build definition. Each subproject in a build definition has its own source directories and generates its own jar file when we run package.
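
    A common next step, not shown in the example above, is to wire the subprojects together: dependsOn puts one subproject's classes on another's classpath, while aggregate makes tasks run on all subprojects at once. A sketch:

    lazy val util = (project in file("util"))

    lazy val core = (project in file("core"))
      .dependsOn(util)          // core can use classes defined in util

    lazy val root = (project in file("."))
      .aggregate(util, core)    // running compile or test on root runs them on util and core too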

    To factor out common settings across multiple projects, define the settings scoped to ThisBuild.

    ThisBuild / organization := "com.example"
    ThisBuild / version      := "0.1.0-SNAPSHOT"
    ThisBuild / scalaVersion := "2.12.10"

    lazy val core = (project in file("core"))
      .settings(
        // other settings
      )

    lazy val util = (project in file("util"))
      .settings(
        // other settings
      )

    Settings like these, written directly at the top level of the build.sbt file instead of inside a .settings(...) call, are said to be in bare style.

    Another way to factor out common settings across multiple projects is to create a sequence named commonSettings and call settings method on each project.

    lazy val commonSettings = Seq(
      target := { baseDirectory.value / "target2" }
    )

    lazy val core = (project in file("core"))
      .settings(
        commonSettings,
        // other settings
      )

    lazy val util = (project in file("util"))
      .settings(
        commonSettings,
        // other settings
      )

    Setting up a build

    Every project using sbt should have two files: project/build.properties and build.sbt.

    The build.properties file tells sbt which version of sbt to use for our build. While this file can specify several things, it is commonly used only to pin the sbt version. The build.sbt file defines the actual build.
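
    For example, a project/build.properties file usually contains nothing more than a single line pinning the sbt version (the version number below is only an example):

    sbt.version=1.3.13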

    The project directory can contain .scala files that define helper objects and one-off plugins. These .scala files under the project directory are part of the build definition. For example, we can create project/Dependencies.scala to track dependencies in one place.

    import sbt._

    object Dependencies {
      lazy val scalaTest = "org.scalatest" %% "scalatest" % "3.0.8"
    }

    This Dependencies object will be available in build.sbt. To make it easier to use the vals defined in Dependencies.scala, add import Dependencies._ to the build.sbt file.
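
    A sketch of how this might look in build.sbt (the project name here is a placeholder):

    import Dependencies._

    lazy val root = (project in file("."))
      .settings(
        name := "example-project",
        libraryDependencies += scalaTest % Test
      )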

    We can write any Scala code in these .scala files, including top-level classes and objects. The recommended approach is to define most settings in a multi-project build.sbt file and to use project/*.scala files for task implementations or for sharing values, such as keys.

    The project directory is a special directory in an sbt build. It contains definitions that apply to the build itself, not to the artifact that sbt is building. When sbt constructs our project definition, it first reads the .sbt and .scala files in the project directory, which form the build definition for the build project itself. It then uses this to read the .sbt files in the base directory. It compiles everything it finds, and this gives us our build definition. sbt then runs this build and creates our artifact, such as a jar.

    Note that the project directory is itself a recursive structure.

    Development Environment

    The most popular way to work in Scala is using Scala through sbt within an IDE.

    Using sbt on Command Line

    First we need to install Homebrew (brew), if we don't have it yet.

    /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"

    Then run this command to install sbt.

    brew install sbt

    Following convention, here we set up a simple helloworld project. The following command pulls the scala-seed project template from GitHub. When prompted, name the application helloworld. This will create a project called helloworld.

    sbt new scala/scala-seed.g8

    Then cd to the helloworld folder.

    cd helloworld

    The following command will open up the sbt console.

    sbt


    Then type run. We will get the following output.

    [info] Compiling 1 Scala source to /Projects/sbt/helloworld/target/scala-2.13/classes ...

    [info] running example.Hello

    hello

    [success] Total time: 3 s, completed March 8, 2020 1:47:10 AM

    We don't need to install Scala separately; sbt will download Scala for us.

    sbt Directory structure

    When we create the project, the default directory structure is sketched below. After we run sbt to open the sbt console, two target/ folders are generated.

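    Assuming the default scala-seed template, the layout looks roughly like this; the two target/ directories appear only after sbt has run.

    helloworld/                 <- base directory
      build.sbt                 <- the build definition
      project/
        build.properties        <- the sbt version to use
        target/                 <- generated: compiled build definition
      src/
        main/
          scala/example/Hello.scala
        test/
          scala/example/HelloSpec.scala
      target/                   <- generated: compiled classes, packaged jars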

    Base directory is the directory containing the project. Here helloworld/ is the base directory.

    The build definition is described in build.sbt (actually in any file named *.sbt) in the base directory. It holds settings such as the project name, project version, scalaVersion, and libraryDependencies. Note that this file is written in Scala.
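
    A minimal sketch of such a build.sbt, with placeholder values:

    lazy val root = (project in file("."))
      .settings(
        name := "helloworld",
        version := "0.1.0-SNAPSHOT",
        scalaVersion := "2.13.1",
        libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.8" % Test
      )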

    Because build.sbt is itself written in Scala, it must first be built; the project/ directory contains all the files needed for building build.sbt.

    sbt uses the same directory structure as Maven for source files by default: src/main contains anything going to production, while src/test contains anything used for unit testing.

    Generated files (compiled classes, packaged jars, managed files, caches, and documentation) will be written to the target directory by default.

    Every subproject in sbt has a target directory, which is where its compiled artifacts go. In the project directory we can write Scala code to implement build-related tasks and settings; those compiled artifacts go to the project/target directory.

    Visual Studio Code

    Install the Metals extension from the Visual Studio Code Marketplace.

    Then open a directory containing a build.sbt file. The extension activates when a .scala or .sbt file is opened.

    The first time we open Metals in a new workspace it prompts us to import the build. Click Import build to start the installation step.

    Once the import step completes, compilation starts for our open *.scala files.

    Language Server Protocol

    Metals is a Scala Language Server with rich IDE features.

    The Language Server Protocol (LSP) defines the protocol used between an editor or IDE and a language server that provides language features like auto complete, go to definition, find all references etc. The LSP was created by Microsoft to define a common language for programming language analyzers to speak.

    Today, several companies have come together to support its growth, including Codenvy, Red Hat, and Sourcegraph, and the protocol is becoming supported by a rapidly growing list of editor and language communities.

    Adding features like auto complete, go to definition, or documentation on hover for a programming language takes significant effort. Traditionally this work had to be repeated for each development tool, as each tool provides different APIs for implementing the same feature.

    LSP creates the opportunity to reduce the m-times-n complexity problem of providing a high level of support for any programming language in any editor, IDE, or client to a simpler m-plus-n problem. Instead of the traditional practice of building a Python plugin for VSCode, a Python plugin for Sublime Text, a Python plugin for Vim, a Python plugin for Sourcegraph, and so on for every language, LSP allows language communities to concentrate their efforts on a single, high-performing language server that can provide code completion, hover tooltips, jump-to-definition, find-references, and more. Editor and client communities, in turn, can concentrate on building a single, high-performing, intuitive and idiomatic extension that can communicate with any language server to instantly provide deep language support.

    LSP is a win for both language providers and tooling vendors.

    Examples of language servers

    Examples of LSP clients

    Scala REPL

    The easiest way to get started with Scala is by using the Scala REPL, an interactive shell for writing Scala expressions and programs.

    The Scala REPL is a command-line interpreter that we can use as a playground to test our Scala code. To start a REPL session, just type scala at the operating system command line. Behind the scenes, our input is quickly compiled into bytecode, and the bytecode is executed by the JVM.
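
    A short session might look like the following; the exact banner and output format depend on the installed Scala version.

    $ scala
    Welcome to Scala 2.13.1 (OpenJDK 64-Bit Server VM, Java 11.0.6).
    Type in expressions for evaluation. Or try :help.

    scala> val greeting = "Hello, " + "REPL"
    val greeting: String = Hello, REPL

    scala> List(1, 2, 3).map(_ * 2)
    val res0: List[Int] = List(2, 4, 6)

    scala> :quit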

    A read–eval–print loop (REPL), also termed
