
Scalable Big Data Architecture: A practitioners guide to choosing relevant Big Data architecture

Ebook · 253 pages · 1 hour


About this ebook

This book highlights the different types of data architecture and illustrates the many possibilities hidden behind the term "Big Data", from the usage of NoSQL databases to the deployment of stream analytics architecture, machine learning, and governance.

Scalable Big Data Architecture covers real-world, concrete industry use cases that leverage complex distributed applications, which involve web applications, RESTful APIs, and a high throughput of large amounts of data stored in highly scalable NoSQL data stores such as Couchbase and Elasticsearch. This book demonstrates how data processing can be done at scale, from the use of NoSQL datastores to their combination with Big Data distributions.

When the data processing is too complex and involves different processing topologies, such as long-running jobs, stream processing, multiple data source correlation, and machine learning, it’s often necessary to delegate the load to Hadoop or Spark and use the NoSQL store to serve processed data in real time.

This book shows you how to choose a relevant combination of Big Data technologies available within the Hadoop ecosystem. It focuses on processing long jobs, architecture, stream data patterns, log analysis, and real-time analytics. Every pattern is illustrated with practical examples, which use different open source projects such as Logstash, Spark, Kafka, and so on.

Traditional data infrastructures are built for digesting and rendering data synthesis and analytics from large amounts of data. This book helps you understand why you should consider using machine learning algorithms early on in the project, before being overwhelmed by the constraints imposed by dealing with the high throughput of Big Data.

Scalable Big Data Architecture is for developers, data architects, and data scientists looking for a better understanding of how to choose the most relevant pattern for a Big Data project and which tools to integrate into that pattern.

Language: English
Publisher: Apress
Release date: Dec 31, 2015
ISBN: 9781484213261


    Book preview

    Scalable Big Data Architecture - Bahaaldine Azarmi

    © Bahaaldine Azarmi 2016

    Bahaaldine Azarmi, Scalable Big Data Architecture, DOI 10.1007/978-1-4842-1326-1_1

    1. The Big (Data) Problem

    Bahaaldine Azarmi, Saint Cloud, France

    Data management is getting more complex than ever before. Big Data is everywhere, on everyone’s mind, and in many different forms: advertising, social graphs, news feeds, recommendations, marketing, healthcare, security, government, and so on.

    In the last three years, thousands of technologies having to do with Big Data acquisition, management, and analytics have emerged; this has left IT teams with the hard task of choosing among them, most of the time without a comprehensive methodology to guide that choice.

    When making such a choice for your own situation, ask yourself the following questions: When should I think about employing Big Data for my IT system? Am I ready to employ it? What should I start with? Should I really go for it despite feeling that Big Data is just a marketing trend?

    All these questions are running around in the minds of most Chief Information Officers (CIOs) and Chief Technology Officers (CTOs), and together they cover the reasons for, and the ways in which, you put your business at stake when you decide to deploy a distributed Big Data architecture.

    This chapter aims to help you identify Big Data symptoms—in other words, when it becomes apparent that you need to consider adding Big Data to your architecture—but it also guides you through the variety of Big Data technologies to differentiate among them so that you can understand what they are specialized for. Finally, at the end of the chapter, we build the foundation of a typical distributed Big Data architecture based on real-life examples.

    Identifying Big Data Symptoms

    You may choose to start a Big Data project based on different needs: because of the volume of data you handle, because of the variety of data structures your system has, because of scalability issues you are experiencing, or because you want to reduce the cost of data processing. In this section, you’ll see what symptoms can make a team realize they need to start a Big Data project.

    Size Matters

    The two main triggers that get people thinking about Big Data are issues related to data size and volume. Although, most of the time, these issues are true and legitimate reasons to think about Big Data, today they are not the only reasons to go this route.

    There are other symptoms that you should also consider—the type of data, for example. How will you manage an increasing variety of data types when traditional data stores, such as SQL databases, expect you to do the structuring up front, like creating tables?

    This is not feasible without adding a flexible, schemaless technology that handles new data structures as they come. When I talk about types of data, you should imagine unstructured data, graph data, images, videos, voice recordings, and so on.
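    To make the schemaless idea concrete, here is a minimal sketch, assuming a local Elasticsearch node and the official elasticsearch-py client (8.x API); the index and field names are hypothetical.

```python
# Minimal sketch: schemaless ingestion into Elasticsearch (hypothetical index/fields).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# No table or schema is declared up front: Elasticsearch derives a mapping
# from the documents it receives (dynamic mapping), so new structures can
# arrive at any time.
es.index(index="events", document={"type": "image", "tags": ["cat", "meme"], "size_kb": 512})
es.index(index="events", document={"type": "tweet", "text": "Big Data everywhere", "lang": "en"})
```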

    Yes, it’s good to store unstructured data, but it’s better if you can get something out of it. Another symptom comes out of this premise: Big Data is also about extracting added-value information from a high volume and variety of data. A couple of years ago, when there were more read transactions than write transactions, common caches or databases were enough when paired with weekly ETL (extract, transform, load) processing jobs. Today that’s not the trend anymore. Now, you need an architecture that is capable of handling data as it comes, with jobs ranging from long-running processing to near real-time processing. The architecture should be distributed and should not rely on rigid, high-performance, and expensive mainframes; instead, it should be based on a more available, performance-driven, and cheaper technology that gives it more flexibility.
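    As an illustration of handling data as it arrives, here is a minimal sketch, assuming a Kafka broker on localhost and the kafka-python package; the topic name is hypothetical.

```python
# Minimal sketch: consuming events as they arrive (hypothetical "clickstream" topic).
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "clickstream",                          # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: v.decode("utf-8"),
)

# Each record is processed as it comes in, instead of waiting for a weekly ETL job.
for record in consumer:
    print(record.value)
```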

    Now, how do you leverage all this added-value data, and how are you able to search for it naturally? To answer this question, think again about the traditional data store in which you create indexes on different columns to speed up the search query. Well, what if you want to index all one hundred columns because you want to be able to execute complex queries that involve a nondeterministic number of key columns? You don’t want to do this with a basic SQL database; instead, you would rather consider using a NoSQL store for this specific need.
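    As a sketch of that kind of search, the following query reuses the hypothetical "events" index from the earlier Elasticsearch example; query_string searches across all indexed fields by default, so no per-column index design is required.

```python
# Minimal sketch: querying across an arbitrary set of fields (hypothetical index).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

hits = es.search(
    index="events",
    query={"query_string": {"query": "cat AND size_kb:[100 TO 1000]"}},
)
for hit in hits["hits"]["hits"]:
    print(hit["_source"])
```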

    So simply walking down the path of data acquisition, data structuring, data processing, and data visualization in the context of current data management trends makes it easy to conclude that size is no longer the main concern.

    Typical Business Use Cases

    In addition to technical and architecture considerations, you may be facing use cases that are typical Big Data use cases. Some of them are tied to a specific industry; others are not specialized and can be applied to various industries.

    These considerations are generally based on analyzing application logs, such as web access logs, application server logs, and database logs, but they can also be based on other types of data sources, such as social network data.

    When you are facing such use cases, you might want to consider a distributed Big Data architecture if you want to be able to scale out as your business grows.

    Consumer Behavioral Analytics

    Knowing your customer, or what we usually call the 360-degree customer view, might be the most popular Big Data use case. This customer view is usually used on e-commerce websites and starts with an unstructured clickstream—in other words, it is made up of the active and passive website navigation actions that a visitor performs. By counting and analyzing the clicks and impressions on ads or products, you can adapt the visitor’s user experience depending on their behavior, while keeping in mind that the goal is to gain insight in order to optimize the funnel conversion.
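    As a minimal sketch of the counting step, assuming PySpark and a hypothetical newline-delimited JSON clickstream with visitor_id and event fields:

```python
# Minimal sketch: aggregating a clickstream per visitor (hypothetical input/fields).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("clickstream").getOrCreate()

clicks = spark.read.json("clickstream.json")  # one event per line

# Count clicks and ad impressions per visitor to feed the 360-degree view.
profile = clicks.groupBy("visitor_id", "event").count()
profile.show()
```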

    Sentiment Analysis

    Companies care about how their image and reputation are perceived across social networks; they want to minimize any negative events that might affect their brand and leverage positive events. By crawling a large amount of social data in near real time, they can extract the feelings and sentiments of social communities regarding their brand, identify influential users, and contact them in order to change or reinforce a trend depending on the outcome of their interaction with such users.
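    As a deliberately naive, lexicon-based sketch of sentiment scoring (a real deployment would use a trained model; the word lists are hypothetical):

```python
# Minimal sketch: crude lexicon-based sentiment scoring (hypothetical lexicons).
POSITIVE = {"love", "great", "awesome"}
NEGATIVE = {"hate", "broken", "awful"}

def sentiment(message: str) -> int:
    """Return a crude score: each positive word counts +1, each negative word -1."""
    words = message.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment("I love this brand"))       # 1
print(sentiment("awful support, hate it"))  # -2
```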

    CRM Onboarding

    You can combine consumer behavioral analytics with sentiment analysis based on data surrounding the visitor’s social activities. Companies want to combine these online data sources with the existing offline data, which is called CRM (customer relationship management) onboarding, in order to get better and more accurate customer segmentation. Thus, companies can leverage this segmentation and build a better targeting system to send profile-customized offers through marketing actions.
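    A minimal sketch of the onboarding join, assuming PySpark and hypothetical file paths, keys, and column names:

```python
# Minimal sketch: joining online behavior with offline CRM data (hypothetical schema).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("crm-onboarding").getOrCreate()

online = spark.read.json("online_profiles.json")  # e.g., visitor_id, email, clicks
offline = spark.read.csv("crm.csv", header=True)  # e.g., email, segment, lifetime_value

# Matching on a shared identifier (here, email) merges both views of the
# customer into one record used for finer segmentation.
customer_360 = online.join(offline, on="email", how="inner")
customer_360.show()
```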

    Prediction

    Learning from data has become the main Big Data trend for the past two years. Prediction-enabled Big Data can be very efficient in multiple industries, such as the telecommunications industry, where predictive analysis of router logs has become widespread. Every time an issue is likely to occur on a device, the company can predict it and order parts in advance to avoid downtime or lost profits.
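    A minimal sketch of such a prediction, assuming scikit-learn; the features and training rows are toy placeholders rather than real router-log data:

```python
# Minimal sketch: predicting device failure from log-derived features (toy data).
from sklearn.linear_model import LogisticRegression

# Each row: [errors_last_hour, temperature_celsius, uptime_days]
X = [[0, 45, 120], [12, 71, 400], [1, 50, 30], [20, 80, 500]]
y = [0, 1, 0, 1]  # 1 = the device failed shortly afterward

model = LogisticRegression().fit(X, y)
print(model.predict([[15, 75, 450]]))  # a likely failure -> order parts proactively
```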

    When combined with the previous use cases, you can use predictive architecture to optimize the product catalog selection and pricing depending on the user’s global behavior.

    Understanding the Big Data Project’s Ecosystem

    Once you understand that you actually have a Big Data project to implement, the hardest thing is choosing the technologies to use in your architecture. It is not just about picking the most famous Hadoop-related technologies; it’s also about understanding how to classify them in order to build a consistent distributed architecture.

    To get an idea of the number of projects in the Big Data galaxy, browse to https://github.com/zenkay/bigdata-ecosystem#projects-1 to see more than 100 classified projects.

    Here, you see that you might consider choosing a Hadoop distribution, a distributed file system, a SQL-like processing language, a machine learning language, a scheduler, message-oriented middleware, a NoSQL datastore, data visualization, and so on.

    Since this book’s purpose is to describe a scalable way to build a distributed architecture, I don’t dive into all categories of projects; instead, I highlight the ones you are likely to use in a typical Big Data project. You can eventually adapt this architecture and integrate projects depending on your needs. You’ll see concrete examples of using such projects in the dedicated parts.

    To make the Hadoop technologies presented here more relevant, we will work on a distributed architecture that meets the previously described typical use cases, namely these:

    Consumer behavioral analytics

    Sentiment analysis

    CRM onboarding and prediction

    Hadoop Distribution

    In a Big Data project that involves Hadoop-related ecosystem technologies, you have two choices:

    Download the project you need separately and try to create or assemble the technologies in a coherent, resilient, and consistent architecture.

    Use one of the most popular Hadoop distributions, which assemble or create the technologies for you.

    Although the first option is completely feasible, you might want to choose the second one, because a packaged Hadoop distribution ensures compatibility between all installed components and provides ease of installation, configuration-based deployment, monitoring, and support.

    Hortonworks and Cloudera are the main actors in this field. There are a couple of differences between the two vendors, but as a starting Big Data package, they are equivalent, as long as you don’t pay attention to the proprietary add-ons.

    My goal here is not to present all the components within each distribution but to focus on what each vendor adds to the standard ecosystem. I describe most of the other components in the following pages depending on what we need for our architecture in each situation.

    Cloudera CDH

    Cloudera adds a set of in-house components to the Hadoop-based components; these components are designed to give you better cluster management and search experiences.

    The following is a list of some of these components:

    Impala: A real-time, parallelized, SQL-based engine that searches for data in HDFS (Hadoop Distributed File System) and HBase. Impala is considered to be the fastest querying engine within the Hadoop distribution vendors market, and it is a direct competitor of Spark from UC Berkeley. A minimal query sketch appears after this list.

    Cloudera Manager: This is Cloudera’s console to manage and deploy Hadoop components within your Hadoop cluster.

    Hue: A console that lets the user interact with the data and run scripts for the different Hadoop components.
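    As referenced in the Impala item above, here is a minimal sketch of querying Impala from Python, assuming the impyla package and a reachable Impala daemon; the host and table are hypothetical.

```python
# Minimal sketch: running SQL on Impala via impyla (hypothetical host/table).
from impala.dbapi import connect

conn = connect(host="impala-host", port=21050)  # 21050 is Impala's usual HiveServer2 port
cur = conn.cursor()

# Impala executes the SQL directly against data stored in HDFS or HBase.
cur.execute("SELECT visitor_id, COUNT(*) FROM events GROUP BY visitor_id")
for row in cur.fetchall():
    print(row)
```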
