Pro Apache Hadoop

About this ebook

Pro Apache Hadoop, Second Edition brings you up to speed on Hadoop – the framework of big data. Revised to cover Hadoop 2.0, the book covers the very latest developments such as YARN (aka MapReduce 2.0), new HDFS high-availability features, and increased scalability in the form of HDFS Federations. All the old content has been revised too, giving the latest on the ins and outs of MapReduce, cluster design, the Hadoop Distributed File System, and more.

This book covers everything you need to build your first Hadoop cluster and begin analyzing and deriving value from your business and scientific data. Learn to solve big-data problems the MapReduce way, by breaking a big problem into chunks and creating small-scale solutions that can be flung across thousands upon thousands of nodes to analyze large data volumes in a short amount of wall-clock time. Learn how to let Hadoop take care of distributing and parallelizing your software—you just focus on the code; Hadoop takes care of the rest.

  • Covers all that is new in Hadoop 2.0
  • Written by a professional involved in Hadoop since day one
  • Takes you quickly to the seasoned pro level on the hottest cloud-computing framework  
Language: English
Publisher: Apress
Release date: September 18, 2014
ISBN: 9781430248644

    © Sameer Wadkar and Madhu Siddalingaiah 2014

    Sameer Wadkar and Madhu Siddalingaiah, Pro Apache Hadoop, DOI 10.1007/978-1-4302-4864-4_1

    1. Motivation for Big Data

    Sameer Wadkar¹ and Madhu Siddalingaiah¹

    (1) MD, US

    The computing revolution that began more than 2 decades ago has led to large amounts of digital data being amassed by corporations. Advances in digital sensors; proliferation of communication systems, especially mobile platforms and devices; massive scale logging of system events; and rapid movement toward paperless organizations have led to a massive collection of data resources within organizations. And the increasing dependence of businesses on technology ensures that the data will continue to grow at an even faster rate.

    Moore’s Law, which says that the performance of computers has historically doubled approximately every 2 years, initially helped computing resources to keep pace with data growth. However, this pace of improvement in computing resources started tapering off around 2005.

    The computing industry started looking at other options, namely parallel processing, to provide a more economical solution. If one computer could not get faster, the goal was to use many computing resources to tackle the same problem in parallel. Hadoop is an implementation of this idea: multiple computers in a network apply MapReduce (a variation of the single instruction, multiple data [SIMD] class of computing techniques) to scale data processing.

    The evolution of cloud-based computing through vendors such as Amazon, Google, and Microsoft provided a boost to this concept because we can now rent computing resources for a fraction of the cost it takes to buy them.

    This book is designed to be a practical guide to developing and running software using Hadoop, a project hosted by the Apache Software Foundation and now extended and supported by various vendors such as Cloudera, MapR, and Hortonworks. This chapter will discuss the motivation for Big Data in general and Hadoop in particular.

    What Is Big Data?

    In the context of this book, one useful definition of Big Data is any dataset that cannot be processed or (in some cases) stored using the resources of a single machine to meet the required service level agreements (SLAs). The latter part of this definition is crucial. It is possible to process virtually any scale of data on a single machine. Even data that cannot be stored on a single machine can be brought into one machine by reading it from a shared storage such as a network attached storage (NAS) medium. However, the amount of time it would take to process this data would be prohibitively large with respect to the available time to process this data.

    Consider a simple example. Suppose the average job processed by a business unit involves 200 GB of data, and assume that we can read about 50 MB per second from disk. At 50 MB per second, we need 2 seconds to read 100 MB of data sequentially, and approximately 1 hour to read the entire 200 GB. Now imagine that this data must be processed in under 5 minutes. If the 200 GB required per job could be evenly distributed across 100 nodes, and each node could process its own data (consider a simplified use-case such as selecting a subset of data based on a simple criterion: SALES_YEAR>2001), then, discounting the time taken for CPU processing and for assembling the results from the 100 nodes, the total processing can be completed in under 1 minute.
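
    The rough arithmetic behind these figures, using the assumed 50 MB/s sequential read rate (the numbers are illustrative, not benchmarks):

    200 GB on one reader: 200,000 MB ÷ 50 MB/s = 4,000 seconds ≈ 67 minutes
    200 GB across 100 nodes: 2 GB per node; 2,000 MB ÷ 50 MB/s = 40 seconds, read in parallel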

    This simplistic example shows that Big Data is context-sensitive and that the context is provided by business need.

    Note

    Dr. Jeff Dean discusses parallelism and typical latency numbers in a keynote you can find at www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf . To read 1 MB of data sequentially from a local disk requires 20 million nanoseconds. Reading the same data over a 1 Gbps network requires about 250 million nanoseconds (assuming that each 2 KB needs 250,000 nanoseconds to transfer and 500,000 nanoseconds per round-trip). Although the link is a bit dated, and the numbers have changed since then, we will use these numbers in this chapter for illustration. The proportions of the numbers with respect to each other, however, have not changed much.
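
    As a back-of-the-envelope reconstruction of where the quoted network figure comes from (one way to reproduce it from the stated per-2 KB assumptions, not an additional measurement):

    1 MB sequentially from local disk ≈ 20,000,000 ns = 20 ms
    1 MB over the network ≈ 512 chunks of 2 KB × ~500,000 ns per round-trip ≈ 256,000,000 ns ≈ 250 ms

    Under these assumptions, pulling data over the network is roughly an order of magnitude slower than a sequential local-disk read, which is the proportion the rest of the chapter relies on.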

    Key Idea Behind Big Data Techniques

    Although we have made many assumptions in the preceding example, the key takeaway is that although we can process data very fast, there are significant limitations on how fast we can read it from persistent storage. Sending data across the network is even slower than reading from or writing to node-local persistent storage.

    Some of the common characteristics of all Big Data methods are the following:

    Data is distributed across several nodes (Network I/O speed << Local Disk I/O Speed).

    Applications are distributed to data (nodes in the cluster) instead of the other way around.

    As much as possible, data is processed local to the node (Network I/O speed << Local Disk I/O Speed).

    Random disk I/O is replaced by sequential disk I/O (for small random reads, seek time dominates transfer time).

    The purpose of all Big Data paradigms is to parallelize input/output (I/O) to achieve performance improvements.

    Data Is Distributed Across Several Nodes

    By definition, Big Data is data that cannot be processed using the resources of a single machine. One of the selling points of Big Data is the use of commodity machines. A typical commodity machine would have a 2–4 TB disk. Because Big Data refers to datasets much larger than that, the data would be distributed across several nodes.

    Note that a dataset does not need to reach tens of terabytes before it makes sense to distribute it across several nodes. You will see that Big Data systems typically process data in place on the node. Because a large number of nodes participate in data processing, it is essential to distribute data across these nodes. Thus, even a 500 GB dataset would be distributed across multiple nodes, even though a single machine in the cluster would be capable of storing it. The purpose of this data distribution is twofold:

    Each data block is replicated across more than one node (the default Hadoop replication factor is 3). This makes the system resilient to failure. If one node fails, other nodes have a copy of the data hosted on the failed node.

    For parallel-processing reasons, several nodes participate in the data processing. Thus, 50 GB of data spread across 10 nodes enables all 10 nodes to process their own subdatasets, achieving a 5–10 times improvement in performance. The reader may well ask why all the data is not kept on a network file system (NFS), from which each node could read its portion. The answer is that reading from a local disk is significantly faster than reading over the network. Big Data systems make this local computation possible because the application libraries are copied to each data node before a job (an application instance) is started. We discuss this in the next section.
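
    As a rough illustration of the 5–10 times figure, assuming the same 50 MB/s sequential read rate used earlier and ignoring coordination overhead:

    50 GB on a single node: 50,000 MB ÷ 50 MB/s = 1,000 seconds
    50 GB across 10 nodes: 5 GB per node; 5,000 MB ÷ 50 MB/s = 100 seconds in parallel

    The ideal gain is 10 times; scheduling, replication traffic, and assembling the results typically pull the realized gain down toward the lower end of the 5–10 times range.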

    Applications Are Moved to the Data

    For those of us who rode the J2EE wave, the three-tier architecture was drilled into us. In the three-tier programming model, the data is processed in the centralized application tier after being brought into it over the network. We are used to the notion of data being distributed but the application being centralized.

    Big Data cannot handle this network overhead. Moving terabytes of data to the application tier will saturate the networks and introduce considerable inefficiencies, possibly leading to system failure. In the Big Data world, the data is distributed across nodes, but the application moves to the data. It is important to note that this process is not easy. Not only does the application need to be moved to the data but all the dependent libraries also need to be moved to the processing nodes. If your cluster has hundreds of nodes, it is easy to see why this can be a maintenance/deployment nightmare. Hence Big Data systems are designed to allow you to deploy the code centrally, and the underlying Big Data system moves the application to the processing nodes prior to job execution.

    Data Is Processed Local to a Node

    This attribute of data being processed local to a node is a natural consequence of the earlier two attributes of Big Data systems. All Big Data programming models are distributed- and parallel-processing based. Network I/O is orders of magnitude slower than disk I/O. Because data has been distributed to various nodes, and application libraries have been moved to the nodes, the goal is to process the data in place.

    Although processing data local to the node is preferred by a typical Big Data system, it is not always possible. Big Data systems will schedule tasks on nodes as close to the data as possible. You will see in the sections to follow that for certain types of systems, certain tasks require fetching data across nodes. At the very least, the results from every node have to be assimilated on a node (the famous reduce phase of MapReduce or something similar for massively parallel programming models). However, the final assimilation phases for a large number of use-cases have very little data compared with the raw data processed in the node-local tasks. Hence the effect of this network overhead is usually (but not always) negligible.

    Sequential Reads Preferred Over Random Reads

    First, you need to understand how data is read from the disk. The disk head needs to be positioned where the data is located on the disk. This process, which takes time, is known as the seek operation. Once the disk head is positioned as needed, the data is read off the disk sequentially. This is called the transfer operation. Seek time is approximately 10 milliseconds, and transferring data takes on the order of 20 milliseconds per 1 MB. This means that if we were reading 100 MB from 100 separate 1 MB sections of the disk, it would cost us 10 ms (seek time) × 100 (seeks) = 1 second, plus 20 ms (transfer time per 1 MB) × 100 = 2 seconds, for a total of 3 seconds to read 100 MB. However, if we were reading 100 MB sequentially from the disk, it would cost us 10 ms (seek time) × 1 (seek) = 10 milliseconds, plus 20 ms × 100 = 2 seconds, for a total of 2.01 seconds.
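
    Restating that comparison compactly (using the approximate figures above, which are illustrative rather than measured):

    Random: 100 seeks × 10 ms + 100 MB × 20 ms/MB = 1,000 ms + 2,000 ms = 3.0 seconds
    Sequential: 1 seek × 10 ms + 100 MB × 20 ms/MB = 10 ms + 2,000 ms ≈ 2.01 seconds

    The transfer cost is identical in both cases; what sequential access removes is the repeated seek cost, and the gap widens as the individual reads become smaller and more scattered.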

    Note that we have used numbers based on Dr. Jeff Dean's address, which is from 2009. Admittedly, the numbers have changed; in fact, they have improved since then. However, the relative proportions between the numbers have not changed, so we will use them for consistency.

    Most throughput-oriented Big Data programming models exploit this feature. Data is swept sequentially off the disk and filtered in main memory. Contrast this with a typical relational database management system (RDBMS) model, which is much more random-read oriented.

    An Example

    Suppose that you want to get the total sales numbers for the year 2000 ordered by state, and the sales data is distributed randomly across multiple nodes. The Big Data technique to achieve this can be summarized in the following steps:

    1. Each node reads in the entire sales data and filters out sales data that is not for the year 2000. The data is distributed randomly across all nodes and is read sequentially off the disk. The filtering happens in main memory, not on the disk, to avoid the cost of seeks.

    2. Each node proceeds to create a group for each state as it is discovered and adds the sales numbers to that state's bucket. (The application is present on all nodes, and data is processed local to a node.)

    3. When all the nodes have completed sweeping the sales data off the disk and computing the total sales by state, they send their respective totals to a designated node (we call this node the assembler node), which has been agreed upon by all nodes at the beginning of the process.

    4. The designated assembler node assembles the total-sales-by-state figures from each node and adds up the values received per state.

    5. The assembler node sorts the final numbers by state and delivers the results.

    This process demonstrates a typical trait of a Big Data system: it focuses on maximizing throughput (how much work gets done per unit of time) over latency (how quickly an individual request is answered, the critical measure by which transactional systems are judged, because we want the fastest possible response).
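
    As a concrete (if simplified) sketch of what each node does in steps 1 through 3, the following plain Java fragment filters a node-local file and accumulates per-state totals in memory. The record layout (one state,year,amount record per line), the class name, and the method of delivery to the assembler node are illustrative assumptions, not code from the book:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    // Per-node aggregation sketch: sweep the local data file sequentially,
    // filter for the year 2000 in memory, and bucket sales totals by state.
    public class NodeLocalSalesAggregator {

        public static Map<String, Double> aggregate(String localFile) throws IOException {
            Map<String, Double> totalsByState = new HashMap<>();
            try (BufferedReader reader = new BufferedReader(new FileReader(localFile))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Assumed record layout: state,year,amount
                    String[] fields = line.split(",");
                    if (fields.length == 3 && "2000".equals(fields[1].trim())) {
                        totalsByState.merge(fields[0].trim(),
                                Double.parseDouble(fields[2].trim()), Double::sum);
                    }
                }
            }
            // In step 3, these partial totals would be sent to the assembler node.
            return totalsByState;
        }
    }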

    Big Data Programming Models

    The major types of Big Data programming models you will encounter are the following:

    Massively parallel processing (MPP) database systems: EMC's Greenplum and IBM's Netezza are examples of such systems.

    In-memory database systems: Examples include Oracle Exalytics and SAP HANA.

    MapReduce systems: These systems include Hadoop, which is the most general-purpose of all the Big Data systems.

    Bulk synchronous parallel (BSP) systems: Examples include Apache HAMA and Apache Giraph.

    Massively Parallel Processing (MPP) Database Systems

    At their core, MPP systems employ some form of splitting data based on the values contained in a column or a set of columns. For example, in the earlier example in which sales for the year 2000, ordered by state, were computed, we could have partitioned the data by state so that certain nodes would contain data for certain states. This method of partitioning would enable each node to compute the total sales for the year 2000 for its own states.

    The limitation of such a system should be obvious: you need to decide how the data will be split at design time, and the splitting criteria chosen are often driven by the underlying use-case. As such, MPP systems are not well suited to ad hoc querying. Certain queries will execute at blazing speed because they can take advantage of how the data is split between nodes. Others will operate at a crawl because the data is not distributed in a manner consistent with how the query accesses it, requiring data to be transferred between nodes over the network.

    To handle this limitation, it is common for such systems to store the data multiple times, split by different criteria. Depending on the query, the appropriate dataset is picked.

    Following is the way in which the MPP programming model meets the attributes defined earlier for Big Data systems (consider the sales ordered by the state example):

    Data is split by state on separate nodes.

    Each node contains all the necessary application libraries to work on its own subset of the data.

    Each node reads data local to itself. An exception is when you apply a query that does not respect how the data is distributed; in this case, each task needs to fetch its own data from other nodes over the network.

    Data is read sequentially for each task. All the sales data is co-located and swept off the disk. The filter (year = 2000) is applied in memory.

    In-Memory Database Systems

    From an operational perspective, in-memory database systems are identical to MPP systems. The implementation difference is that each node has a significant amount of memory, and most data is preloaded into memory. SAP HANA operates on this principle. Other systems, such as Oracle Exalytics, use specialized hardware to ensure that multiple hosts are housed in a single appliance. At the core, an in-memory database is like an in-memory MPP database with a SQL interface.

    One of the major disadvantages of the commercial implementations of in-memory databases is considerable hardware and software lock-in. Also, given that these systems use proprietary and very specialized hardware, they are usually expensive. Trying to use commodity hardware for in-memory databases increases the size of the cluster very quickly. Consider, for example, a commodity server that has 25 GB of RAM. Hosting a 1 TB in-memory database would need more than 40 such hosts (accounting for the other activities that need to be performed on each server). 1 TB is not even that big, and we are already up to a 40-node cluster.
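
    The sizing arithmetic behind that estimate (illustrative only, and assuming roughly 25 GB of RAM per node is usable for data after other server activities):

    1 TB ≈ 1,024 GB; 1,024 GB ÷ 25 GB per node ≈ 41 nodes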

    The following describes how the in-memory database programming model meets the attributes we defined earlier for the Big Data systems:

    Data is split by state in the earlier example. Each node loads data into memory.

    Each node contains all the necessary application libraries to work on its own subset.

    Each node reads data local to itself. The exception is when you apply a query that does not respect how the data is distributed; in this case, each task needs to fetch its data from other nodes.

    Because data is cached in memory, the Sequential Data Read attribute does not apply except when the data is read into memory the first time.

    MapReduce Systems

    MapReduce is the paradigm on which this book is based. It is by far the most general-purpose of the four methods. Some of the important characteristics of Hadoop's implementation of MapReduce are the following:

    It uses commodity scale hardware. Note that commodity scale does not imply laptops or desktops. The nodes are still enterprise scale, but they use commonly available components.

    Data does not need to be partitioned among nodes based on any predefined criteria.

    The user needs to define only two separate processes: map and reduce.

    We will discuss MapReduce extensively in this book. At a very high level, a MapReduce system requires the user to define a map process and a reduce process. When Hadoop is used to implement MapReduce, the data is typically distributed in 64 MB–128 MB blocks, and each block is stored as three copies (a replication factor of 3 is the default in Hadoop). In the example of computing sales for the year 2000, ordered by state, the entire sales data would be loaded into the Hadoop Distributed File System (HDFS) as blocks (64 MB–128 MB in size). When the MapReduce process is launched, the system first transfers all the application libraries (comprising the user-defined map and reduce processes) to each node.

    Map tasks are scheduled on the nodes to sweep the blocks comprising the sales data file. Each Mapper (on its respective node) reads the records of its block and filters out the records that are not for the year 2000. Each Mapper then outputs a key/value pair for each qualifying record: the key is the state, and the value is the sales amount from that record.

    Finally, a configurable number of Reducers will receive the key/value pairs from the Mappers. Keys are assigned to specific Reducers to ensure that a given key is received by one and only one Reducer. Each Reducer then adds up the sales values for all the key/value pairs it receives; the data a Reducer receives is a key (the state) and a list of values for that key (the sales amounts for the year 2000). The output is written back to HDFS, and the client then reads the result from HDFS and sorts it by state. This last step can instead be delegated to the Reducer, because each Reducer receives its assigned keys in sorted order; in this example, however, we would need to restrict the number of Reducers to one to achieve that. Because communication between Mappers and Reducers causes network I/O, it can lead to bottlenecks. We will discuss this issue in detail later in the book.
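
    A minimal Hadoop sketch of the map and reduce steps just described might look like the following (this uses the Hadoop 2.x org.apache.hadoop.mapreduce API; the CSV record layout of state,year,amount, the class names, and the choice of DoubleWritable are illustrative assumptions, not code from the book):

    import java.io.IOException;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SalesByState {

        // Map step: emit (state, amount) only for records from the year 2000.
        public static class SalesMapper
                extends Mapper<LongWritable, Text, Text, DoubleWritable> {
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                // Assumed record layout: state,year,amount
                String[] fields = line.toString().split(",");
                if (fields.length == 3 && "2000".equals(fields[1].trim())) {
                    context.write(new Text(fields[0].trim()),
                            new DoubleWritable(Double.parseDouble(fields[2].trim())));
                }
            }
        }

        // Reduce step: sum all sales amounts received for a given state.
        public static class SalesReducer
                extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
            @Override
            protected void reduce(Text state, Iterable<DoubleWritable> amounts, Context context)
                    throws IOException, InterruptedException {
                double total = 0;
                for (DoubleWritable amount : amounts) {
                    total += amount.get();
                }
                context.write(state, new DoubleWritable(total));
            }
        }
    }

    The framework performs the shuffle between the two steps: it groups the Mappers' output by key so that each state, with all of its values, arrives at exactly one Reducer.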

    This is how the MapReduce programming model meets the attributes defined earlier for the Big Data systems:

    Data is split into large blocks on HDFS. Because HDFS is a distributed file system, the data blocks are distributed redundantly across all the nodes.

    The application libraries, including the map and reduce application code, are propagated to all the task nodes.

    Each node reads data local to itself. Mappers are launched on all the nodes and read the data blocks local to themselves (in most cases; the mapping between tasks and disk blocks is up to the scheduler, which may allocate remote blocks to map tasks to keep all nodes busy).

    Data is read sequentially for each task, one large block at a time (blocks are typically 64 MB–128 MB in size).

    One of the important limitations of the MapReduce paradigm is that it is not suitable for iterative algorithms. A vast majority of data science algorithms are iterative by nature and eventually converge to a solution. When applied to such algorithms, the MapReduce paradigm requires each iteration to be run as a separate MapReduce job, and each iteration often uses the data produced by its previous iteration. But because each MapReduce job reads fresh from the persistent storage, the iteration needs to store its results in persistent storage for the next iteration to work on. This process leads to unnecessary I/O and significantly impacts the overall throughput. This limitation is addressed by the BSP class of systems, described next.

    Bulk Synchronous Parallel (BSP) Systems

    The BSP class of systems operates very similarly to the MapReduce approach. However, instead of the MapReduce job terminating at the end of its processing cycle, the BSP system is composed of a list of processes (identical to the map processes) that synchronize on a barrier, send data to the Master node, and exchange relevant information. Once the iteration is completed, the Master node will indicate to each processing node to resume the next iteration.

    Synchronizing on a barrier is a commonly used concept in parallel programming. It is used when many threads are responsible for performing their own tasks but need to agree on a checkpoint before proceeding. This pattern is needed when all threads must have completed their work up to a certain point before a decision is made to proceed with or abort the rest of the computation (in parallel or in sequence). Synchronization barriers appear in real-world processes all the time; for example, carpool mates often meet at a designated place before proceeding in a single car. The overall process is only as fast as the last person (or thread) arriving at the barrier.
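
    Barrier synchronization is easy to see in miniature with plain Java threads. The following sketch is a generic illustration using java.util.concurrent, not Hadoop or BSP framework code; the worker count, iteration count, and simulated work are arbitrary choices for the example:

    import java.util.concurrent.CyclicBarrier;

    // Each worker finishes its local work for an iteration and then waits at the
    // barrier; only when all workers have arrived does the next iteration begin,
    // so the overall pace is set by the slowest worker.
    public class BarrierDemo {
        public static void main(String[] args) {
            final int workers = 4;
            final int iterations = 3;
            CyclicBarrier barrier = new CyclicBarrier(workers,
                    () -> System.out.println("All workers reached the barrier; starting next iteration"));

            for (int w = 0; w < workers; w++) {
                final int id = w;
                new Thread(() -> {
                    try {
                        for (int i = 0; i < iterations; i++) {
                            Thread.sleep((long) (Math.random() * 100)); // simulated local work
                            System.out.println("Worker " + id + " finished iteration " + i);
                            barrier.await(); // wait until every worker has finished this iteration
                        }
                    } catch (Exception e) {
                        Thread.currentThread().interrupt();
                    }
                }).start();
            }
        }
    }

    Conceptually, a BSP superstep corresponds to one pass through this loop, with message exchange taking place at the barrier.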

    The BSP method of execution allows each map-like process to cache its previous iteration's data, significantly improving the throughput of the overall process. We will discuss BSP systems in the Data Science chapter of this book; they are most relevant to iterative algorithms.

    Big Data and Transactional Systems

    It is important to understand how the concept of transactions has evolved in the context of Big Data. This discussion is relevant to NoSQL databases. Hadoop has HBase as its NoSQL data store. Alternatively, you can use Cassandra or NoSQL systems available in the cloud such as Amazon Dynamo.

    Although most RDBMS users expect ACID properties in databases, these properties come at a cost. When the underlying database needs to handle millions of transactions per second at peak time, it is extremely challenging to respect ACID features in their purest form.

    Note

    ACID is an acronym for atomicity, consistency, isolation, and durability. A detailed discussion can be found at the following link: http://en.wikipedia.org/wiki/ACID .

    Some compromises are necessary, and the motivation behind these compromises is encapsulated in what is known as the CAP theorem (also known as Brewer’s theorem). CAP is an acronym for the following:

    Consistency: All nodes see the same copy of the data at all times.

    Availability: A guarantee that every request receives a response about success or failure within a reasonable and well-defined time interval.

    Partition tolerance: The system continues to perform despite failure of its parts.

    The theorem goes on to prove that in any system only two of the preceding features are achievable, not all three. Now, let’s examine various types of systems:

    Consistent and available: A single RDBMS with ACID properties is an example of a system that is consistent and available. It is not partition-tolerant; if the RDBMS goes down, users cannot access the data.

    Consistent and partition-tolerant: A clustered RDBMS is such a system. Distributed transactions ensure that all users always see the same data (consistency), and the distributed nature of the data ensures that the system survives the loss of individual nodes. However, by virtue of distributed transactions, the system will be unavailable for periods of time while two-phase commits are being issued. This limits the number of simultaneous transactions the system can support, which in turn limits its availability.

    Available and partition-tolerant: Systems classified as eventually consistent fall into this category. Consider a very popular e-commerce web site such as Amazon.com. Imagine that you are browsing the product catalog and notice that two units of a certain item are available for sale. By the nature of the buying process, you are aware that between you noticing that a certain number of items are available and issuing the buy request, someone else could come in first and buy them. So there is little incentive to always show the most up-to-date value as the inventory changes. Inventory changes will be propagated to all the nodes serving users. Preventing users from browsing the inventory while this propagation takes place, just to provide the most current value, would limit the availability of the web site and result in lost sales. Thus, we have sacrificed consistency for availability, and partition tolerance allows multiple nodes to serve the same data (although there may be a small window of time in which users see different values, depending on the nodes that serve them).

    These decisions are critical when developing Big Data systems. MapReduce, the main topic of this book, is only one component of the Big Data ecosystem. It often exists in the context of other products, such as HBase, in which making the trade-offs discussed in this section is critical to developing a good solution.

    How Much Can We Scale?

    We made several assumptions in our examples earlier in the chapter. For example, we ignored CPU time. For a large number of business problems, computational complexity does not dominate. However, with the growth in computing capability, various domains became practical from an implementation point of view. One example is data mining using complex Bayesian statistical techniques. These problems are indeed computationally expensive. For such problems, we need to increase the number of nodes to perform processing or apply alternative methods.

    Note

    The paradigms used in Big Data computing such as MapReduce have also been extended to other parallel computing methods. For example, general-purpose computation on graphics programming units (GPGPU) computing achieves massive parallelism for compute-intensive problems.

    We also ignored network I/O costs. Using 50 compute nodes to process data also requires a distributed file system and incurs communication costs for assembling results from the 50 nodes in the cluster. In all Big Data solutions, I/O costs dominate, and these costs introduce serial dependencies into the computational process.

    A Compute-Intensive Example

    Consider processing 200 GB of data with 50 nodes, in which each node processes 4 GB of data located on a local disk. Each node takes 80 seconds to read its data (at the rate of 50 MB per second), so no matter how fast we compute, we cannot finish in under 80 seconds. Assume that the result of the process is a dataset of total size 200 MB, of which each node generates 4 MB, and that this result is transferred over a 1 Gbps network (1 MB per packet) to a single node for display. It will take about 3 milliseconds to transfer each node's result to the destination node (each 1 MB requires 250 microseconds to transfer over the network, and the network latency per packet is assumed to be 500 microseconds, based on the previously referenced talk by Dr. Jeff Dean). Ignoring computational costs, the total processing time cannot be under 80.003 seconds.
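
    Spelling out that calculation with the stated assumptions (illustrative figures, not measurements):

    Read: 4 GB per node ÷ 50 MB/s ≈ 80 seconds, with all 50 nodes reading in parallel
    Transfer: 4 MB per node × 250 µs/MB + 4 packets × 500 µs latency ≈ 1 ms + 2 ms = 3 ms
    Lower bound: 80 s + 0.003 s ≈ 80.003 seconds, ignoring CPU time entirely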

    Now imagine that we have 4000 nodes, and magically each node reads its own 50 MB of data from a local disk and produces 0.05 MB of the result set. Notice that we cannot go faster than 1 second if data is read in 50 MB blocks. This translates to a maximum performance improvement by a factor of about 4000. In other words, for a certain class of problems, if it takes 4000 hours to complete the processing on a single node, we cannot do better than 1 hour, no matter how many nodes are thrown at the problem. A factor of 4000 might sound like a lot, but there is an upper limit to how fast we can get. In this simplistic example, we have made many simplifying system assumptions. We also assumed that there are no serial dependencies in the application logic, which is usually a false assumption. Once we add those costs, the maximum possible performance gain falls drastically.

    Serial dependencies, which are the bane of all parallel computing algorithms, limit the degree of performance improvement. This limitation is well known and is documented as Amdahl's Law.

    Amdahl's Law

    Just as the speed of light defines the theoretical limit of how fast we can travel in our universe, Amdahl's Law defines the limits of the performance gain we can achieve by adding more nodes to clusters.

    Note

    See http://en.wikipedia.org/wiki/Amdahl's_law for a full discussion of Amdahl's Law.

    In a nutshell, the law states that if a given solution can be made perfectly parallelizable up to a proportion P (where P ranges from 0 to 1), the maximum performance improvement we can obtain, given an infinite number of nodes (a fancy way of saying a lot of nodes in the cluster), is 1/(1-P). Thus, if even 1 percent of the execution cannot be made parallel, the best improvement we can get is 100-fold. All programs have some serial dependencies, and disk I/O and network I/O add more. There are limits to how much improvement we can achieve, regardless of the methods we use.
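
    As a worked example of the law in its usual form (the 1 percent figure is the one from the paragraph above):

    Speedup(N) = 1 / ((1 − P) + P/N)
    As N → ∞, Speedup → 1 / (1 − P)
    With P = 0.99: Speedup ≤ 1 / 0.01 = 100, no matter how many nodes are added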

    Business Use-Cases for Big Data

    Big Data and Hadoop have several applications in the business world. At the risk of sounding cliché, the three big attributes of Big Data are considered to be these:

    Volume

    Velocity

    Variety

    Volume relates to the size of the data processed. If your organization needs to extract, load, and transform 2 TB of data in 2 hours each night, you have a volume problem.

    Velocity relates to the speed at which data arrives. Organizations such as Facebook and Twitter encounter the velocity problem: they receive massive numbers of tiny messages every second, and each message needs to be processed almost immediately, posted to the social media site, propagated to related users (family, friends, and followers), have events generated from it, and so on.

    Variety is related to an increasing number of formats that need to be processed. Enterprise search systems have become commonplace in organizations. Open-source software such as Apache Solr has made search-based systems ubiquitous. Most unstructured data is not stand-alone; it has considerable structured data associated with it. For example, consider a simple document such as an e-mail. E-mail has considerable metadata associated with it. Examples include sender, receivers, order of receivers, time sent/received, organizational information about the senders/receivers (for example, a title at the time of sending), and so on.

    Some of this information is even dynamic. For example, if you are analyzing years of e-mail (the legal practice area has several use-cases around this), it is important to know what the titles of the senders and receivers were when an e-mail was first sent. This feature of dynamic master data is commonplace and leads to several interesting challenges.

    Big Data helps solve everyday problems such as large-scale extract, transform, load (ETL) issues by using commodity software and hardware. In particular, open-source Hadoop, which runs on commodity servers and can scale by adding more nodes, enables ETL (or ELT, as it is commonly called in the Big Data domain) to be performed significantly faster at commodity costs.

    Several open-source products have evolved around Hadoop and the HDFS to support velocity and variety use-cases. New data formats have evolved to manage the I/O performance around massive data processing. This book will discuss the motivations behind such developments and the appropriate use-cases for them.

    Storm (which evolved at Twitter) and Apache Flume (designed for large-scale log analysis) evolved to handle the velocity factor. The choice of which software to use depends on how close to real time the processing needs to be; Storm is useful for problems that require processing closer to real time than Flume provides.

    The key message is this: Big Data is an ecosystem of various products that work in concert to solve very complex business problems. Hadoop is often at the center of such solutions. Understanding Hadoop enables you to develop a strong understanding of how to use the other entrants in the Big Data ecosystem.

    Summary

    Big Data has now become mainstream, and the two main drivers behind it are open-source Hadoop software and the advent of the cloud. Both of these developments have allowed mass-scale adoption of Big Data methods to handle business problems at low cost. Hadoop is the cornerstone of all Big Data solutions. Although other programming models, such as MPP and BSP, have sprung up to handle very specific problems, they all depend on Hadoop in some form or other when the scale of data to be processed reaches the multiterabyte range. Developing a deep understanding of Hadoop enables users of other programming models to be more effective. The goal of this book is to help you achieve that.

    The chapters to come will guide you through the specifics of using the Hadoop software as well as offer practical methods for solving problems with Hadoop.

    © Sameer Wadkar and Madhu Siddalingaiah 2014

    Sameer Wadkar and Madhu Siddalingaiah, Pro Apache Hadoop, DOI 10.1007/978-1-4302-4864-4_2

    2. Hadoop Concepts

    Sameer Wadkar¹ and Madhu Siddalingaiah¹

    (1) MD, US

    Applications frequently require more resources than are available on an inexpensive (commodity) machine. Many organizations find themselves with business processes that no longer fit on a single, cost-effective computer. A simple but expensive solution has been to buy specialty machines that have a lot of memory and many CPUs. This solution scales as far as the fastest machines available, and usually the only limiting factor is your budget. An alternative solution is to build a high-availability cluster, which typically attempts to look like a single machine and usually requires very specialized installation and administration services. Many high-availability clusters are proprietary and expensive.

    A more economical solution for acquiring the necessary computational resources is cloud computing. A common pattern is to have bulk data that needs to be transformed, in which the processing of each data item is essentially independent of other data items; that is, by using a single-instruction, multiple-data (SIMD) scheme. Hadoop provides an open-source framework for cloud computing, as well as a distributed file system.

    This book is designed to be a practical guide to developing and running software using Hadoop, a project hosted by the Apache Software Foundation. This chapter introduces you to the core Hadoop concepts. It is meant to prepare you for the next chapter, in which you will get Hadoop installed and running.

    Introducing Hadoop

    Hadoop is based on the Google paper on MapReduce published in 2004, and its development started in 2005. At the time, Hadoop was developed to support the open-source web search engine project called Nutch. Eventually, Hadoop separated from Nutch and became its own project under the Apache Foundation.

    Today Hadoop is the best–known MapReduce framework in the market. Currently, several companies have grown around Hadoop to provide support, consulting, and training services for the Hadoop software.

    At its core, Hadoop is a Java–based MapReduce framework. However, due to the rapid adoption of the Hadoop platform, there was a need to support the non–Java user community. Hadoop evolved into having the following enhancements and subprojects to support this community and expand its reach into the Enterprise:

    Hadoop Streaming: Enables using MapReduce with any command-line script. This makes MapReduce usable by UNIX script programmers, Python programmers, and so on for development of ad hoc jobs.

    Hadoop Hive: Users of MapReduce quickly realized that developing MapReduce programs is a very programming-intensive task, which makes it error-prone and hard to test. There was a need for more expressive languages such as SQL to enable users to focus on the problem instead of low-level implementations of typical SQL artifacts (for example, the WHERE clause, GROUP BY clause, JOIN clause, etc.). Apache Hive evolved to provide a data warehouse (DW) capability to large datasets. Users can express their queries in Hive Query Language, which is very similar to SQL. The Hive engine converts these queries to low-level MapReduce jobs transparently. More advanced users can develop user-defined functions (UDFs) in Java. Hive also supports standard drivers such as ODBC and JDBC. Hive is also an appropriate platform to use when developing Business Intelligence (BI) types of applications for data stored in Hadoop.

    Hadoop Pig: Although the motivation for Pig was similar to Hive, Hive is a SQL-like language, which is declarative. On the other hand, Pig is a procedural language that works well in data-pipeline scenarios. Pig will appeal to programmers who develop data-processing pipelines (for example, SAS programmers). It is also an appropriate platform to use for extract, load, and transform (ELT) types of applications.

    Hadoop HBase: All the preceding projects, including MapReduce, are batch processes. However, there is a strong need for real–time data lookup in Hadoop. Hadoop did not have a native key/value store. For example, consider a Social Media site such as Facebook. If you want to look up a friend’s profile, you expect to get an answer immediately (not after a long batch job runs). Such use-cases were the motivation for developing the HBase platform.

    We have only just scratched the surface of what Hadoop and its subprojects allow us to do. However, the previous examples should provide perspective on why Hadoop evolved the way it did. Hadoop started out as a MapReduce engine developed for the purpose of indexing massive amounts of text data. It slowly evolved into a general-purpose model to support standard Enterprise use-cases such as DW, BI, ELT, and real-time lookup caches. Although MapReduce is a very useful model, it was the adaptation to standard Enterprise use-cases of the type just described (ETL, DW) that enabled it to penetrate the mainstream computing market. Also important is that organizations are now grappling with processing massive amounts of data.

    For a very long time, Hadoop remained a system in which users submitted jobs that ran on the entire cluster. Jobs were executed in First In, First Out (FIFO) order. However, this led to situations in which a long-running, less-important job would hog resources and not allow a smaller yet more important job to execute. To solve this problem, more complex job schedulers, such as the Fair Scheduler and the Capacity Scheduler, were created for Hadoop. But Hadoop 1.x (prior to version 0.23) still had scalability limitations that were a result of some deeply entrenched design decisions.

    Yahoo engineers found that Hadoop had scalability problems when the number of nodes increased to the order of a few thousand ( http://developer.yahoo.com/blogs/hadoop/scaling-hadoop-4000-nodes-yahoo-410.html ). As these problems became better understood, the Hadoop engineers went back to the drawing board and reassessed some of the core assumptions underlying the original design; eventually this led to a major overhaul of the core Hadoop platform. Hadoop 2.x (from version 0.23 of Hadoop onward) is the result of this overhaul.

    This book will cover version 2.x with appropriate references to 1.x, so you can appreciate the motivation for the changes in 2.x.

    Introducing the MapReduce Model

    Hadoop supports the MapReduce model, which was introduced by Google as a method of solving a class of petascale problems with large clusters of commodity machines. The model is based on two distinct steps, both of which are custom and user-defined for an application:

    Map: An initial ingestion and transformation step in which individual input records can be processed in parallel

    Reduce: An aggregation or summarization step in which all associated records must be processed together by a single entity

    The core concept of MapReduce in Hadoop is that input can be split into logical chunks, and each chunk can be initially processed independently by a map task. The results of these individual processing chunks can be physically partitioned into distinct sets, which are then sorted. Each sorted chunk is passed to a reduce task. Figure 2-1 illustrates how the MapReduce model works.

    Figure 2-1. MapReduce model
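
    To show how the two user-defined steps are wired into a runnable Hadoop job, here is a minimal driver sketch in Hadoop's Java API. It reuses the illustrative SalesMapper and SalesReducer classes sketched in Chapter 1; the class name, output types, and command-line path arguments are assumptions for the example, not code from the book:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Minimal MapReduce driver: declares the map and reduce classes, the output
    // key/value types, and the input/output paths, then submits the job and waits.
    public class SalesByStateDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "sales by state");
            job.setJarByClass(SalesByStateDriver.class);
            job.setMapperClass(SalesByState.SalesMapper.class);   // user-defined map step
            job.setReducerClass(SalesByState.SalesReducer.class); // user-defined reduce step
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(DoubleWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }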

    A map task can run on any compute node in the cluster, and multiple map tasks can run in parallel across the cluster. The map task is responsible for transforming the input records into key/value pairs. The output
