Mastering Hadoop

About this ebook

Hadoop is synonymous with Big Data processing. Its simple programming model, "code once and deploy at any scale" paradigm, and ever-growing ecosystem make Hadoop an all-encompassing platform for programmers with different levels of expertise.

This book explores industry guidelines to optimize MapReduce jobs and higher-level abstractions such as Pig and Hive in Hadoop 2.0. It then dives deep into Hadoop 2.0-specific features such as YARN and HDFS Federation.

This book is a step-by-step guide that focuses on advanced Hadoop concepts and aims to take your Hadoop knowledge and skill set to the next level. The data processing flow dictates the order of the concepts in each chapter, and each chapter is illustrated with code fragments or schematic diagrams.

Language: English
Release date: Dec 29, 2014
ISBN: 9781783983650

    Book preview

    Mastering Hadoop - Sandeep Karanth

    Table of Contents

    Mastering Hadoop

    Credits

    About the Author

    Acknowledgments

    About the Reviewers

    www.PacktPub.com

    Support files, eBooks, discount offers, and more

    Why subscribe?

    Free access for Packt account holders

    Preface

    What this book covers

    What you need for this book

    Who this book is for

    Conventions

    Reader feedback

    Customer support

    Downloading the example code

    Errata

    Piracy

    Questions

    1. Hadoop 2.X

    The inception of Hadoop

    The evolution of Hadoop

    Hadoop's genealogy

    Hadoop-0.20-append

    Hadoop-0.20-security

    Hadoop's timeline

    Hadoop 2.X

    Yet Another Resource Negotiator (YARN)

    Architecture overview

    Storage layer enhancements

    High availability

    HDFS Federation

    HDFS snapshots

    Other enhancements

    Support enhancements

    Hadoop distributions

    Which Hadoop distribution?

    Performance

    Scalability

    Reliability

    Manageability

    Available distributions

    Cloudera Distribution of Hadoop (CDH)

    Hortonworks Data Platform (HDP)

    MapR

    Pivotal HD

    Summary

    2. Advanced MapReduce

    MapReduce input

    The InputFormat class

    The InputSplit class

    The RecordReader class

    Hadoop's small files problem

    Filtering inputs

    The Map task

    The dfs.blocksize attribute

    Sort and spill of intermediate outputs

    Node-local Reducers or Combiners

    Fetching intermediate outputs – Map-side

    The Reduce task

    Fetching intermediate outputs – Reduce-side

    Merge and spill of intermediate outputs

    MapReduce output

    Speculative execution of tasks

    MapReduce job counters

    Handling data joins

    Reduce-side joins

    Map-side joins

    Summary

    3. Advanced Pig

    Pig versus SQL

    Different modes of execution

    Complex data types in Pig

    Compiling Pig scripts

    The logical plan

    The physical plan

    The MapReduce plan

    Development and debugging aids

    The DESCRIBE command

    The EXPLAIN command

    The ILLUSTRATE command

    The advanced Pig operators

    The advanced FOREACH operator

    The FLATTEN operator

    The nested FOREACH operator

    The COGROUP operator

    The UNION operator

    The CROSS operator

    Specialized joins in Pig

    The Replicated join

    Skewed joins

    The Merge join

    User-defined functions

    The evaluation functions

    The aggregate functions

    The Algebraic interface

    The Accumulator interface

    The filter functions

    The load functions

    The store functions

    Pig performance optimizations

    The optimization rules

    Measurement of Pig script performance

    Combiners in Pig

    Memory for the Bag data type

    Number of reducers in Pig

    The multiquery mode in Pig

    Best practices

    The explicit usage of types

    Early and frequent projection

    Early and frequent filtering

    The usage of the LIMIT operator

    The usage of the DISTINCT operator

    The reduction of operations

    The usage of Algebraic UDFs

    The usage of Accumulator UDFs

    Eliminating nulls in the data

    The usage of specialized joins

    Compressing intermediate results

    Combining smaller files

    Summary

    4. Advanced Hive

    The Hive architecture

    The Hive metastore

    The Hive compiler

    The Hive execution engine

    The supporting components of Hive

    Data types

    File formats

    Compressed files

    ORC files

    The Parquet files

    The data model

    Dynamic partitions

    Semantics for dynamic partitioning

    Indexes on Hive tables

    Hive query optimizers

    Advanced DML

    The GROUP BY operation

    ORDER BY versus SORT BY clauses

    The JOIN operator and its types

    Map-side joins

    Advanced aggregation support

    Other advanced clauses

    UDF, UDAF, and UDTF

    Summary

    5. Serialization and Hadoop I/O

    Data serialization in Hadoop

    Writable and WritableComparable

    Hadoop versus Java serialization

    Avro serialization

    Avro and MapReduce

    Avro and Pig

    Avro and Hive

    Comparison – Avro versus Protocol Buffers / Thrift

    File formats

    The Sequence file format

    Reading and writing Sequence files

    The MapFile format

    Other data structures

    Compression

    Splits and compressions

    Scope for compression

    Summary

    6. YARN – Bringing Other Paradigms to Hadoop

    The YARN architecture

    Resource Manager (RM)

    Application Master (AM)

    Node Manager (NM)

    YARN clients

    Developing YARN applications

    Writing YARN clients

    Writing the Application Master entity

    Monitoring YARN

    Job scheduling in YARN

    CapacityScheduler

    FairScheduler

    YARN commands

    User commands

    Administration commands

    Summary

    7. Storm on YARN – Low Latency Processing in Hadoop

    Batch processing versus streaming

    Apache Storm

    Architecture of an Apache Storm cluster

    Computation and data modeling in Apache Storm

    Use cases for Apache Storm

    Developing with Apache Storm

    Apache Storm 0.9.1

    Storm on YARN

    Installing Apache Storm-on-YARN

    Prerequisites

    Installation procedure

    Summary

    8. Hadoop on the Cloud

    Cloud computing characteristics

    Hadoop on the cloud

    Amazon Elastic MapReduce (EMR)

    Provisioning a Hadoop cluster on EMR

    Summary

    9. HDFS Replacements

    HDFS – advantages and drawbacks

    Amazon AWS S3

    Hadoop support for S3

    Implementing a filesystem in Hadoop

    Implementing an S3 native filesystem in Hadoop

    Summary

    10. HDFS Federation

    Limitations of the older HDFS architecture

    Architecture of HDFS Federation

    Benefits of HDFS Federation

    Deploying federated NameNodes

    HDFS high availability

    Secondary NameNode, Checkpoint Node, and Backup Node

    High availability – edits sharing

    Useful HDFS tools

    Three-layer versus four-layer network topology

    HDFS block placement

    Pluggable block placement policy

    Summary

    11. Hadoop Security

    The security pillars

    Authentication in Hadoop

    Kerberos authentication

    The Kerberos architecture and workflow

    Kerberos authentication and Hadoop

    Authentication via HTTP interfaces

    Authorization in Hadoop

    Authorization in HDFS

    Identity of an HDFS user

    Group listings for an HDFS user

    HDFS APIs and shell commands

    Specifying the HDFS superuser

    Turning off HDFS authorization

    Limiting HDFS usage

    Name quotas in HDFS

    Space quotas in HDFS

    Service-level authorization in Hadoop

    Data confidentiality in Hadoop

    HTTPS and encrypted shuffle

    SSL configuration changes

    Configuring the keystore and truststore

    Audit logging in Hadoop

    Summary

    12. Analytics Using Hadoop

    Data analytics workflow

    Machine learning

    Apache Mahout

    Document analysis using Hadoop and Mahout

    Term frequency

    Document frequency

    Term frequency – inverse document frequency

    Tf-Idf in Pig

    Cosine similarity distance measures

    Clustering using k-means

    K-means clustering using Apache Mahout

    RHadoop

    Summary

    A. Hadoop for Microsoft Windows

    Deploying Hadoop on Microsoft Windows

    Prerequisites

    Building Hadoop

    Configuring Hadoop

    Deploying Hadoop

    Summary

    Index

    Mastering Hadoop

    Copyright © 2014 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    First published: December 2014

    Production reference: 1221214

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham B3 2PB, UK.

    ISBN 978-1-78398-364-3

    www.packtpub.com

    Cover image by Poonam Nayak (<pooh.graphics@gmail.com>)

    Credits

    Author

    Sandeep Karanth

    Reviewers

    Shiva Achari

    Pavan Kumar Polineni

    Uchit Vyas

    Yohan Wadia

    Commissioning Editor

    Edward Gordon

    Acquisition Editor

    Rebecca Youé

    Content Development Editor

    Ruchita Bhansali

    Technical Editors

    Bharat Patil

    Rohit Kumar Singh

    Parag Topre

    Copy Editors

    Sayanee Mukherjee

    Vikrant Phadkay

    Project Coordinator

    Kranti Berde

    Proofreaders

    Simran Bhogal

    Maria Gould

    Ameesha Green

    Paul Hindle

    Indexer

    Mariammal Chettiyar

    Graphics

    Abhinash Sahu

    Valentina Dsilva

    Production Coordinator

    Arvindkumar Gupta

    Cover Work

    Arvindkumar Gupta

    About the Author

    Sandeep Karanth is a technical architect who specializes in building and operationalizing software systems. He has more than 14 years of experience in the software industry, working on a gamut of products ranging from enterprise data applications to newer-generation mobile applications. He has primarily worked at Microsoft Corporation in Redmond, Microsoft Research in India, and is currently a cofounder at Scibler, architecting data intelligence products.

    Sandeep has a special interest in data modeling and architecting data applications. In his area of interest, he has successfully built and deployed applications catering to a variety of business use cases, such as vulnerability detection from machine logs, churn analysis from subscription data, and sentiment analysis from chat logs. These applications were built using next-generation Big Data technologies such as Hadoop, Spark, and Microsoft StreamInsight, and deployed on cloud platforms such as Amazon AWS and Microsoft Azure.

    Sandeep is also experienced and interested in areas such as green computing and the emerging Internet of Things. He frequently trains professionals and gives talks on topics such as big data and cloud computing. Sandeep believes in inculcating skill-oriented and industry-related topics in the undergraduate engineering curriculum, and his talks are geared with this in mind. Sandeep has a Master's degree in Computer and Information Sciences from the University of Minnesota, Twin Cities.

    Sandeep's Twitter handle is @karanths. His GitHub profile is https://github.com/Karanth, and he writes technical snippets at https://gist.github.com/Karanth.

    Acknowledgments

    I would like to dedicate this book to my loving daughter, Avani, who has taught me many a lesson in effective time management. I would like to thank my wife and parents for their constant support that has helped me complete this book on time. Packt Publishing have been gracious enough to give me this opportunity, and I would like to thank all individuals who were involved in editing, reviewing, and publishing this book. Questions and feedback from curious audiences at my lectures have driven much of the content of this book. Some of the subtopics are from experiences I gained working on a wide variety of projects throughout my career. I would like to thank my audience and also my employers for indirectly helping me write this book.

    About the Reviewers

    Shiva Achari has over 8 years of extensive industry experience and is currently working as a Big Data architect in Teradata. Over the years, he has architected, designed, and developed multiple innovative and high-performing large-scale solutions such as distributed systems, data center, Big Data management, SaaS cloud applications, Internet applications, and data analytics solutions.

    He is currently writing a book on Hadoop essentials, covering Hadoop, its ecosystem components, and how these components can be leveraged in different phases of the Hadoop project life cycle.

    He has experience in designing Big Data and analytics applications covering ingestion, cleansing, transformation, correlation of different sources, data mining, and user experience, using Hadoop, Cassandra, Solr, Storm, R, and Tableau.

    He specializes in developing solutions for the Big Data domain and has sound hands-on experience with projects involving migration to the Hadoop world, new development, product consulting, and POCs. He also has hands-on expertise in technologies such as Hadoop, YARN, Sqoop, Hive, Pig, Flume, Solr, Lucene, Elasticsearch, ZooKeeper, Storm, Redis, Cassandra, HBase, MongoDB, Talend, R, Mahout, Tableau, Java, and J2EE.

    Shiva has expertise in requirement analysis, estimation, technology evaluation, and system architecture, with domain experience in telecom, Internet applications, document management, healthcare, and media.

    Currently, he supports presales activities such as writing technical proposals (RFPs), providing technical consultation to customers, and managing deliveries of the Big Data practice group at Teradata.

    He is active on LinkedIn at http://in.linkedin.com/in/shivaachari/.

    I would like to thank Packt Publishing for the opportunity to review this book; it was a great experience. I wish the publisher and the author the best of luck with the success of the book.

    Pavan Kumar Polineni is working as an Analytics Manager at Fantain Sports. He has experience in the fields of information retrieval and recommendation engines. He is a Cloudera-certified Hadoop administrator. He is interested in machine learning, data mining, and visualization.

    He has a Bachelor's degree in Computer Science from Koneru Lakshmaiah College of Engineering and is about to complete his Master's degree in Software Systems from BITS, Pilani. He has worked at organizations such as IBM and Ctrls Datacenter. He can be found on Twitter as @polinenipavan.

    Uchit Vyas is an open source specialist and a hands-on DevOps lead at Clogeny Technologies. He is responsible for the delivery of solutions, services, and product development. He explores new enterprise open source technologies and defines architectures, roadmaps, and best practices. He has consulted and provided training on various open source technologies, including cloud computing (AWS Cloud, Rackspace, Azure, CloudStack, OpenStack, and Eucalyptus), Mule ESB, Chef, Puppet, Liferay Portal, Alfresco ECM, and JBoss, to corporations around the world.

    He has a degree in Engineering in Computer Science from Gujarat University. He worked in the education and research team of Infosys Limited as a senior associate, where he worked on SaaS, private clouds, and virtualization; he now works on cloud system automation.

    He has also published a book on Mule ESB and is writing various books on open source technologies and AWS.

    He hosts a blog named Cloud Magic World (cloudbyuchit.blogspot.com), where he posts tips and observations about open source technologies, mostly cloud technologies. He can also be found on Twitter as @uchit_vyas.

    I am thankful to Riddhi Thaker (my colleague) for helping me a lot in reviewing this book.

    Yohan Wadia is a client-focused virtualization and cloud expert with 5 years of experience in the IT industry.

    He has been involved in conceptualizing, designing, and implementing large-scale solutions for a variety of enterprise customers based on VMware vCloud, Amazon Web Services, and Eucalyptus Private Cloud.

    His community-focused involvement enables him to share his passion for virtualization and cloud technologies with peers through social media engagements, public speaking at industry events, and through his personal blog at yoyoclouds.com.

    He is currently working with Virtela Technology Services, an NTT communications company, as a cloud solutions engineer, and is involved in managing the company's in-house cloud platform. He works on various open source and enterprise-level cloud solutions for internal as well as external customers. He is also a VMware Certified Professional and vExpert (2012, 2013).

    www.PacktPub.com

    Support files, eBooks, discount offers, and more

    For support files and downloads related to your book, please visit www.PacktPub.com.

    Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at for more details.

    At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

    https://www2.packtpub.com/books/subscription/packtlib

    Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

    Why subscribe?

    Fully searchable across every book published by Packt

    Copy and paste, print, and bookmark content

    On demand and accessible via a web browser

    Free access for Packt account holders

    If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.

    Preface

    We are in an age where data is the primary driver in decision-making. With storage costs declining, network speeds increasing, and everything around us becoming digital, we do not hesitate a bit to download, store, or share data with others around us. About 20 years back, a camera was a device used to capture pictures on film. Every photograph had to be captured almost perfectly. The storage of film negatives was done carefully lest they get damaged. There was a higher cost associated with taking prints of these photographs. The time between clicking a picture and viewing it was almost a day. This meant that less data was being captured, as these factors deterred people from recording each and every moment of their lives, unless it was very significant.

    However, with cameras becoming digital, this has changed. We do not hesitate to click a photograph of almost anything, anytime. We do not worry about storage, as our external disks of terabyte capacity always provide a reliable backup. We seldom take our cameras anywhere, as we have mobile devices that we can use to take photographs. We have applications such as Instagram that can be used to add effects to our pictures and share them. We gather opinions and information about the pictures we click, and base some of our decisions on them. We capture almost every moment, of great significance or not, and push it into our memory books. The era of Big Data has arrived!

    This era of Big Data has brought similar changes to businesses as well. Almost everything in a business is logged. Every action taken by a user on an e-commerce page is recorded to improve the quality of service, and every item bought by the user is recorded to cross-sell or up-sell other items. Businesses want to understand the DNA of their customers and try to infer it by squeezing out every possible bit of data they can get about these customers. Businesses are not worried about the format of the data. They are ready to accept speech, images, natural language text, or structured data. These data points are used to drive business decisions and personalize experiences for the user. The more the data, the higher the degree of personalization, and the better the experience for the user.

    We saw that we are ready, in some aspects, to take on this Big Data challenge. However, what about the tools used to analyze this data? Can they handle the volume, velocity, and variety of the incoming data? Theoretically, all this data can reside on a single machine, but what is the cost of such a machine? Will it be able to cater to the variations in loads? We know that supercomputers are available, but there are only a handful of them in the world. Supercomputers don't scale. The alternative is to build a team of machines, a cluster of individual computing units that work in tandem to achieve a task. A team of machines is interconnected via a very fast network and provides better scaling and elasticity, but that is not enough. These clusters have to be programmed. A greater number of machines, just like a team of human beings, requires more coordination and synchronization. The higher the number of machines, the greater the possibility of failures in the cluster. How do we handle synchronization and fault tolerance in a simple way, easing the burden on the programmer? The answer is systems such as Hadoop.

    Hadoop is synonymous with Big Data processing. Its simple programming model, "code once and deploy at any scale" paradigm, and ever-growing ecosystem make Hadoop an inclusive platform for programmers with different levels of expertise and breadth of knowledge. Today, it is the most sought-after job skill in the data sciences space. To handle and analyze Big Data, Hadoop has become the go-to tool. Hadoop 2.0 is spreading its wings to cover a variety of application paradigms and solve a wider range of data problems. It is rapidly becoming a general-purpose cluster platform for all data processing needs, and will soon become a mandatory skill for every engineer across verticals.

    This book covers optimizations and advanced features of MapReduce, Pig, and Hive. It also covers Hadoop 2.0 and illustrates how it can be used to extend the capabilities of Hadoop.

    Hadoop, in its 2.0 release, has evolved to become a general-purpose cluster-computing platform. The book will explain the platform-level changes that enable this. Industry guidelines to optimize MapReduce jobs and higher-level abstractions such as Pig and Hive in Hadoop 2.0 are covered. Some advanced job patterns and their applications are also discussed. These topics will empower the Hadoop user to optimize existing jobs and migrate them to Hadoop 2.0. Subsequently, it will dive deeper into Hadoop 2.0-specific features such as YARN (Yet Another Resource Negotiator) and HDFS Federation, along with examples. Replacing HDFS with other filesystems is another topic that will be covered in the latter half of the book. Understanding these topics will enable Hadoop users to extend Hadoop to other application paradigms and data stores, making efficient use of the available cluster resources.

    This book is a guide focusing on advanced concepts and features in Hadoop. Foundations of every concept are explained with code fragments or schematic illustrations. The data processing flow dictates the order of the concepts in each chapter.

    What this book covers

    Chapter 1, Hadoop 2.X, discusses the improvements in Hadoop 2.X in comparison to its predecessor generation.

    Chapter 2, Advanced MapReduce, helps you understand the best practices and patterns for Hadoop MapReduce, with examples.

    Chapter 3, Advanced Pig, discusses the advanced features of Pig, a framework to script MapReduce jobs on Hadoop.

    Chapter 4, Advanced Hive, discusses the advanced features of a higher-level SQL abstraction on Hadoop MapReduce called Hive.

    Chapter 5, Serialization and Hadoop I/O, discusses the IO capabilities in Hadoop. Specifically, this chapter covers the concepts of serialization and deserialization support and their necessity within Hadoop; Avro, an external serialization framework; data compression codecs available within Hadoop; their tradeoffs; and finally, the special file formats in Hadoop.

    Chapter 6, YARN – Bringing Other Paradigms to Hadoop, discusses YARN (Yet Another Resource Negotiator), a new resource manager that has been included in Hadoop 2.X, and how it is generalizing the Hadoop platform to include other computing paradigms.

    Chapter 7, Storm on YARN – Low Latency Processing in Hadoop, discusses the opposite paradigm, that is, moving data to the compute, and compares and contrasts it with batch processing systems such as MapReduce. It also discusses the Apache Storm framework and how to develop applications in Storm. Finally, you will learn how to install Storm on Hadoop 2.X with YARN.

    Chapter 8, Hadoop on the Cloud, discusses the characteristics of cloud computing and Hadoop's Platform as a Service offerings across cloud computing service providers. Further, it delves into Amazon's managed Hadoop service, also known as Elastic MapReduce (EMR), and looks at how to provision and run jobs on a Hadoop EMR cluster.

    Chapter 9, HDFS Replacements, discusses the strengths and drawbacks of HDFS when compared to other file systems. The chapter also draws attention to Hadoop's support for Amazon's S3 cloud storage service. At the end, the chapter illustrates Hadoop HDFS extensibility features by implementing Hadoop's support for S3's native file system to extend Hadoop.

    Chapter 10, HDFS Federation, discusses the advantages of HDFS Federation and its architecture. Block placement strategies, which are central to the success of HDFS in the MapReduce environment, are also discussed in the chapter.

    Chapter 11, Hadoop Security, focuses on the security aspects of a Hadoop cluster. The main pillars of security are authentication, authorization, auditing, and data protection. We will look at Hadoop's features in each of these pillars.

    Chapter 12, Analytics Using Hadoop, discusses higher-level analytic workflows, techniques such as machine learning, and their support in Hadoop. We take document analysis as an example to illustrate analytics using Pig on Hadoop.

    Appendix, Hadoop for Microsoft Windows, explores the Microsoft Windows operating system's native support for Hadoop, which was introduced in Hadoop 2.0. In this chapter, we look at how to build and deploy Hadoop on Microsoft Windows natively.

    What you need for this book

    The following software suites are required to try out the examples in the book:

    Java Development Kit (JDK 1.7 or later): This is free software from Oracle that provides a JRE (Java Runtime Environment) and additional tools for developers. It can be downloaded from http://www.oracle.com/technetwork/java/javase/downloads/index.html.

    The IDE for editing Java code: IntelliJ IDEA is the IDE that has been used to develop the examples. Any other IDE of your choice can also be used. The community edition of the IntelliJ IDE can be downloaded from https://www.jetbrains.com/idea/download/.

    Maven: Maven is a build tool that has been used to build the samples in the book. Maven can be used to automatically pull build dependencies and specify configurations via XML files. The code samples in the chapters can be built into a JAR using two simple Maven commands:

    mvn compile
    mvn assembly:single

    These commands compile the code and create a consolidated JAR file containing the program along with all of its dependencies. It is important to change the mainClass reference in the pom.xml to the driver class name when building the consolidated JAR file.

    Hadoop-related consolidated JAR files can be run using the following command:

    hadoop jar <jar_file> <args>

    This command directly picks the driver program from the mainClass that was specified in the pom.xml. Maven can be downloaded and installed from http://maven.apache.org/download.cgi. The Maven XML template file used to build the samples in this book is as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>
      <groupId>MasteringHadoop</groupId>
      <artifactId>MasteringHadoop</artifactId>
      <version>1.0-SNAPSHOT</version>
      <build>
        <plugins>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.0</version>
            <configuration>
              <source>1.7</source>
              <target>1.7</target>
            </configuration>
          </plugin>
          <plugin>
            <version>3.1</version>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-jar-plugin</artifactId>
            <configuration>
              <archive>
                <manifest>
                  <mainClass>MasteringHadoop.MasteringHadoopTest</mainClass>
                </manifest>
              </archive>
            </configuration>
          </plugin>
          <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
              <archive>
                <manifest>
                  <mainClass>MasteringHadoop.MasteringHadoopTest</mainClass>
                </manifest>
              </archive>
              <descriptorRefs>
                <descriptorRef>jar-with-dependencies</descriptorRef>
              </descriptorRefs>
            </configuration>
          </plugin>
        </plugins>
        <pluginManagement>
          <plugins>
            <plugin>
              <!-- This plugin's configuration is used to store Eclipse m2e settings
                   only. It has no influence on the Maven build itself. -->
              <groupId>org.eclipse.m2e</groupId>
              <artifactId>lifecycle-mapping</artifactId>
              <version>1.0.0</version>
              <configuration>
                <lifecycleMappingMetadata>
                  <pluginExecutions>
                    <pluginExecution>
                      <pluginExecutionFilter>
                        <groupId>org.apache.maven.plugins</groupId>
                        <artifactId>maven-dependency-plugin</artifactId>
                        <versionRange>[2.1,)</versionRange>
                        <goals>
                          <goal>copy-dependencies</goal>
                        </goals>
                      </pluginExecutionFilter>
                      <action>
                        <ignore />
                      </action>
                    </pluginExecution>
                  </pluginExecutions>
                </lifecycleMappingMetadata>
              </configuration>
            </plugin>
          </plugins>
        </pluginManagement>
      </build>
      <dependencies>
        <!-- Chapter-specific Hadoop dependencies (not shown in this preview) are added here. -->
      </dependencies>
    </project>

    Hadoop 2.2.0: Apache Hadoop is required to try out the examples in general. Appendix, Hadoop for Microsoft Windows, has the details on Hadoop's single-node installation on a Microsoft Windows machine. The steps are similar and easier for other operating systems such as Linux or Mac, and they can be found at http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/SingleNodeSetup.html

    Who this book is for

    This book is meant for a gamut of readers. A novice user of Hadoop can use this book to upgrade his skill level in the technology. People with existing experience in Hadoop can enhance their knowledge about Hadoop to solve challenging data processing problems they might be encountering in their profession. People who are using Hadoop, Pig, or Hive at their workplace can use the tips provided in this book to help make their jobs faster and more efficient. A curious Big Data professional can use this book to understand the expanding horizons of Hadoop and how it is broadening its scope by embracing other paradigms, not just MapReduce. Finally, a Hadoop 1.X user can get insights into the repercussions of upgrading to Hadoop 2.X. The book assumes familiarity with Hadoop, but the reader need not be an expert. Access to a Hadoop installation, either in your organization, on the cloud, or on your desktop/notebook is recommended to try some of the concepts.

    Conventions

    In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

    Code words in text are shown as follows: "The FileInputFormat subclass and associated classes are commonly used for jobs taking inputs from HDFS."
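
    As an illustration, the following is a minimal, hypothetical MapReduce driver that wires a FileInputFormat subclass (TextInputFormat) into a job. The class name SampleDriver and the use of the identity mapper and reducer are placeholders chosen for this sketch and are not taken from the book's samples:

    // SampleDriver: a minimal, hypothetical driver illustrating FileInputFormat usage.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SampleDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "sample-job");
            job.setJarByClass(SampleDriver.class);

            // TextInputFormat is a FileInputFormat subclass that reads lines from files in HDFS.
            job.setInputFormatClass(TextInputFormat.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            job.setMapperClass(Mapper.class);     // identity mapper
            job.setReducerClass(Reducer.class);   // identity reducer
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

    Such a driver class would be the mainClass referenced in the pom.xml shown earlier.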
