Optimizing Hadoop for MapReduce
About this ebook
If you are a Hadoop administrator, developer, MapReduce user, or beginner, this book is the best choice available if you wish to optimize your clusters and applications. Having prior knowledge of creating MapReduce applications is not necessary, but will help you better understand the concepts and snippets of MapReduce class template code.
Book preview
Optimizing Hadoop for MapReduce - Khaled Tannir
Table of Contents
Optimizing Hadoop for MapReduce
Credits
About the Author
Acknowledgments
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers and more
Why Subscribe?
Free Access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Errata
Piracy
Questions
1. Understanding Hadoop MapReduce
The MapReduce model
An overview of Hadoop MapReduce
Hadoop MapReduce internals
Factors affecting the performance of MapReduce
Summary
2. An Overview of the Hadoop Parameters
Investigating the Hadoop parameters
The mapred-site.xml configuration file
The CPU-related parameters
The disk I/O related parameters
The memory-related parameters
The network-related parameters
The hdfs-site.xml configuration file
The core-site.xml configuration file
Hadoop MapReduce metrics
Performance monitoring tools
Using Chukwa to monitor Hadoop
Using Ganglia to monitor Hadoop
Using Nagios to monitor Hadoop
Using Apache Ambari to monitor Hadoop
Summary
3. Detecting System Bottlenecks
Performance tuning
Creating a performance baseline
Identifying resource bottlenecks
Identifying RAM bottlenecks
Identifying CPU bottlenecks
Identifying storage bottlenecks
Identifying network bandwidth bottlenecks
Summary
4. Identifying Resource Weaknesses
Identifying cluster weakness
Checking the Hadoop cluster node's health
Checking the input data size
Checking massive I/O and network traffic
Checking for insufficient concurrent tasks
Checking for CPU contention
Sizing your Hadoop cluster
Configuring your cluster correctly
Summary
5. Enhancing Map and Reduce Tasks
Enhancing map tasks
Input data and block size impact
Dealing with small and unsplittable files
Reducing spilled records during the Map phase
Calculating map tasks' throughput
Enhancing reduce tasks
Calculating reduce tasks' throughput
Improving Reduce execution phase
Tuning map and reduce parameters
Summary
6. Optimizing MapReduce Tasks
Using Combiners
Using compression
Using appropriate Writable types
Reusing types smartly
Optimizing mappers and reducers code
Summary
7. Best Practices and Recommendations
Hardware tuning and OS recommendations
The Hadoop cluster checklist
The BIOS tuning checklist
OS configuration recommendations
Hadoop best practices and recommendations
Deploying Hadoop
Hadoop tuning recommendations
Using a MapReduce template class code
Summary
Index
Optimizing Hadoop for MapReduce
Copyright © 2014 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either expressed or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: February 2014
Production Reference: 1140214
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78328-565-5
www.packtpub.com
Cover Image by Khaled Tannir (<contact@khaledtannir.net>)
Credits
Author
Khaled Tannir
Reviewers
Włodzimierz Bzyl
Craig Henderson
Mark Kerzner
Acquisition Editor
Joanne Fitzpatrick
Commissioning Editor
Manasi Pandire
Technical Editors
Mario D'Souza
Rosmy George
Pramod Kumavat
Arwa Manasawala
Adrian Raposo
Copy Editors
Kirti Pai
Laxmi Subramanian
Project Coordinator
Aboli Ambardekar
Proofreaders
Simran Bhogal
Ameesha Green
Indexer
Rekha Nair
Graphics
Yuvraj Mannari
Production Coordinators
Manu Joseph
Alwin Roy
Cover Work
Alwin Roy
About the Author
Khaled Tannir has been working with computers since 1980. He began programming with the legendary Sinclair ZX81 and later with Commodore home computer products (VIC-20, Commodore 64, Commodore 128D, and Amiga 500).
He holds a Bachelor's degree in Electronics and a Master's degree in System Information Architectures, for which he wrote a professional thesis, and he completed his education with a Master of Research degree.
He is a Microsoft Certified Solution Developer (MCSD) and has more than 20 years of technical experience leading the development and implementation of software solutions and giving technical presentations. He now works as an independent IT consultant and has worked as an infrastructure engineer, senior developer, and enterprise/solution architect for many companies in France and Canada.
With significant experience in Microsoft .NET, Microsoft Server Systems, and Oracle Java technologies, he has extensive skills in online/offline application design, system conversions, and multilingual applications for both the Internet and the desktop.
He is always researching new technologies, learning about them, and looking for new adventures in France, North America, and the Middle East. He owns an IT and electronics laboratory with many servers, monitors, open electronic boards such as Arduino, Netduino, Raspberry Pi, and .NET Gadgeteer, and some smartphone devices based on the Windows Phone, Android, and iOS operating systems.
In 2012, he contributed to the EGC 2012 (International Complex Data Mining forum at Bordeaux University, France) and presented, in a workshop session, his work on how to optimize data distribution in a cloud computing environment. This work aims to define an approach to optimize the use of data mining algorithms such as k-means and Apriori in a cloud computing environment.
He is the author of RavenDB 2.x Beginner's Guide, Packt Publishing.
He aims to get a PhD in Cloud Computing and Big Data and wants to learn more and more about these technologies.
He enjoys taking landscape and nighttime photos, travelling, playing video games, creating funny electronic gadgets with Arduino/.NET Gadgeteer, and of course, spending time with his wife and family.
You can reach him at <contact@khaledtannir.net>.
Acknowledgments
All praise is due to Allah, the Lord of the Worlds. First, I must thank Allah for giving me the ability to think and write.
Next, I would like to thank my wife, Laila, for her great support, encouragement, and patience throughout this project. Also, I would like to thank my family in Canada and Lebanon for their support during the writing of this book.
I would like to thank everyone at Packt Publishing for their help and guidance, and for giving me the opportunity to share my experience and knowledge in technology with others in the Hadoop and MapReduce community.
Thank you as well to the technical reviewers, who provided great feedback to ensure that every tiny technical detail was accurate and rich in content.
About the Reviewers
Włodzimierz Bzyl works at the University of Gdańsk, Poland. His current interests include web-related technologies and NoSQL databases. He has a passion for new technologies and introduces his students to them. He enjoys contributing to open source software and spending time trekking in the Tatra mountains.
Craig Henderson graduated in 1995 with a degree in Computing for Real-time Systems and has spent his career working on large-scale data processing and distributed systems. He is the author of an open source C++ MapReduce library for single server application scalability, which is available at https://github.com/cdmh/mapreduce, and he currently researches image and video processing techniques for person identification.
Mark Kerzner holds degrees in Law, Mathematics, and Computer Science. He has been designing software for many years and Hadoop-based systems since 2008. He is the President of SHMsoft, a provider of Hadoop applications for various verticals, a co-founder of Hadoop Illuminated training and consulting, and the co-author of the open source book, Hadoop Illuminated. He has also authored and co-authored other books and patents.
I would like to acknowledge the help of my colleagues, in particular Sujee Maniyam, and last but not least, my multitalented family.
www.PacktPub.com
Support files, eBooks, discount offers and more
You might want to visit www.PacktPub.com for support files and downloads related to your book.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
http://PacktLib.PacktPub.com
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books.
Why Subscribe?
Fully searchable across every book published by Packt
Copy and paste, print and bookmark content
On demand and accessible via web browser
Free Access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply