Hadoop Cluster Deployment
About this ebook
Danil Zburivsky
Danil Zburivsky has over 10 years of experience designing and supporting large-scale data infrastructure for enterprises across the globe.
Hadoop Cluster Deployment - Danil Zburivsky
Table of Contents
Hadoop Cluster Deployment
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers and more
Why Subscribe?
Free Access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Errata
Piracy
Questions
1. Setting Up Hadoop Cluster – from Hardware to Distribution
Choosing Hadoop cluster hardware
Choosing the DataNode hardware
Low storage density cluster
High storage density cluster
NameNode and JobTracker hardware configuration
The NameNode hardware
The JobTracker hardware
Gateway and other auxiliary services
Network considerations
Hadoop hardware summary
Hadoop distributions
Hadoop versions
Choosing Hadoop distribution
Cloudera Hadoop distribution
Hortonworks Hadoop distribution
MapR
Choosing OS for the Hadoop cluster
Summary
2. Installing and Configuring Hadoop
Configuring OS for Hadoop cluster
Choosing and setting up the filesystem
Setting up Java Development Kit
Other OS settings
Setting up the CDH repositories
Setting up NameNode
JournalNode, ZooKeeper, and Failover Controller
Hadoop configuration files
NameNode HA configuration
JobTracker configuration
Configuring the job scheduler
JobQueueTaskScheduler
FairScheduler
CapacityTaskScheduler
DataNode configuration
TaskTracker configuration
Advanced Hadoop tuning
hdfs-site.xml
mapred-site.xml
core-site.xml
Summary
3. Configuring the Hadoop Ecosystem
Hosting the Hadoop ecosystem
Sqoop
Installing and configuring Sqoop
Sqoop import example
Sqoop export example
Hive
Hive architecture
Installing Hive Metastore
Installing the Hive client
Installing Hive Server
Impala
Impala architecture
Installing Impala state store
Installing the Impala server
Summary
4. Securing Hadoop Installation
Hadoop security overview
HDFS security
MapReduce security
Hadoop Service Level Authorization
Hadoop and Kerberos
Kerberos overview
Kerberos in Hadoop
Configuring Kerberos clients
Generating Kerberos principals
Enabling Kerberos for HDFS
Enabling Kerberos for MapReduce
Summary
5. Monitoring Hadoop Cluster
Monitoring strategy overview
Hadoop Metrics
JMX Metrics
Monitoring Hadoop with Nagios
Monitoring HDFS
NameNode checks
JournalNode checks
ZooKeeper checks
Monitoring MapReduce
JobTracker checks
Monitoring Hadoop with Ganglia
Summary
6. Deploying Hadoop to the Cloud
Amazon Elastic MapReduce
Installing the EMR command-line interface
Choosing the Hadoop version
Launching the EMR cluster
Temporary EMR clusters
Preparing input and output locations
Using Whirr
Installing and configuring Whirr
Summary
Index
Hadoop Cluster Deployment
Copyright © 2013 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: November 2013
Production Reference: 1181113
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78328-171-8
www.packtpub.com
Cover Image by Prashant Timappa Shetty (<sparkling.spectrum.123@gmail.com>)
Credits
Author
Danil Zburivsky
Reviewers
Skanda Bhargav
Yanick Champoux
Cyril Ganchev
Alan Gardner
Acquisition Editor
Joanne Fitzpatrick
Commissioning Editor
Amit Ghodake
Technical Editors
Venu Manthena
Pramod Kumavat
Project Coordinator
Amey Sawant
Copy Editors
Kirti Pai
Lavina Pereira
Adithi Shetty
Aditya Nair
Proofreader
Linda Morris
Indexer
Monica Ajmera Mehta
Graphics
Ronak Dhruv
Production Coordinator
Manu Joseph
Cover Work
Manu Joseph
About the Author
Danil Zburivsky is a database professional with a focus on open source technologies. Danil started his career as a MySQL database administrator and is currently working as a consultant at Pythian, a global data infrastructure management company. At Pythian, Danil was involved in building a number of Hadoop clusters for customers in the financial, entertainment, and communications sectors.
Danil's other interests include writing fun things in Python, robotics, and machine learning. He is also a regular speaker at various industry events.
I would like to thank my wife for agreeing to sacrifice most of our summer evenings while I was working on the book. I would also like to thank my colleagues from Pythian, especially Alan Gardner, Cyril Ganchev, and Yanick Champoux, who contributed a lot to this project.
About the Reviewers
Skanda Bhargav is an engineering graduate of Visvesvaraya Technological University, Belgaum, Karnataka, India, where he majored in Computer Science and Engineering. He is currently employed with an MNC based in Bangalore. Skanda is a Cloudera Certified Developer in Apache Hadoop. His interests are Big Data and Hadoop.
I would like to thank my family for their immense support and faith in me throughout my learning stage. My friends have brought my confidence to a level that brings out the best in me. I am happy that God has blessed me with such wonderful people around me, without whom this work might not have been as successful as it is today.
Yanick Champoux is currently sailing the Big Data seas as a solutions architect. In his spare time, he hacks Perl, grows orchids, and writes comic books.
Cyril Ganchev is a system administrator, database administrator, and a software developer living in Sofia, Bulgaria. He received a master's degree in Computer Systems and Technologies from the Technical University of Sofia in 2005.
In 2002, he started working as a system administrator in an Internet Café while studying at the Technical University of Sofia. In 2004, he began working as a software developer for the biggest Bulgarian IT company, Information Services Plc. He has been involved in many projects for the Bulgarian government, the Bulgarian National Bank, the National Revenue Agency, and others. He has been involved in several government elections in Bulgaria, writing the code that calculates the results.
Since 2012, he has been working remotely for a Canadian company, Pythian. He started as an Oracle Database Administrator. In 2013, he transitioned to a newly formed team focused on Big Data and NoSQL.
Cyril Ganchev is an Oracle Advanced PL/SQL Developer Certified Professional and Oracle Database 11g Administrator Certified Associate.
I want to thank my parents for always supporting me, in all of my endeavors.
Alan Gardner is a solutions architect and developer specializing in designing Big Data systems. These systems incorporate technologies including Hadoop, Apache Kafka, and Storm, as well as Data Science techniques. Alan enjoys presenting his projects and shares his experience extensively at user groups and conferences. He also plays with functional programming and mobile and web development in his spare time.
Alan is also deeply involved in Ottawa's developer community, consulting with multiple organizations to help non-technical stakeholders organize developer events. With his group, Ottawa Drones, he runs hack days where developers can network, exchange ideas, and build their skills while programming flying robots.
I'd like to thank Paul White, Alex Gorbachev, and Mick Saunders for always helping me keep on the right path throughout different phases of my career, and Jasmin for always supporting me.
www.PacktPub.com
Support files, eBooks, discount offers and more
You might want to visit www.PacktPub.com for support files and downloads related to your book.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
http://PacktLib.PacktPub.com
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books.
Why Subscribe?
Fully searchable across every book published by Packt
Copy and paste, print and bookmark content
On demand and accessible via web browser
Free Access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.
Preface
In the last couple of years, Hadoop has become a standard solution for building data integration platforms. Introducing any new technology into a company's data infrastructure stack requires system engineers and database administrators to quickly learn all the aspects of the new component. Hadoop doesn't make this task any easier because it is not a single software product, but rather a collection of multiple separate open source projects. These projects need to be properly installed and configured in order to make the Hadoop platform robust and reliable.
Many existing Hadoop distributions provide a simplified way to install Hadoop using some kind of graphical interface. This approach dramatically reduces the amount of time required to go from zero to a fully functional Hadoop cluster. It also simplifies managing the cluster configuration. The problem with an automated setup and configuration is that it hides a lot of important details about how Hadoop components work together, such as why some components require other components and which configuration parameters are the most important.
This book provides a guide to installing and configuring all the main Hadoop components manually. Setting up at least one fully operational cluster by yourself will provide very useful insights into how Hadoop operates under the hood and will make it much easier for you to debug any issues that may arise. You can also use this book as a quick reference to the main Hadoop components and configuration options gathered in one place and in a succinct format. While writing this book, I found myself constantly referring to it when working on real production Hadoop clusters, to look up a specific variable or refresh a best practice when it comes to OS configuration. This habit reassured me that such a guide might be useful to other aspiring and experienced Hadoop administrators and developers.
What this book covers
Chapter 1, Setting Up Hadoop Cluster – from Hardware to Distribution, reviews the main Hadoop components and approaches for choosing and sizing cluster hardware. It also touches on the topic of various Hadoop distributions.
Chapter 2, Installing and Configuring Hadoop, provides step-by-step instructions for installing and configuring the main Hadoop components: NameNode (including High Availability), JobTracker, DataNodes, and TaskTrackers.
Chapter 3, Configuring the Hadoop Ecosystem, reviews configuration procedures for Sqoop, Hive, and Impala.
Chapter 4, Securing Hadoop Installation, provides guidelines for securing various Hadoop components. It also provides an overview of configuring Kerberos with Hadoop.
Chapter 5, Monitoring Hadoop Cluster, guides you through getting your cluster ready for production usage.
Chapter 6, Deploying Hadoop to the Cloud,