Hadoop Cluster Deployment
Ebook, 265 pages, 2 hours


About this ebook

This book is a step-by-step tutorial filled with practical examples that show you how to build and manage a Hadoop cluster, along with its intricacies. This book is ideal for database administrators, data engineers, and system administrators, and it will act as an invaluable reference if you are planning to use the Hadoop platform in your organization. You are expected to have basic Linux skills, since all the examples in this book use this operating system. It is also useful to have access to test hardware or virtual machines so that you can follow the examples in the book.
Language: English
Release date: Nov 25, 2013
ISBN: 9781783281725
Author

Danil Zburivsky

Danil Zburivsky has over 10 years of experience designing and supporting large-scale data infrastructure for enterprises across the globe.


    Book preview

    Hadoop Cluster Deployment - Danil Zburivsky

    Table of Contents

    Hadoop Cluster Deployment

    Credits

    About the Author

    About the Reviewers

    www.PacktPub.com

    Support files, eBooks, discount offers and more

    Why Subscribe?

    Free Access for Packt account holders

    Preface

    What this book covers

    What you need for this book

    Who this book is for

    Conventions

    Reader feedback

    Customer support

    Errata

    Piracy

    Questions

    1. Setting Up Hadoop Cluster – from Hardware to Distribution

    Choosing Hadoop cluster hardware

    Choosing the DataNode hardware

    Low storage density cluster

    High storage density cluster

    NameNode and JobTracker hardware configuration

    The NameNode hardware

    The JobTracker hardware

    Gateway and other auxiliary services

    Network considerations

    Hadoop hardware summary

    Hadoop distributions

    Hadoop versions

    Choosing Hadoop distribution

    Cloudera Hadoop distribution

    Hortonworks Hadoop distribution

    MapR

    Choosing OS for the Hadoop cluster

    Summary

    2. Installing and Configuring Hadoop

    Configuring OS for Hadoop cluster

    Choosing and setting up the filesystem

    Setting up Java Development Kit

    Other OS settings

    Setting up the CDH repositories

    Setting up NameNode

    JournalNode, ZooKeeper, and Failover Controller

    Hadoop configuration files

    NameNode HA configuration

    JobTracker configuration

    Configuring the job scheduler

    JobQueueTaskScheduler

    FairScheduler

    CapacityTaskScheduler

    DataNode configuration

    TaskTracker configuration

    Advanced Hadoop tuning

    hdfs-site.xml

    mapred-site.xml

    core-site.xml

    Summary

    3. Configuring the Hadoop Ecosystem

    Hosting the Hadoop ecosystem

    Sqoop

    Installing and configuring Sqoop

    Sqoop import example

    Sqoop export example

    Hive

    Hive architecture

    Installing Hive Metastore

    Installing the Hive client

    Installing Hive Server

    Impala

    Impala architecture

    Installing Impala state store

    Installing the Impala server

    Summary

    4. Securing Hadoop Installation

    Hadoop security overview

    HDFS security

    MapReduce security

    Hadoop Service Level Authorization

    Hadoop and Kerberos

    Kerberos overview

    Kerberos in Hadoop

    Configuring Kerberos clients

    Generating Kerberos principals

    Enabling Kerberos for HDFS

    Enabling Kerberos for MapReduce

    Summary

    5. Monitoring Hadoop Cluster

    Monitoring strategy overview

    Hadoop Metrics

    JMX Metrics

    Monitoring Hadoop with Nagios

    Monitoring HDFS

    NameNode checks

    JournalNode checks

    ZooKeeper checks

    Monitoring MapReduce

    JobTracker checks

    Monitoring Hadoop with Ganglia

    Summary

    6. Deploying Hadoop to the Cloud

    Amazon Elastic MapReduce

    Installing the EMR command-line interface

    Choosing the Hadoop version

    Launching the EMR cluster

    Temporary EMR clusters

    Preparing input and output locations

    Using Whirr

    Installing and configuring Whirr

    Summary

    Index

    Hadoop Cluster Deployment

    Copyright © 2013 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    First published: November 2013

    Production Reference: 1181113

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham B3 2PB, UK.

    ISBN 978-1-78328-171-8

    www.packtpub.com

    Cover Image by Prashant Timappa Shetty (<sparkling.spectrum.123@gmail.com>)

    Credits

    Author

    Danil Zburivsky

    Reviewers

    Skanda Bhargav

    Yanick Champoux

    Cyril Ganchev

    Alan Gardner

    Acquisition Editor

    Joanne Fitzpatrick

    Commissioning Editor

    Amit Ghodake

    Technical Editors

    Venu Manthena

    Pramod Kumavat

    Project Coordinator

    Amey Sawant

    Copy Editors

    Kirti Pai

    Lavina Pereira

    Adithi Shetty

    Aditya Nair

    Proofreader

    Linda Morris

    Indexer

    Monica Ajmera Mehta

    Graphics

    Ronak Dhruv

    Production Coordinator

    Manu Joseph

    Cover Work

    Manu Joseph

    About the Author

    Danil Zburivsky is a database professional with a focus on open source technologies. Danil started his career as a MySQL database administrator and is currently working as a consultant at Pythian, a global data infrastructure management company. At Pythian, Danil was involved in building a number of Hadoop clusters for customers in the financial, entertainment, and communication sectors.

    Danil's other interests include writing fun things in Python, robotics, and machine learning. He is also a regular speaker at various industrial events.

    I would like to thank my wife for agreeing to sacrifice most of our summer evenings while I was working on the book. I would also like to thank my colleagues from Pythian, especially Alan Gardner, Cyril Ganchev, and Yanick Champoux, who contributed a lot to this project.

    About the Reviewers

    Skanda Bhargav is an Engineering graduate from Visvesvaraya Technological University, Belgaum, Karnataka, India. He did his majors in Computer Science and Engineering. He is currently employed with an MNC based out of Bangalore. Skanda is a Cloudera Certified developer in Apache Hadoop. His interests are Big Data and Hadoop.

    I would like to thank my family for their immense support and faith in me throughout my learning stage. My friends have brought my confidence to a level that brings out the best in me. I am happy that God has blessed me with such wonderful people around me, without whom this work might not have been as successful as it is today.

    Yanick Champoux is currently sailing the Big Data seas as a solutions architect. In his spare time, he hacks Perl, grows orchids, and writes comic books.

    Cyril Ganchev is a system administrator, database administrator, and a software developer living in Sofia, Bulgaria. He received a master's degree in Computer Systems and Technologies from the Technical University of Sofia in 2005.

    In 2002, he started working as a system administrator in an Internet Café while studying at the Technical University of Sofia. In 2004, he began working as a software developer for the biggest Bulgarian IT company, Information Services Plc. He has been involved in many projects for the Bulgarian government, the Bulgarian National Bank, the National Revenue Agency, and others. He has been involved in several government elections in Bulgaria, writing the code that calculates the results.

    Since 2012, he has been working remotely for a Canadian company, Pythian. He started as an Oracle Database Administrator. In 2013, he transitioned to a newly formed team focused on Big Data and NoSQL.

    Cyril Ganchev is an Oracle Advanced PL/SQL Developer Certified Professional and Oracle Database 11g Administrator Certified Associate.

    I want to thank my parents for always supporting me in all of my endeavors.

    Alan Gardner is a solutions architect and developer specializing in designing Big Data systems. These systems incorporate technologies including Hadoop, Apache Kafka, and Storm, as well as Data Science techniques. Alan enjoys presenting his projects and shares his experience extensively at user groups and conferences. He also plays with functional programming and mobile and web development in his spare time.

    Alan is also deeply involved in Ottawa's developer community, consulting with multiple organizations to help non-technical stakeholders organize developer events. With his group, Ottawa Drones, he runs hack days where developers can network, exchange ideas, and build their skills while programming flying robots.

    I'd like to thank Paul White, Alex Gorbachev, and Mick Saunders for always helping me keep on the right path throughout different phases of my career, and Jasmin for always supporting me.

    www.PacktPub.com

    Support files, eBooks, discount offers and more

    You might want to visit www.PacktPub.com for support files and downloads related to your book.

    Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at for more details.

    At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

    http://PacktLib.PacktPub.com

    Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books. 

    Why Subscribe?

    Fully searchable across every book published by Packt

    Copy and paste, print and bookmark content

    On demand and accessible via web browser

    Free Access for Packt account holders

    If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.

    Preface

    In the last couple of years, Hadoop has become a standard solution for building data integration platforms. Introducing any new technology into a company's data infrastructure stack requires system engineers and database administrators to quickly learn all the aspects of the new component. Hadoop doesn't make this task any easier because it is not a single software product, but rather a collection of multiple separate open source projects. These projects need to be properly installed and configured in order to make the Hadoop platform robust and reliable.

    Many existing Hadoop distributions provide a simplified way to install Hadoop using some kind of graphical interface. This approach dramatically reduces the amount of time required to go from zero to a fully functional Hadoop cluster. It also simplifies managing the cluster configuration. The problem with an automated setup and configuration is that it hides many important aspects of how Hadoop components work together, such as why some components require other components and which configuration parameters are the most important.

    This book provides a guide to installing and configuring all the main Hadoop components manually. Setting up at least one fully operational cluster by yourself will provide very useful insights into how Hadoop operates under the hood and will make it much easier for you to debug any issues that may arise. You can also use this book as a quick reference to the main Hadoop components and configuration options gathered in one place and in a succinct format. While writing this book, I found myself constantly referring to it when working on real production Hadoop clusters, to look up a specific variable or refresh a best practice when it comes to OS configuration. This habit reassured me that such a guide might be useful to other aspiring and experienced Hadoop administrators and developers.

    What this book covers

    Chapter 1, Setting Up Hadoop Cluster – from Hardware to Distribution, reviews the main Hadoop components and approaches for choosing and sizing cluster hardware. It also touches on the topic of various Hadoop distributions.

    Chapter 2, Installing and Configuring Hadoop, provides step-by-step instructions for installing and configuring the main Hadoop components: NameNode (including High Availability), JobTracker, DataNodes, and TaskTrackers.

    Chapter 3, Configuring the Hadoop Ecosystem, reviews configuration procedures for Sqoop, Hive, and Impala.

    Chapter 4, Securing Hadoop Installation, provides guidelines for securing various Hadoop components. It also provides an overview of configuring Kerberos with Hadoop.

    Chapter 5, Monitoring Hadoop Cluster, guides you through getting your cluster ready for production usage.

    Chapter 6, Deploying Hadoop to the Cloud,
