Hadoop 2.x Administration Cookbook
About this ebook

About This Book
  • Become an expert Hadoop administrator and perform tasks to optimize your Hadoop Cluster
  • Import and export data into Hive and use Oozie to manage workflows
  • Practical recipes will help you plan and secure your Hadoop cluster, and make it highly available
Who This Book Is For

If you are a system administrator with a basic understanding of Hadoop and you want to get into Hadoop administration, this book is for you. It's also ideal if you are a Hadoop administrator who wants a quick reference guide to all the Hadoop administration-related tasks and solutions to commonly occurring problems.

Language: English
Release date: May 26, 2017
ISBN: 9781787126879

    Book preview

    Hadoop 2.x Administration Cookbook - Gurmukh Singh

    Table of Contents

    Hadoop 2.x Administration Cookbook

    Credits

    About the Author

    About the Reviewers

    www.PacktPub.com

    eBooks, discount offers, and more

    Why subscribe?

    Customer Feedback

    Preface

    What this book covers

    What you need for this book

    Who this book is for

    Sections

    Getting ready

    How to do it…

    How it works…

    There's more…

    See also

    Conventions

    Reader feedback

    Customer support

    Downloading the example code

    Downloading the color images of this book

    Errata

    Piracy

    Questions

    1. Hadoop Architecture and Deployment

    Introduction

    Overview of Hadoop Architecture

    Building and compiling Hadoop

    Getting ready

    How to do it...

    How it works...

    Installation methods

    Getting ready

    How to do it...

    How it works...

    Setting up host resolution

    Getting ready

    How to do it...

    How it works...

    Installing a single-node cluster - HDFS components

    Getting ready

    How to do it...

    How it works...

    There's more...

    Setting up ResourceManager and NodeManager

    Installing a single-node cluster - YARN components

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Installing a multi-node cluster

    Getting ready

    How to do it...

    How it works...

    Configuring the Hadoop Gateway node

    Getting ready

    How to do it...

    How it works...

    See also

    Decommissioning nodes

    Getting ready

    How to do it...

    How it works...

    See also

    Adding nodes to the cluster

    Getting ready

    How to do it...

    How it works...

    There's more...

    2. Maintaining Hadoop Cluster – HDFS

    Introduction

    Overview of HDFS

    Configuring HDFS block size

    Getting ready

    How to do it...

    How it works...

    Setting up Namenode metadata location

    Getting ready

    How to do it...

    How it works...

    Loading data in HDFS

    Getting ready

    How to do it...

    How it works...

    Configuring HDFS replication

    Getting ready

    How to do it...

    How it works...

    See also

    HDFS balancer

    Getting ready

    How to do it...

    How it works...

    Quota configuration

    Getting ready

    How to do it...

    How it works...

    HDFS health and FSCK

    Getting ready

    How to do it...

    How it works...

    See also

    Configuring rack awareness

    Getting ready

    How to do it...

    How it works...

    See also

    Recycle or trash bin configuration

    Getting ready

    How to do it...

    How it works...

    There's more...

    Distcp usage

    Getting ready

    How to do it...

    How it works...

    Control block report storm

    Getting ready

    How to do it...

    How it works...

    Configuring Datanode heartbeat

    Getting ready

    How to do it...

    How it works...

    3. Maintaining Hadoop Cluster – YARN and MapReduce

    Introduction

    Running a simple MapReduce program

    Getting ready

    How to do it...

    Hadoop streaming

    Getting ready

    How to do it...

    How it works...

    Configuring YARN history server

    Getting ready

    How to do it...

    How it works...

    There's more...

    Job history web interface and metrics

    Getting ready

    How to do it...

    How it works...

    Configuring ResourceManager components

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    YARN containers and resource allocations

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    ResourceManager Web UI and JMX metrics

    Getting ready

    How to do it...

    How it works...

    Preserving ResourceManager states

    Getting ready

    How to do it...

    How it works...

    There's more...

    4. High Availability

    Introduction

    Namenode HA using shared storage

    Getting ready

    How to do it...

    How it works...

    See also

    ZooKeeper configuration

    Getting ready

    How to do it...

    How it works...

    Namenode HA using Journal node

    Getting ready

    How to do it...

    How it works...

    Resourcemanager HA using ZooKeeper

    Getting ready

    How to do it...

    How it works…

    Rolling upgrade with HA

    Getting ready

    How to do it...

    How it works...

    Configure shared cache manager

    Getting ready

    How to do it...

    There's more...

    See also

    Configure HDFS cache

    Getting ready

    How to do it...

    How it works...

    See also

    HDFS snapshots

    Getting ready

    How to do it...

    How it works...

    Configuring storage based policies

    Getting ready

    How to do it...

    How it works...

    Configuring HA for Edge nodes

    Getting ready

    How to do it...

    How it works...

    5. Schedulers

    Introduction

    Configuring users and groups

    Getting ready

    How to do it...

    How it works...

    See also

    Fair Scheduler configuration

    Getting ready

    How to do it...

    How it works...

    Fair Scheduler pools

    Getting ready

    How to do it...

    How it works...

    Configuring job queues

    Getting ready

    How to do it...

    How it works...

    See also

    Job queue ACLs

    Getting ready

    How to do it...

    How it works...

    See also

    Configuring Capacity Scheduler

    Getting ready

    How to do it...

    How it works...

    See also

    Queuing mappings in Capacity Scheduler

    Getting ready

    How to do it...

    How it works...

    YARN and Mapred commands

    Getting ready

    How to do it...

    How it works...

    YARN label-based scheduling

    Getting ready

    How to do it...

    How it works...

    YARN SLS

    Getting ready

    How to do it...

    How it works...

    6. Backup and Recovery

    Introduction

    Initiating Namenode saveNamespace

    Getting ready

    How to do it...

    How it works...

    Using HDFS Image Viewer

    Getting ready

    How to do it...

    How it works...

    Fetching parameters which are in-effect

    Getting ready

    How to do it...

    How it works...

    Configuring HDFS and YARN logs

    Getting ready

    How to do it...

    How it works...

    See also

    Backing up and recovering Namenode

    Getting ready

    How to do it...

    How it works...

    See also

    Configuring Secondary Namenode

    Getting ready

    How to do it...

    How it works…

    Promoting Secondary Namenode to Primary

    Getting ready

    How to do it...

    How it works...

    See also

    Namenode recovery

    Getting ready

    How to do it...

    How it works...

    Namenode roll edits – online mode

    Getting ready

    How to do it...

    How it works...

    Namenode roll edits – offline mode

    Getting ready

    How to do it...

    How it works...

    Datanode recovery – disk full

    Getting ready

    How to do it...

    How it works...

    Configuring NFS gateway to serve HDFS

    Getting ready

    How to do it...

    How it works...

    Recovering deleted files

    Getting ready

    How to do it...

    How it works...

    7. Data Ingestion and Workflow

    Introduction

    Hive server modes and setup

    Getting ready

    How to do it...

    How it works...

    Using MySQL for Hive metastore

    How to do it…

    How it works...

    Operating Hive with ZooKeeper

    Getting ready

    How to do it...

    How it works...

    Loading data into Hive

    Getting ready

    How to do it...

    How it works...

    See also

    Partitioning and Bucketing in Hive

    Getting ready

    How to do it...

    How it works...

    See also

    Hive metastore database

    Getting ready

    How to do it...

    How it works...

    See also

    Designing Hive with credential store

    Getting ready

    How to do it...

    How it works...

    Configuring Flume

    Getting ready

    How to do it...

    How it works...

    Configure Oozie and workflows

    Getting ready

    How to do it...

    How it works...

    8. Performance Tuning

    Tuning the operating system

    Getting ready

    How to do it...

    How it works...

    See also

    Tuning the disk

    Getting ready

    How to do it...

    How it works...

    Tuning the network

    Getting ready

    How to do it...

    How it works...

    Tuning HDFS

    Getting ready

    How to do it...

    How it works...

    Tuning Namenode

    Getting ready

    How to do it...

    There's more...

    See also

    Tuning Datanode

    Getting ready

    How to do it...

    How it works...

    See also

    Configuring YARN for performance

    Getting ready

    How to do it...

    How it works...

    Configuring MapReduce for performance

    Getting ready

    How to do it...

    How it works...

    Hive performance tuning

    Getting ready

    How to do it...

    There's more...

    How it works...

    Benchmarking Hadoop cluster

    Getting ready

    How to do it...

    Benchmark 1 – Testing HDFS with TestDFSIO

    Benchmark 2 – Stress testing Namenode

    Benchmark 3 – MapReduce testing by generating small files

    Benchmark 4 – TeraGen, TeraSort, and TeraValidate benchmarks

    There's more...

    How it works...

    9. HBase Administration

    Introduction

    Setting up single node HBase cluster

    Getting ready

    How to do it...

    How it works...

    Setting up multi-node HBase cluster

    Getting ready

    How to do it...

    How it works...

    Inserting data into HBase

    Getting ready

    How to do it...

    How it works...

    Integration with Hive

    Getting ready

    How to do it...

    How it works...

    See also

    HBase administration commands

    Getting ready

    How to do it...

    How it works...

    See also

    HBase backup and restore

    Getting ready

    How to do it...

    How it works...

    Tuning HBase

    Getting ready

    How to do it...

    How it works...

    HBase upgrade

    Getting ready

    How to do it...

    How it works...

    Migrating data from MySQL to HBase using Sqoop

    Getting ready

    How to do it...

    10. Cluster Planning

    Introduction

    Disk space calculations

    Getting ready

    How to do it...

    How it works...

    Nodes needed in the cluster

    Getting ready

    How to do it...

    How it works...

    See also

    Memory requirements

    Getting ready

    How to do it...

    How it works...

    See also

    Sizing the cluster as per SLA

    Getting ready

    How to do it...

    How it works...

    See also

    Network design

    Getting ready

    How to do it...

    How it works...

    Estimating the cost of the Hadoop cluster

    How to do it...

    How it works...

    Hardware and software options

    How it works...

    11. Troubleshooting, Diagnostics, and Best Practices

    Introduction

    Namenode troubleshooting

    Getting ready

    How to do it...

    How it works...

    See also

    Datanode troubleshooting

    Getting ready

    How to do it...

    How it works...

    See also

    Resourcemanager troubleshooting

    Getting ready

    How to do it…

    How it works...

    See also

    Diagnose communication issues

    Getting ready

    How to do it...

    How it works...

    Parse logs for errors

    Getting ready

    How to do it...

    How it works...

    Hive troubleshooting

    Getting ready

    How to do it...

    How it works...

    See also

    HBase troubleshooting

    Getting ready

    How to do it...

    How it works...

    Hadoop best practices

    How it works...

    12. Security

    Introduction

    Encrypting disk using LUKS

    Getting ready

    How to do it...

    How it works...

    See also

    Configuring Hadoop users

    Getting ready

    How to do it...

    How it works...

    HDFS encryption at Rest

    Getting ready

    How to do it...

    How it works...

    Configuring SSL in Hadoop

    Getting ready

    How to do it...

    How it works...

    See also

    In-transit encryption

    Getting ready

    How to do it...

    There's more...

    See also

    Enabling service level authorization

    Getting ready

    How to do it...

    How it works...

    See also

    Securing ZooKeeper

    Getting ready

    How to do it...

    How it works...

    Configuring auditing

    Getting ready

    How to do it...

    How it works...

    Configuring Kerberos server

    Getting ready

    How to do it...

    How it works...

    Configuring and enabling Kerberos for Hadoop

    Getting ready

    How to do it...

    How it works...

    Index

    Hadoop 2.x Administration Cookbook

    Copyright © 2017 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    First published: May 2017

    Production reference: 1220517

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham B3 2PB, UK.

    ISBN 978-1-78712-673-2

    www.packtpub.com

    Credits

    Author

    Gurmukh Singh

    Reviewers

    Rajiv Tiwari

    Wissem EL Khlifi

    Commissioning Editor

    Amey Varangaonkar

    Acquisition Editor

    Varsha Shetty

    Content Development Editor

    Deepti Thore

    Technical Editor

    Nilesh Sawakhande

    Copy Editors

    Laxmi Subramanian

    Safis Editing

    Project Coordinator

    Shweta H Birwatkar

    Proofreader

    Safis Editing

    Indexer

    Francy Puthiry

    Graphics

    Tania Dutta

    Production Coordinator

    Nilesh Mohite

    Cover Work

    Nilesh Mohite

    About the Author

    Gurmukh Singh is a seasoned technology professional with 14+ years of industry experience in infrastructure design, distributed systems, performance optimization, and networks. He has worked in the big data domain for the last five years and provides consultancy and training on various technologies.

    He has worked with companies such as HP, JP Morgan, and Yahoo.

    He has authored Monitoring Hadoop, published by Packt Publishing (https://www.packtpub.com/big-data-and-business-intelligence/monitoring-hadoop).

    I would like to thank my wife, Navdeep Kaur, and my lovely daughter, Amanat Dhillon, who have always supported me throughout the journey of this book.

    About the Reviewers

    Rajiv Tiwari is a freelance big data and cloud architect with over 17 years of experience across big data, analytics, and cloud computing for banks and other financial organizations. He is an electronics engineering graduate from IIT Varanasi, and has been working in England for the past 13 years, mostly in the financial city of London. Rajiv can be contacted on Twitter at @bigdataoncloud.

    He is the author of the book Hadoop for Finance, a book dedicated to using Hadoop in banking and financial services.

    I would like to thank my wife, Seema, and my son, Rivaan, for allowing me to spend their quota of time on reviewing this book.

    Wissem El Khlifi is the first Oracle ACE in Spain and an Oracle Certified Professional DBA with over 12 years of IT experience.

    He earned a Computer Science Engineer degree from FST Tunisia, a Master's in Computer Science from UPC Barcelona, and a Master's in Big Data Science from UPC Barcelona.

    His areas of interest include Cloud Architecture, Big Data Architecture, and Big Data Management and Analysis.

    His career has included the roles of Java analyst/programmer, Oracle Senior DBA, and big data scientist. He currently works as a Senior Big Data and Cloud Architect for Schneider Electric / APC.

    He writes numerous articles on his website, http://www.oracle-class.com, and is available on Twitter at @orawiss.

    www.PacktPub.com

    eBooks, discount offers, and more

    Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us for more details.

    At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

    https://www.packtpub.com/mapt

    Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

    Why subscribe?

    Fully searchable across every book published by Packt

    Copy and paste, print, and bookmark content

    On demand and accessible via a web browser

    Customer Feedback

    Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1787126730.

    If you'd like to join our team of regular reviewers, you can e-mail us at customerreviews@packtpub.com. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

    Preface

    Hadoop is a distributed system with a large ecosystem that is growing at an exponential rate, so it is important to get a grip on things and take a deep dive into the workings of a Hadoop cluster in production. Whether you are new to Hadoop or a seasoned Hadoop specialist, this recipe book contains recipes to deep dive into Hadoop cluster configuration and optimization.

    What this book covers

    Chapter 1, Hadoop Architecture and Deployment, covers Hadoop's architecture, its components, the various installation modes, and the important daemons and services that make Hadoop a robust system. It walks through both single-node and multi-node clusters.
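
    As a quick taste of the recipes in that chapter, here is a minimal sketch of bringing up HDFS on a single node; it assumes Hadoop is already installed and its bin and sbin directories are on the PATH:

    # Format the Namenode metadata directory (first run only)
    $ hdfs namenode -format
    # Start the HDFS daemons and confirm they are running
    $ start-dfs.sh
    $ jps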

    Chapter 2, Maintaining Hadoop Cluster – HDFS, covers the storage layer, HDFS: block size, replication, cluster health, quota configuration, rack awareness, and the communication channel between nodes.
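
    For illustration only, a few of the HDFS maintenance commands of the kind used there, sketched with placeholder paths and values:

    # Change the replication factor of a path and wait for it to take effect
    $ hdfs dfs -setrep -w 2 /user/hadoop/data
    # Write a file with a non-default block size (value in bytes; 256 MB here)
    $ hdfs dfs -D dfs.blocksize=268435456 -put largefile /user/hadoop/data/
    # Check the overall health of the filesystem
    $ hdfs fsck /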

    Chapter 3, Maintaining Hadoop Cluster – YARN and MapReduce, talks about the processing layer in Hadoop and the resource management framework, YARN. This chapter covers YARN fundamentals and shows how to configure YARN components, submit jobs, and set up the job history server.
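
    As a hedged example of the kind of job submission covered there, the stock wordcount example that ships with Hadoop can be run as follows (the jar version differs per installation, and the input/output paths are placeholders):

    $ yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /output
    # List the applications known to the ResourceManager
    $ yarn application -list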

    Chapter 4, High Availability, covers high availability for the Namenode and ResourceManager, ZooKeeper configuration, HDFS storage-based policies, HDFS snapshots, and rolling upgrades.
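
    A brief sketch of checking and switching Namenode HA state, assuming Namenode IDs nn1 and nn2 are configured in hdfs-site.xml:

    # Query the state of each Namenode in the HA pair
    $ hdfs haadmin -getServiceState nn1
    $ hdfs haadmin -getServiceState nn2
    # Trigger a manual failover from nn1 to nn2
    $ hdfs haadmin -failover nn1 nn2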

    Chapter 5, Schedulers, talks about YARN schedulers such as the Fair and Capacity Schedulers, with detailed recipes on configuring queues, queue ACLs, users and groups, and other queue administration commands.
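
    For example, after editing the scheduler configuration, the queue definitions can be reloaded without restarting the ResourceManager; a minimal sketch:

    # Reload queue definitions from capacity-scheduler.xml or fair-scheduler.xml
    $ yarn rmadmin -refreshQueues
    # Inspect the configured queues and their scheduling information
    $ mapred queue -list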

    Chapter 6, Backup and Recovery, covers the Hadoop metastore, backup and restore procedures for a Namenode, configuration of a Secondary Namenode, and various ways of recovering a lost Namenode. This chapter also talks about configuring HDFS and YARN logs for troubleshooting.
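
    A minimal sketch of a manual Namenode metadata checkpoint and backup, with a placeholder backup path:

    # Enter safe mode, save the namespace, and leave safe mode
    $ hdfs dfsadmin -safemode enter
    $ hdfs dfsadmin -saveNamespace
    $ hdfs dfsadmin -safemode leave
    # Pull a copy of the latest fsimage to a backup location
    $ hdfs dfsadmin -fetchImage /backup/namenode/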

    Chapter 7, Data Ingestion and Workflow, talks about Hive configuration and its various modes of operation. This chapter also covers setting up Hive with a credential store and highly available access using ZooKeeper. The recipes detail loading data into Hive, partitioning and bucketing concepts, and configuration with an external metastore, and also cover Oozie installation and Flume configuration for log ingestion.
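
    As an illustrative sketch only, with the table, file, and agent names as placeholders, loading data into Hive and starting a Flume agent look like this:

    # Load a file already in HDFS into an existing Hive table
    $ hive -e "LOAD DATA INPATH '/data/raw/events.csv' INTO TABLE events;"
    # Start a Flume agent from a local configuration file
    $ flume-ng agent -n agent1 -c conf -f /etc/flume/conf/agent1.conf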

    Chapter 8, Performance Tuning, covers the performance tuning aspects of HDFS, YARN containers, the operating system, and network parameters, as well as optimizing the cluster for production by comparing benchmarks for various configurations.
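
    For instance, the TestDFSIO write benchmark discussed there can be launched as follows (the jar version differs per installation):

    # Write ten 1 GB files to HDFS and record the throughput
    $ yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 1GB
    # Remove the benchmark data afterwards
    $ yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO -clean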

    Chapter 9, HBase Administration, talks about HBase cluster configuration, best practices, HBase tuning, backup, and restore. It also covers migration of data from MySQL to HBase and the procedure for upgrading HBase to the latest release.
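
    A hedged sketch of the MySQL-to-HBase migration with Sqoop; the connection string, table, column family, and row key are placeholders:

    $ sqoop import --connect jdbc:mysql://dbhost/salesdb --username dbuser -P \
          --table customers --hbase-table customers --column-family cf --hbase-row-key id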

    Chapter 10, Cluster Planning, covers Hadoop cluster planning and best practices for designing clusters in terms of disk storage, network, servers, and placement policy. This chapter also covers costing and the impact of SLA-driven workloads on cluster planning.
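
    As a rough worked example with assumed numbers: 100 TB of data at a replication factor of 3, keeping about 25% of raw capacity free for intermediate data and growth, needs roughly 400 TB of raw disk across the cluster:

    $ echo $(( 100 * 3 * 100 / 75 ))    # 400 (TB of raw capacity)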

    Chapter 11, Troubleshooting, Diagnostics, and Best Practices, talks about troubleshooting steps for the Namenode and Datanode and about diagnosing communication errors. It also covers the daemon logs and how to parse them for errors to extract the key information about the issues faced.
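
    For example, a first pass over the daemon logs for errors can be as simple as the following; the log directory varies by installation:

    $ grep -iE "error|exception" /var/log/hadoop/hdfs/*.log | tail -n 20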

    Chapter 12, Security, covers Hadoop security in terms of data encryption at rest, in-transit encryption, SSL configuration, and, most importantly, configuring Kerberos for the Hadoop cluster. This chapter also covers auditing and a recipe on securing ZooKeeper.
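
    A minimal sketch of the encryption-at-rest workflow on a Kerberized cluster; the keytab, principal, key name, and zone path are placeholders:

    # Obtain a Kerberos ticket for the hdfs service principal
    $ kinit -kt /etc/security/keytabs/hdfs.keytab hdfs/node1.cluster.com@EXAMPLE.COM
    # Create a key in the KMS and an encryption zone backed by it
    $ hadoop key create projectkey
    $ hdfs crypto -createZone -keyName projectkey -path /secure/project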

    What you need for this book

    To go through the recipes in this book, users need a Linux distribution, which could be Ubuntu, CentOS, or any other flavor, as long as it supports running a JVM. We use CentOS in our recipes, as it is the most commonly used operating system for Hadoop clusters.

    Hadoop runs on both virtualized and physical servers, so it is recommended to have at least 8 GB of RAM for the base system, on which about three virtual hosts can be set up. Users do not need to set up everything covered in this book at once; they can run only the daemons needed for a particular recipe and keep the resource requirements to a bare minimum. It is good to have at least four hosts, virtual or physical, to practice all the recipes in this book.

    In terms of software, users need at least JDK 1.7 and an SSH client, such as PuTTY on Windows or Terminal on Linux or macOS, to connect to the Hadoop nodes.
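
    A couple of quick checks before starting, with a placeholder hostname:

    # Confirm the JDK version and passwordless SSH to a cluster node
    $ java -version
    $ ssh hadoop@node1.cluster.com hostname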

    Who this book is for

    If you are a system administrator with a basic understanding of Hadoop and you want to get into Hadoop administration, this book is for you. It's also ideal if you are a Hadoop administrator who wants a quick reference guide to all the Hadoop administration-related tasks and solutions to commonly occurring problems.

    Sections

    In this book, you will find several headings that appear frequently (Getting ready, How to do it, How it works, There's more, and See also).

    To give clear instructions on how to complete a recipe, we use these sections as follows:

    Getting ready

    This section tells you what to expect in the recipe, and describes how to set up any software or any preliminary settings required for the recipe.

    How to do it…

    This section contains the steps required to follow the recipe.

    How it works…

    This section usually consists of a detailed explanation of what happened in the previous section.

    There's more…

    This section consists of additional information about the recipe, intended to make the reader more knowledgeable about it.

    See also

    This section provides helpful links to other useful information for the recipe.

    Conventions

    In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

    Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "You will see a tarball under the hadoop-2.7.3-src/hadoop-dist/target/ folder."

    A block of code is set as follows:

    <property>
        <name>dfs.hosts.exclude</name>
        <value>/home/hadoop/excludes</value>
        <final>true</final>
    </property>

    Any command-line input or output is written as follows:

    $ stop-yarn.sh

    Note

    Warnings or important notes appear in a box like this.

    Tip

    Tips and tricks appear like this.

    Reader feedback

    Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

    To send us general feedback, simply e-mail <feedback@packtpub.com>, and mention the book's title in the subject of your message.

    If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

    Customer support

    Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
