HDInsight Essentials - Second Edition
Ebook, 371 pages, 2 hours

About this ebook

About This Book
  • Learn how to quickly provision a Hadoop cluster using Windows Azure Cloud Services
  • Build an end-to-end application for a big data problem using open source software
  • Discover more about modern data architecture with this guide, to help you understand the transition from legacy relational Enterprise Data Warehouse
Who This Book Is For

If you want to discover one of the latest tools designed to produce stunning Big Data insights, this book features everything you need to get to grips with your data. Whether you are a data architect, a developer, or a business strategist, HDInsight adds value across development, administration, and reporting.

Language: English
Release date: Jan 27, 2015
ISBN: 9781784396664

    Book preview

    HDInsight Essentials - Second Edition - Rajesh Nadipalli

    Table of Contents

    HDInsight Essentials Second Edition

    Credits

    About the Author

    About the Reviewers

    www.PacktPub.com

    Support files, eBooks, discount offers, and more

    Why subscribe?

    Free access for Packt account holders

    Instant updates on new Packt books

    Preface

    What this book covers

    What you need for this book

    Who this book is for

    Conventions

    Reader feedback

    Customer support

    Downloading the example code

    Errata

    Piracy

    Questions

    1. Hadoop and HDInsight in a Heartbeat

    Data is everywhere

    Business value of big data

    Hadoop concepts

    Brief history of Hadoop

    Core components

    Hadoop cluster layout

    HDFS overview

    Writing a file to HDFS

    Reading a file from HDFS

    HDFS basic commands

    YARN overview

    YARN application life cycle

    YARN workloads

    Hadoop distributions

    HDInsight overview

    HDInsight and Hadoop relationship

    Hadoop on Windows deployment options

    Microsoft Azure HDInsight Service

    HDInsight Emulator

    Hortonworks Data Platform (HDP) for Windows

    Summary

    2. Enterprise Data Lake using HDInsight

    Enterprise Data Warehouse architecture

    Source systems

    Data warehouse

    Storage

    Processing

    User access

    Provisioning and monitoring

    Data governance and security

    Pain points of EDW

    The next generation Hadoop-based Enterprise data architecture

    Source systems

    Data Lake

    Storage

    Processing

    User access

    Provisioning and monitoring

    Data governance, security, and metadata

    Journey to your Data Lake dream

    Ingestion and organization

    Transformation (rules driven)

    Access, analyze, and report

    Tools and technology for Hadoop ecosystem

    Use case powered by Microsoft HDInsight

    Problem statement

    Solution

    Source systems

    Storage

    Processing

    User access

    Benefits

    Summary

    3. HDInsight Service on Azure

    Registering for an Azure account

    Azure storage

    Provisioning an HDInsight cluster

    Cluster topology

    Provisioning using Azure PowerShell

    Creating a storage container

    Provisioning a new HDInsight cluster

    HDInsight management dashboard

    Dashboard

    Monitor

    Configuration

    Exploring clusters using the remote desktop

    Running a sample MapReduce

    Deleting the cluster

    HDInsight Emulator for the development

    Installing HDInsight Emulator

    Installation verification

    Using HDInsight Emulator

    Summary

    4. Administering Your HDInsight Cluster

    Monitoring cluster health

    Name Node status

    The Name Node Overview page

    Datanode Status

    Utilities and logs

    Hadoop Service Availability

    YARN Application Status

    Azure storage management

    Configuring your storage account

    Monitoring your storage account

    Managing access keys

    Deleting your storage account

    Azure PowerShell

    Access Azure Blob storage using Azure PowerShell

    Summary

    5. Ingest and Organize Data Lake

    End-to-end Data Lake solution

    Ingesting to Data Lake using HDFS command

    Connecting to a Hadoop client

    Getting your files on the local storage

    Transferring to HDFS

    Loading data to Azure Blob storage using Azure PowerShell

    Loading files to Data Lake using GUI tools

    Storage access keys

    Storage tools

    CloudXplorer

    Key benefits

    Registering your storage account

    Uploading files to your Blob storage

    Using Sqoop to move data from RDBMS to Data Lake

    Key benefits

    Two modes of using Sqoop

    Using Sqoop to import data (SQL to Hadoop)

    Organizing your Data Lake in HDFS

    Managing file metadata using HCatalog

    Key benefits

    Using HCatalog Command Line to create tables

    Summary

    6. Transform Data in the Data Lake

    Transformation overview

    Tools for transforming data in Data Lake

    HCatalog

    Persisting HCatalog metastore in a SQL database

    Apache Hive

    Hive architecture

    Starting Hive in HDInsight

    Basic Hive commands

    Apache Pig

    Pig architecture

    Starting Pig in HDInsight node

    Basic Pig commands

    Pig or Hive

    MapReduce

    The mapper code

    The reducer code

    The driver code

    Executing MapReduce on HDInsight

    Azure PowerShell for execution of Hadoop jobs

    Transformation for the OTP project

    Cleaning data using Pig

    Executing Pig script

    Registering a refined and aggregate table using Hive

    Executing Hive script

    Reviewing results

    Other tools used for transformation

    Oozie

    Spark

    Summary

    7. Analyze and Report from Data Lake

    Data access overview

    Analysis using Excel and Microsoft Hive ODBC driver

    Prerequisites

    Step 1 – installing the Microsoft Hive ODBC driver

    Step 2 – creating Hive ODBC Data Source

    Step 3 – importing data to Excel

    Analysis using Excel Power Query

    Prerequisites

    Step 1 – installing the Microsoft Power Query for Excel

    Step 2 – importing Azure Blob storage data into Excel

    Step 3 – analyzing data using Excel

    Other BI features in Excel

    PowerPivot

    Power View and Power Map

    Step 1 – importing Azure Blob storage data into Excel

    Step 2 – launch map view

    Step 3 – configure the map

    Power BI Catalog

    Ad hoc analysis using Hive

    Other alternatives for analysis

    RHadoop

    Apache Giraph

    Apache Mahout

    Azure Machine Learning

    Summary

    8. HDInsight 3.1 New Features

    HBase

    HBase positioning in Data Lake and use cases

    Provisioning HDInsight HBase cluster

    Creating a sample HBase schema

    Designing the airline on-time performance table

    Connecting to HBase using the HBase shell

    Creating an HBase table

    Loading data to the HBase table

    Querying data from the HBase table

    HBase additional information

    Storm

    Storm positioning in Data Lake

    Storm key concepts

    Provisioning HDInsight Storm cluster

    Running a sample Storm topology

    Connecting to Storm using Storm shell

    Running the Storm Wordcount topology

    Monitoring status of the Wordcount topology

    Additional information on Storm

    Apache Tez

    Summary

    9. Strategy for a Successful Data Lake Implementation

    Challenges on building a production Data Lake

    The success path for a production Data Lake

    Identifying the big data problem

    Proof of technology for Data Lake

    Form a Data Lake Center of Excellence

    Executive sponsors

    Data Lake consumers

    Development

    Operations and infrastructure

    Architectural considerations

    Extensible and modular

    Metadata-driven solution

    Integration strategy

    Security

    Online resources

    Summary

    Index

    HDInsight Essentials Second Edition

    Copyright © 2015 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    First published: September 2013

    Second edition: January 2015

    Production reference: 1200115

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham B3 2PB, UK.

    ISBN 978-1-78439-942-9

    www.packtpub.com

    Credits

    Author

    Rajesh Nadipalli

    Reviewers

    Simon Elliston Ball

    Anindita Basak

    Rami Vemula

    Commissioning Editor

    Taron Pereira

    Acquisition Editor

    Owen Roberts

    Content Development Editor

    Rohit Kumar Singh

    Technical Editors

    Madhuri Das

    Taabish Khan

    Copy Editor

    Rashmi Sawant

    Project Coordinator

    Mary Alex

    Proofreaders

    Ting Baker

    Ameesha Green

    Indexer

    Rekha Nair

    Production Coordinator

    Melwyn D'sa

    Cover Work

    Melwyn D'sa

    About the Author

    Rajesh Nadipalli currently manages software architecture and delivery of Zaloni's Bedrock Data Management Platform, which enables customers to quickly and easily realize true Hadoop-based Enterprise Data Lakes. Rajesh is also an instructor and a content provider for Hadoop training, including Hadoop development, Hive, Pig, and HBase. In his previous role as a senior solutions architect, he evaluated big data goals for his clients, recommended a target state architecture, and conducted proofs of concept and production implementations. His clients include Verizon, American Express, NetApp, Cisco, EMC, and UnitedHealth Group.

    Prior to Zaloni, Rajesh worked for Cisco Systems for 12 years and held a technical leadership position. His key focus areas have been data management, enterprise architecture, business intelligence, data warehousing, and Extract Transform Load (ETL). He has demonstrated success by delivering scalable data management and BI solutions that empower businesses to make informed decisions.

    Rajesh authored the first version of the book HDInsight Essentials, Packt Publishing, released in September 2013, the first book in print for HDInsight, providing data architects, developers, and managers with an introduction to the new Hadoop distribution from Microsoft.

    He has over 18 years of IT experience. He holds an MBA from North Carolina State University and a BSc degree in Electronics and Electrical from the University of Mumbai, India.

    I would like to thank my family for their unconditional love, support, and patience during the entire process.

    To my friends and coworkers at Zaloni, thank you for inspiring and encouraging me.

    And finally a shout-out to all the folks at Packt Publishing for being really professional.

    About the Reviewers

    Simon Elliston Ball is a solutions engineer at Hortonworks, where he helps a wide range of companies get the best out of Hadoop. Before that, he was the head of big data at Red Gate, creating tools to make HDInsight and Hadoop easier to work with. He has also spoken extensively on big data and NoSQL at conferences around the world.

    Anindita Basak works as a big data cloud consultant and a big data Hadoop trainer and is highly enthusiastic about Microsoft Azure and HDInsight along with the Hadoop open source ecosystem. She works as a specialist for Fortune 500 brands including cloud and big data based companies in the US. She has been playing with Hadoop on Azure since the incubation phase (http://www.hadooponazure.com). Previously, she worked as a module lead for the Alten group and as a senior system analyst at Sonata Software Limited, India, in the Azure Professional Direct Delivery group of Microsoft. She worked as a senior software engineer on the implementation and migration of various enterprise applications on the Azure cloud in the healthcare, retail, and financial domains. She started her journey with Microsoft Azure in the Microsoft Cloud Integration Engineering (CIE) team and worked as a support engineer in Microsoft India (R&D) Pvt. Ltd.

    With more than 6 years of experience in the Microsoft .NET technology stack, she is solely focused on big data cloud and data science. As a Most Valued Blogger, she loves to share her technical experience and expertise through her blog at http://anindita9.wordpress.com and http://anindita9.azurewebsites.net. You can find more about her on her LinkedIn page and you can follow her at @imcuteani on Twitter.

    She recently worked as a technical reviewer for the books HDInsight Essentials and Microsoft Tabular Modeling Cookbook, both by Packt Publishing. She is currently working on Hadoop Essentials, also by Packt Publishing.
