Pro Apache Hadoop

About this ebook

Pro Apache Hadoop, Second Edition brings you up to speed on Hadoop – the framework of big data. Revised to cover Hadoop 2.0, the book covers the very latest developments such as YARN (aka MapReduce 2.0), new HDFS high-availability features, and increased scalability in the form of HDFS Federations. All the old content has been revised too, giving the latest on the ins and outs of MapReduce, cluster design, the Hadoop Distributed File System, and more.

This book covers everything you need to build your first Hadoop cluster and begin analyzing and deriving value from your business and scientific data. Learn to solve big-data problems the MapReduce way, by breaking a big problem into chunks and creating small-scale solutions that can be flung across thousands upon thousands of nodes to analyze large data volumes in a short amount of wall-clock time. Learn how to let Hadoop take care of distributing and parallelizing your software—you just focus on the code; Hadoop takes care of the rest.

  • Covers all that is new in Hadoop 2.0
  • Written by a professional involved in Hadoop since day one
  • Takes you quickly to the seasoned pro level on the hottest cloud-computing framework  
Language: English
Publisher: Apress
Release date: September 18, 2014
ISBN: 9781430248644

    © Sameer Wadkar and Madhu Siddalingaiah 2014

    Sameer Wadkar and Madhu Siddalingaiah, Pro Apache Hadoop, DOI 10.1007/978-1-4302-4864-4_1

    1. Motivation for Big Data

    Sameer Wadkar¹ and Madhu Siddalingaiah¹

    (1) MD, US

    The computing revolution that began more than 2 decades ago has led to large amounts of digital data being amassed by corporations. Advances in digital sensors; proliferation of communication systems, especially mobile platforms and devices; massive scale logging of system events; and rapid movement toward paperless organizations have led to a massive collection of data resources within organizations. And the increasing dependence of businesses on technology ensures that the data will continue to grow at an even faster rate.

    Moore’s Law, which says that the performance of computers has historically doubled approximately every 2 years, initially helped computing resources to keep pace with data growth. However, this pace of improvement in computing resources started tapering off around 2005.

    The computing industry started looking at other options, namely parallel processing, to provide a more economical solution. If one computer could not get faster, the goal was to use many computing resources to tackle the same problem in parallel. Hadoop is an implementation of this idea: multiple computers in a network apply MapReduce (a variation of the single instruction, multiple data [SIMD] class of computing techniques) to scale data processing.

    The evolution of cloud-based computing through vendors such as Amazon, Google, and Microsoft provided a boost to this concept because we can now rent computing resources for a fraction of the cost it takes to buy them.

    This book is designed to be a practical guide to developing and running software using Hadoop, a project hosted by the Apache Software Foundation and now extended and supported by various vendors such as Cloudera, MapR, and Hortonworks. This chapter will discuss the motivation for Big Data in general and Hadoop in particular.

    What Is Big Data?

    In the context of this book, one useful definition of Big Data is any dataset that cannot be processed or (in some cases) stored using the resources of a single machine to meet the required service level agreements (SLAs). The latter part of this definition is crucial. It is possible to process virtually any scale of data on a single machine. Even data that cannot be stored on a single machine can be brought into one machine by reading it from a shared storage such as a network attached storage (NAS) medium. However, the amount of time it would take to process this data would be prohibitively large with respect to the available time to process this data.

    Consider a simple example. Suppose the average job processed by a business unit involves 200 GB of data, and assume that we can read about 50 MB per second from disk. At 50 MB per second, we need 2 seconds to read 100 MB of data sequentially, and approximately 1 hour to read the entire 200 GB. Now imagine that this data must be processed in under 5 minutes. If the 200 GB required per job could be evenly distributed across 100 nodes, and each node could process its own data (consider a simplified use-case such as selecting a subset of data based on a simple criterion: SALES_YEAR>2001), then, discounting the time taken for CPU processing and for assembling the results from the 100 nodes, the total processing can be completed in under 1 minute.
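
    The rough arithmetic behind these figures, using the assumed 50 MB/s sequential read rate (the numbers are illustrative, not benchmarks):

    200 GB on one reader: 200,000 MB ÷ 50 MB/s = 4,000 seconds ≈ 67 minutes
    200 GB across 100 nodes: 2 GB per node; 2,000 MB ÷ 50 MB/s = 40 seconds, read in parallel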

    This simplistic example shows that Big Data is context-sensitive and that the context is provided by business need.

    Note

    Dr. Jeff Dean discusses parallelism and typical latency numbers in a keynote you can find at www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf . To read 1 MB of data sequentially from a local disk requires 20 million nanoseconds. Reading the same data over a 1 Gbps network requires about 250 million nanoseconds (assuming that each 2 KB needs 250,000 nanoseconds to transfer and 500,000 nanoseconds per round-trip). Although the link is a bit dated, and the numbers have changed since then, we will use these numbers in this chapter for illustration. The proportions of the numbers with respect to each other, however, have not changed much.
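
    As a back-of-the-envelope reconstruction of where the quoted network figure comes from (one way to reproduce it from the stated per-2 KB assumptions, not an additional measurement):

    1 MB sequentially from local disk ≈ 20,000,000 ns = 20 ms
    1 MB over the network ≈ 512 chunks of 2 KB × ~500,000 ns per round-trip ≈ 256,000,000 ns ≈ 250 ms

    Under these assumptions, pulling data over the network is roughly an order of magnitude slower than a sequential local-disk read, which is the proportion the rest of the chapter relies on.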

    Key Idea Behind Big Data Techniques

    Although we have made many assumptions in the preceding example, the key takeaway is that although we can process data very fast, there are significant limitations on how fast we can read it from persistent storage. Sending data across the network is even slower than reading from or writing to node-local persistent storage.

    Some of the common characteristics of all Big Data methods are the following:

    Data is distributed across several nodes (Network I/O speed << Local Disk I/O Speed).

    Applications are distributed to data (nodes in the cluster) instead of the other way around.

    As much as possible, data is processed local to the node (Network I/O speed << Local Disk I/O Speed).

    Random disk I/O is replaced by sequential disk I/O (for small random reads, seek time dominates transfer time).

    The purpose of all Big Data paradigms is to parallelize input/output (I/O) to achieve performance improvements.

    Data Is Distributed Across Several Nodes

    By definition, Big Data is data that cannot be processed using the resources of a single machine. One of the selling points of Big Data is the use of commodity machines. A typical commodity machine would have a 2–4 TB disk. Because Big Data refers to datasets much larger than that, the data would be distributed across several nodes.

    Note that a dataset does not need to reach tens of terabytes before it makes sense to distribute it across several nodes. You will see that Big Data systems typically process data in place on the node. Because a large number of nodes participate in data processing, it is essential to distribute data across these nodes. Thus, even a 500 GB dataset would be distributed across multiple nodes, even though a single machine in the cluster would be capable of storing it. The purpose of this data distribution is twofold:

    Each data block is replicated across more than one node (the default Hadoop replication factor is 3). This makes the system resilient to failure. If one node fails, other nodes have a copy of the data hosted on the failed node.

    For parallel-processing reasons, several nodes participate in the data processing. Thus, 50 GB of data spread across 10 nodes enables all 10 nodes to process their own subdatasets, achieving a 5–10 times improvement in performance. The reader may well ask why all the data is not kept on a network file system (NFS), from which each node could read its portion. The answer is that reading from a local disk is significantly faster than reading over the network. Big Data systems make this local computation possible because the application libraries are copied to each data node before a job (an application instance) is started. We discuss this in the next section.
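
    As a rough illustration of the 5–10 times figure, assuming the same 50 MB/s sequential read rate used earlier and ignoring coordination overhead:

    50 GB on a single node: 50,000 MB ÷ 50 MB/s = 1,000 seconds
    50 GB across 10 nodes: 5 GB per node; 5,000 MB ÷ 50 MB/s = 100 seconds in parallel

    The ideal gain is 10 times; scheduling, replication traffic, and assembling the results typically pull the realized gain down toward the lower end of the 5–10 times range.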

    Applications Are Moved to the Data

    For those of us who rode the J2EE wave, the three-tier architecture was drilled into us. In the three-tier programming model, the data is processed in the centralized application tier after being brought into it over the network. We are used to the notion of data being distributed but the application being centralized.

    Big Data cannot handle this network overhead. Moving terabytes of data to the application tier will saturate the networks and introduce considerable inefficiencies, possibly leading to system failure. In the Big Data world, the data is distributed across nodes, but the application moves to the data. It is important to note that this process is not easy. Not only does the application need to be moved to the data but all the dependent libraries also need to be moved to the processing nodes. If your cluster has hundreds of nodes, it is easy to see why this can be a maintenance/deployment nightmare. Hence Big Data systems are designed to allow you to deploy the code centrally, and the underlying Big Data system moves the application to the processing nodes prior to job execution.

    Data Is Processed Local to a Node

    This attribute of data being processed local to a node is a natural consequence of the earlier two attributes of Big Data systems. All Big Data programming models are distributed- and parallel-processing based. Network I/O is orders of magnitude slower than disk I/O. Because data has been distributed to various nodes, and application libraries have been moved to the nodes, the goal is to process the data in place.

    Although processing data local to the node is preferred by a typical Big Data system, it is not always possible. Big Data systems will schedule tasks on nodes as close to the data as possible. You will see in the sections to follow that for certain types of systems, certain tasks require fetching data across nodes. At the very least, the results from every node have to be assimilated on a node (the famous reduce phase of MapReduce or something similar for massively parallel programming models). However, the final assimilation phases for a large number of use-cases have very little data compared with the raw data processed in the node-local tasks. Hence the effect of this network overhead is usually (but not always) negligible.

    Sequential Reads Preferred Over Random Reads

    First, you need to understand how data is read from the disk. The disk head needs to be positioned where the data is located on the disk. This process, which takes time, is known as the seek operation. Once the disk head is positioned as needed, the data is read off the disk sequentially. This is called the transfer operation. Seek time is approximately 10 milliseconds, and transferring data takes on the order of 20 milliseconds per 1 MB. This means that if we were reading 100 MB from 100 separate 1 MB sections of the disk, it would cost us 10 ms (seek time) × 100 (seeks) = 1 second, plus 20 ms (transfer time per 1 MB) × 100 = 2 seconds, for a total of 3 seconds to read 100 MB. However, if we were reading 100 MB sequentially from the disk, it would cost us 10 ms (seek time) × 1 (seek) = 10 milliseconds, plus 20 ms × 100 = 2 seconds, for a total of 2.01 seconds.
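
    Restating that comparison compactly (using the approximate figures above, which are illustrative rather than measured):

    Random: 100 seeks × 10 ms + 100 MB × 20 ms/MB = 1,000 ms + 2,000 ms = 3.0 seconds
    Sequential: 1 seek × 10 ms + 100 MB × 20 ms/MB = 10 ms + 2,000 ms ≈ 2.01 seconds

    The transfer cost is identical in both cases; what sequential access removes is the repeated seek cost, and the gap widens as the individual reads become smaller and more scattered.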

    Note that we have used numbers based on Dr. Jeff Dean's address, which is from 2009. Admittedly, the numbers have changed; in fact, they have improved since then. However, the relative proportions between the numbers have not changed, so we will use them for consistency.

    Most throughput-oriented Big Data programming models exploit this feature. Data is swept sequentially off the disk and filtered in main memory. Contrast this with a typical relational database management system (RDBMS) model, which is much more random-read oriented.

    An Example

    Suppose that you want to get the total sales numbers for the year 2000 ordered by state, and the sales data is distributed randomly across multiple nodes. The Big Data technique to achieve this can be summarized in the following steps:

    1. Each node reads in the entire sales data and filters out sales data that is not for the year 2000. The data is distributed randomly across all nodes and is read sequentially off the disk. The filtering happens in main memory, not on the disk, to avoid the cost of seeks.

    2. Each node proceeds to create a group for each state as it is discovered and adds the sales numbers to that state's bucket. (The application is present on all nodes, and data is processed local to a node.)

    3. When all the nodes have completed sweeping the sales data off the disk and computing the total sales by state, they send their respective totals to a designated node (we call this node the assembler node), which has been agreed upon by all nodes at the beginning of the process.

    4. The designated assembler node assembles the total-sales-by-state figures from each node and adds up the values received per state.

    5. The assembler node sorts the final numbers by state and delivers the results.

    This process demonstrates a typical trait of a Big Data system: it focuses on maximizing throughput (how much work gets done per unit of time) over latency (how quickly an individual request is answered, the critical measure by which transactional systems are judged, because we want the fastest possible response).
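
    As a concrete (if simplified) sketch of what each node does in steps 1 through 3, the following plain Java fragment filters a node-local file and accumulates per-state totals in memory. The record layout (one state,year,amount record per line), the class name, and the method of delivery to the assembler node are illustrative assumptions, not code from the book:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    // Per-node aggregation sketch: sweep the local data file sequentially,
    // filter for the year 2000 in memory, and bucket sales totals by state.
    public class NodeLocalSalesAggregator {

        public static Map<String, Double> aggregate(String localFile) throws IOException {
            Map<String, Double> totalsByState = new HashMap<>();
            try (BufferedReader reader = new BufferedReader(new FileReader(localFile))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Assumed record layout: state,year,amount
                    String[] fields = line.split(",");
                    if (fields.length == 3 && "2000".equals(fields[1].trim())) {
                        totalsByState.merge(fields[0].trim(),
                                Double.parseDouble(fields[2].trim()), Double::sum);
                    }
                }
            }
            // In step 3, these partial totals would be sent to the assembler node.
            return totalsByState;
        }
    }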

    Big Data Programming Models

    The major types of Big Data programming models you will encounter are the following:

    Massively parallel processing (MPP) database systems: EMC's Greenplum and IBM's Netezza are examples of such systems.

    In-memory database systems: Examples include Oracle Exalytics and SAP HANA.

    MapReduce systems: These systems include Hadoop, which is the most general-purpose of all the Big Data systems.

    Bulk synchronous parallel (BSP) systems: Examples include Apache HAMA and Apache Giraph.

    Massively Parallel Processing (MPP) Database Systems

    At their core, MPP systems employ some form of splitting data based on the values contained in a column or a set of columns. For example, in the earlier example in which sales for the year 2000, ordered by state, were computed, we could have partitioned the data by state so that certain nodes would contain data for certain states. This method of partitioning would enable each node to compute the total sales for the year 2000 for its own states.

    The limitation of such a system should be obvious: you need to decide how the data will be split at design time, and the splitting criteria chosen are often driven by the underlying use-case. As such, MPP systems are not well suited to ad hoc querying. Certain queries will execute at blazing speed because they can take advantage of how the data is split between nodes. Others will operate at a crawl because the data is not distributed in a manner consistent with how the query accesses it, requiring data to be transferred between nodes over the network.

    To handle this limitation, it is common for such systems to store the data multiple times, split by different criteria. Depending on the query, the appropriate dataset is picked.

    Following is the way in which the MPP programming model meets the attributes defined earlier for Big Data systems (consider the sales ordered by the state example):

    Data is split by state on separate nodes.

    Each node contains all the necessary application libraries to work on its own subset of the data.

    Each node reads data local to itself. An exception is when you apply a query that does not respect how the data is distributed; in this case, each task needs to fetch its own data from other nodes over the network.

    Data is read sequentially for each task. All the sales data is co-located and swept off the disk. The filter (year = 2000) is applied in memory.

    In-Memory Database Systems

    From an operational perspective, in-memory database systems are identical to MPP systems. The implementation difference is that each node has a significant amount of memory, and most data is preloaded into memory. SAP HANA operates on this principle. Other systems, such as Oracle Exalytics, use specialized hardware to ensure that multiple hosts are housed in a single appliance. At the core, an in-memory database is like an in-memory MPP database with a SQL interface.

    One of the major disadvantages of the commercial implementations of in-memory databases is considerable hardware and software lock-in. Also, given that these systems use proprietary and very specialized hardware, they are usually expensive. Trying to use commodity hardware for in-memory databases increases the size of the cluster very quickly. Consider, for example, a commodity server that has 25 GB of RAM. Hosting a 1 TB in-memory database would need more than 40 such hosts (accounting for the other activities that need to be performed on each server). 1 TB is not even that big, and we are already up to a 40-node cluster.
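
    The sizing arithmetic behind that estimate (illustrative only, and assuming roughly 25 GB of RAM per node is usable for data after other server activities):

    1 TB ≈ 1,024 GB; 1,024 GB ÷ 25 GB per node ≈ 41 nodes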

    The following describes how the in-memory database programming model meets the attributes we defined earlier for the Big Data systems:

    Data is split by state in the earlier example. Each node loads data into memory.

    Each node contains all the necessary application libraries to work on its own subset.

    Each node reads data local to itself. The exception is when you apply a query that does not respect how the data is distributed; in this case, each task needs to fetch its data from other nodes.

    Because data is cached in memory, the Sequential Data Read attribute does not apply except when the data is read into memory the first time.

    MapReduce Systems

    MapReduce is the paradigm on which this book is based. It is by far the most general-purpose of the four methods. Some of the important characteristics of Hadoop's implementation of MapReduce are the following:

    It uses commodity scale hardware. Note that commodity scale does not imply laptops or desktops. The nodes are still enterprise scale, but they use commonly available components.

    Data does not need to be partitioned among nodes based on any predefined criteria.

    The user needs to define only two separate processes: map and reduce.

    We will discuss MapReduce extensively in this book. At a very high level, a MapReduce system requires the user to define a map process and a reduce process. When Hadoop is used to implement MapReduce, the data is typically distributed in 64 MB–128 MB blocks, and each block is stored as three copies (a replication factor of 3 is the default in Hadoop). In the example of computing sales for the year 2000, ordered by state, the entire sales data would be loaded into the Hadoop Distributed File System (HDFS) as blocks (64 MB–128 MB in size). When the MapReduce process is launched, the system first transfers all the application libraries (comprising the user-defined map and reduce processes) to each node.

    Map tasks are scheduled on the nodes to sweep the blocks comprising the sales data file. Each Mapper (on its respective node) reads the records of its block and filters out the records that are not for the year 2000. Each Mapper then outputs a key/value pair for each qualifying record: the key is the state, and the value is the sales amount from that record.

    Finally, a configurable number of Reducers will receive the key/value pairs from the Mappers. Keys are assigned to specific Reducers to ensure that a given key is received by one and only one Reducer. Each Reducer then adds up the sales values for all the key/value pairs it receives; the data a Reducer receives is a key (the state) and a list of values for that key (the sales amounts for the year 2000). The output is written back to HDFS, and the client then reads the result from HDFS and sorts it by state. This last step can instead be delegated to the Reducer, because each Reducer receives its assigned keys in sorted order; in this example, however, we would need to restrict the number of Reducers to one to achieve that. Because communication between Mappers and Reducers causes network I/O, it can lead to bottlenecks. We will discuss this issue in detail later in the book.
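
    A minimal Hadoop sketch of the map and reduce steps just described might look like the following (this uses the Hadoop 2.x org.apache.hadoop.mapreduce API; the CSV record layout of state,year,amount, the class names, and the choice of DoubleWritable are illustrative assumptions, not code from the book):

    import java.io.IOException;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SalesByState {

        // Map step: emit (state, amount) only for records from the year 2000.
        public static class SalesMapper
                extends Mapper<LongWritable, Text, Text, DoubleWritable> {
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                // Assumed record layout: state,year,amount
                String[] fields = line.toString().split(",");
                if (fields.length == 3 && "2000".equals(fields[1].trim())) {
                    context.write(new Text(fields[0].trim()),
                            new DoubleWritable(Double.parseDouble(fields[2].trim())));
                }
            }
        }

        // Reduce step: sum all sales amounts received for a given state.
        public static class SalesReducer
                extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
            @Override
            protected void reduce(Text state, Iterable<DoubleWritable> amounts, Context context)
                    throws IOException, InterruptedException {
                double total = 0;
                for (DoubleWritable amount : amounts) {
                    total += amount.get();
                }
                context.write(state, new DoubleWritable(total));
            }
        }
    }

    The framework performs the shuffle between the two steps: it groups the Mappers' output by key so that each state, with all of its values, arrives at exactly one Reducer.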

    This is how the MapReduce programming model meets the attributes defined earlier for the Big Data systems:

    Data is split into large blocks on HDFS. Because HDFS is a distributed file system, the data blocks are distributed redundantly across all the nodes.

    The application libraries, including the map and reduce application code, are propagated to all the task nodes.

    Each node reads data local to itself. Mappers are launched on all the nodes and read the data blocks local to themselves (in most cases; the mapping between tasks and disk blocks is up to the scheduler, which may allocate remote blocks to map tasks to keep all nodes busy).

    Data is read sequentially for each task, one large block at a time (blocks are typically 64 MB–128 MB in size).

    One of the important limitations of the MapReduce paradigm is that it is not suitable for iterative algorithms. A vast majority of data science algorithms are iterative by nature and eventually converge to a solution. When applied to such algorithms, the MapReduce paradigm requires each iteration to be run as a separate MapReduce job, and each iteration often uses the data produced by its previous iteration. But because each MapReduce job reads fresh from the persistent storage, the iteration needs to store its results in persistent storage for the next iteration to work on. This process leads to unnecessary I/O and significantly impacts the overall throughput. This limitation is addressed by the BSP class of systems, described next.

    Bulk Synchronous Parallel (BSP) Systems

    The BSP class of systems operates very similarly to the MapReduce approach. However, instead of the MapReduce job terminating at the end of its processing cycle, the BSP system is composed of a list of processes (identical to the map processes) that synchronize on a barrier, send data to the Master node, and exchange relevant information. Once the iteration is completed, the Master node will indicate to each processing node to resume the next iteration.

    Synchronizing on a barrier is a commonly used concept in parallel programming. It is used when many threads are responsible for performing their own tasks but need to agree on a checkpoint before proceeding. This pattern is needed when all threads must have completed their work up to a certain point before a decision is made to proceed with or abort the rest of the computation (in parallel or in sequence). Synchronization barriers appear in real-world processes all the time; for example, carpool mates often meet at a designated place before proceeding in a single car. The overall process is only as fast as the last person (or thread) arriving at the barrier.
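
    Barrier synchronization is easy to see in miniature with plain Java threads. The following sketch is a generic illustration using java.util.concurrent, not Hadoop or BSP framework code; the worker count, iteration count, and simulated work are arbitrary choices for the example:

    import java.util.concurrent.CyclicBarrier;

    // Each worker finishes its local work for an iteration and then waits at the
    // barrier; only when all workers have arrived does the next iteration begin,
    // so the overall pace is set by the slowest worker.
    public class BarrierDemo {
        public static void main(String[] args) {
            final int workers = 4;
            final int iterations = 3;
            CyclicBarrier barrier = new CyclicBarrier(workers,
                    () -> System.out.println("All workers reached the barrier; starting next iteration"));

            for (int w = 0; w < workers; w++) {
                final int id = w;
                new Thread(() -> {
                    try {
                        for (int i = 0; i < iterations; i++) {
                            Thread.sleep((long) (Math.random() * 100)); // simulated local work
                            System.out.println("Worker " + id + " finished iteration " + i);
                            barrier.await(); // wait until every worker has finished this iteration
                        }
                    } catch (Exception e) {
                        Thread.currentThread().interrupt();
                    }
                }).start();
            }
        }
    }

    Conceptually, a BSP superstep corresponds to one pass through this loop, with message exchange taking place at the barrier.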

    The BSP method of execution allows each map-like process to cache its previous iteration's data, significantly improving the throughput of the overall process. We will discuss BSP systems in the Data Science chapter of this book; they are most relevant to iterative algorithms.

    Big Data and Transactional Systems

    It is important to understand how the concept of transactions has evolved in the context of Big Data. This discussion is relevant to NoSQL databases. Hadoop has HBase as its NoSQL data store. Alternatively, you can use Cassandra or NoSQL systems available in the cloud such as Amazon Dynamo.

    Although most RDBMS users expect ACID properties in databases, these properties come at a cost. When the underlying database needs to handle millions of transactions per second at peak time, it is extremely challenging to respect ACID features in their purest form.

    Note

    ACID is an acronym for atomicity, consistency, isolation, and durability. A detailed discussion can be found at the following link: http://en.wikipedia.org/wiki/ACID .

    Some compromises are necessary, and the motivation behind these compromises is encapsulated in what is known as the CAP theorem (also known as Brewer’s theorem). CAP is an acronym for the following:

    Consistency: All nodes see the same copy of the data at all times.

    Availability: A guarantee that every request receives a response about success or failure within a reasonable and well-defined time interval.

    Partition tolerance: The system continues to perform despite failure of its parts.

    The theorem goes on to prove that in any system only two of the preceding features are achievable, not all three. Now, let’s examine various types of systems:

    Consistent and available: A single RDBMS with ACID properties is an example of a system that is consistent and available. It is not partition-tolerant; if the RDBMS goes down, users cannot access the data.

    Consistent and partition-tolerant: A clustered RDBMS is such a system. Distributed transactions ensure that all users always see the same data (consistency), and the distributed nature of the data ensures that the system survives the loss of individual nodes. However, by virtue of distributed transactions, the system will be unavailable for periods of time while two-phase commits are being issued. This limits the number of simultaneous transactions the system can support, which in turn limits its availability.

    Available and partition-tolerant: Systems classified as eventually consistent fall into this category. Consider a very popular e-commerce web site such as Amazon.com. Imagine that you are browsing the product catalog and notice that two units of a certain item are available for sale. By the nature of the buying process, you are aware that between you noticing that a certain number of items are available and issuing the buy request, someone else could come in first and buy them. So there is little incentive to always show the most up-to-date value as the inventory changes. Inventory changes will be propagated to all the nodes serving users. Preventing users from browsing the inventory while this propagation takes place, just to provide the most current value, would limit the availability of the web site and result in lost sales. Thus, we have sacrificed consistency for availability, and partition tolerance allows multiple nodes to serve the same data (although there may be a small window of time in which users see different values, depending on the nodes that serve them).

    These decisions are critical when developing Big Data systems. MapReduce, the main topic of this book, is only one component of the Big Data ecosystem. It often exists in the context of other products, such as HBase, in which making the trade-offs discussed in this section is critical to developing a good solution.

    How Much Can We Scale?

    We made several assumptions in our examples earlier in the chapter. For example, we ignored CPU time. For a large number of business problems, computational complexity does not dominate. However, with the growth in computing capability, various domains became practical from an implementation point of view. One example is data mining using complex Bayesian statistical techniques. These problems are indeed computationally expensive. For such problems, we need to increase the number of nodes to perform processing or apply alternative methods.

    Note

    The paradigms used in Big Data computing such as MapReduce have also been extended to other parallel computing methods. For example, general-purpose computation on graphics programming units (GPGPU) computing achieves massive parallelism for compute-intensive problems.

    We also ignored network I/O costs. Using 50 compute nodes to process data also requires a distributed file system and incurs communication costs for assembling results from the 50 nodes in the cluster. In all Big Data solutions, I/O costs dominate, and these costs introduce serial dependencies into the computational process.

    A Compute-Intensive Example

    Consider processing 200 GB of data with 50 nodes, in which each node processes 4 GB of data located on a local disk. Each node takes 80 seconds to read its data (at the rate of 50 MB per second), so no matter how fast we compute, we cannot finish in under 80 seconds. Assume that the result of the process is a dataset of total size 200 MB, of which each node generates 4 MB, and that this result is transferred over a 1 Gbps network (1 MB per packet) to a single node for display. It will take about 3 milliseconds to transfer each node's result to the destination node (each 1 MB requires 250 microseconds to transfer over the network, and the network latency per packet is assumed to be 500 microseconds, based on the previously referenced talk by Dr. Jeff Dean). Ignoring computational costs, the total processing time cannot be under 80.003 seconds.
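
    Spelling out that calculation with the stated assumptions (illustrative figures, not measurements):

    Read: 4 GB per node ÷ 50 MB/s ≈ 80 seconds, with all 50 nodes reading in parallel
    Transfer: 4 MB per node × 250 µs/MB + 4 packets × 500 µs latency ≈ 1 ms + 2 ms = 3 ms
    Lower bound: 80 s + 0.003 s ≈ 80.003 seconds, ignoring CPU time entirely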

    Now imagine that we have 4000 nodes, and magically each node reads its own 50 MB of data from a local disk and produces 0.05 MB of the result set. Notice that we cannot go faster than 1 second if data is read in 50 MB blocks. This translates to a maximum performance improvement by a factor of about 4000. In other words, for a certain class of problems, if it takes 4000 hours to complete the processing on a single node, we cannot do better than 1 hour, no matter how many nodes are thrown at the problem. A factor of 4000 might sound like a lot, but there is an upper limit to how fast we can get. In this simplistic example, we have made many simplifying system assumptions. We also assumed that there are no serial dependencies in the application logic, which is usually a false assumption. Once we add those costs, the maximum possible performance gain falls drastically.

    Serial dependencies, which are the bane of all parallel computing algorithms, limit the degree of performance improvement. This limitation is well known and is documented as Amdahl's Law.

    Amdahl's Law

    Just as the speed of light defines the theoretical limit of how fast we can travel in our universe, Amdahl's Law defines the limits of the performance gain we can achieve by adding more nodes to clusters.

    Note

    See http://en.wikipedia.org/wiki/Amdahl's_law for a full discussion of Amdahl's Law.

    In a nutshell, the law states that if a given solution can be made perfectly parallelizable up to a proportion P (where P ranges from 0 to 1), the maximum performance improvement we can obtain, given an infinite number of nodes (a fancy way of saying a lot of nodes in the cluster), is 1/(1-P). Thus, if even 1 percent of the execution cannot be made parallel, the best improvement we can get is 100-fold. All programs have some serial dependencies, and disk I/O and network I/O add more. There are limits to how much improvement we can achieve, regardless of the methods we use.
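
    As a worked example of the law in its usual form (the 1 percent figure is the one from the paragraph above):

    Speedup(N) = 1 / ((1 − P) + P/N)
    As N → ∞, Speedup → 1 / (1 − P)
    With P = 0.99: Speedup ≤ 1 / 0.01 = 100, no matter how many nodes are added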

    Business Use-Cases for Big Data

    Big Data and Hadoop have several applications in the business world. At the risk of sounding cliché, the three big attributes of Big Data are considered to be these:

    Volume

    Velocity

    Variety

    Volume relates to the size of the data processed. If your organization needs to extract, load, and transform 2 TB of data in 2 hours each night, you have a volume problem.

    Velocity relates to the speed at which data arrives. Organizations such as Facebook and Twitter encounter the velocity problem: they receive massive numbers of tiny messages every second, and each message needs to be processed almost immediately, posted to the social media site, propagated to related users (family, friends, and followers), have events generated from it, and so on.

    Variety is related to an increasing number of formats that need to be processed. Enterprise search systems have become commonplace in organizations. Open-source software such as Apache Solr has made search-based systems ubiquitous. Most unstructured data is not stand-alone; it has considerable structured data associated with it. For example, consider a simple document such as an e-mail. E-mail has considerable metadata associated with it. Examples include sender, receivers, order of receivers, time sent/received, organizational information about the senders/receivers (for example, a title at the time of sending), and so on.

    Some of this information is even dynamic. For example, if you are analyzing years of e-mail (the legal practice area has several use-cases around this), it is important to know what the titles of the senders and receivers were when an e-mail was first sent. This feature of dynamic master data is commonplace and leads to several interesting challenges.

    Big Data helps solve everyday problems such as large-scale extract, transform, load (ETL) issues by using commodity software and hardware. In particular, open-source Hadoop, which runs on commodity servers and can scale by adding more nodes, enables ETL (or ELT, as it is commonly called in the Big Data domain) to be performed significantly faster at commodity costs.

    Several open-source products have evolved around Hadoop and the HDFS to support velocity and variety use-cases. New data formats have evolved to manage the I/O performance around massive data processing. This book will discuss the motivations behind such developments and the appropriate use-cases for them.

    Storm (which evolved at Twitter) and Apache Flume (designed for large-scale log analysis) evolved to handle the velocity factor. The choice of which software to use depends on how close to real time the processing needs to be; Storm is useful for problems that require processing closer to real time than Flume provides.

    The key message is this: Big Data is an ecosystem of various products that work in concert to solve very complex business problems. Hadoop is often at the center of such solutions. Understanding Hadoop enables you to develop a strong understanding of how to use the other entrants in the Big Data ecosystem.

    Summary

    Big Data has now become mainstream, and the two main drivers behind it are open-source Hadoop software and the advent of the cloud. Both of these developments have allowed mass-scale adoption of Big Data methods to handle business problems at low cost. Hadoop is the cornerstone of all Big Data solutions. Although other programming models, such as MPP and BSP, have sprung up to handle very specific problems, they all depend on Hadoop in some form or other when the scale of data to be processed reaches the multiterabyte range. Developing a deep understanding of Hadoop enables users of other programming models to be more effective. The goal of this book is to help you achieve that.

    The chapters to come will guide you through the specifics of using the Hadoop software as well as offer practical methods for solving problems with Hadoop.

    © Sameer Wadkar and Madhu Siddalingaiah 2014

    Sameer Wadkar and Madhu Siddalingaiah, Pro Apache Hadoop, DOI 10.1007/978-1-4302-4864-4_2

    2. Hadoop Concepts

    Sameer Wadkar¹ and Madhu Siddalingaiah¹

    (1) MD, US

    Applications frequently require more resources than are available on an inexpensive (commodity) machine. Many organizations find themselves with business processes that no longer fit on a single, cost-effective computer. A simple but expensive solution has been to buy specialty machines that have a lot of memory and many CPUs. This solution scales as far as the fastest machines available, and usually the only limiting factor is your budget. An alternative solution is to build a high-availability cluster, which typically attempts to look like a single machine and usually requires very specialized installation and administration services. Many high-availability clusters are proprietary and expensive.

    A more economical solution for acquiring the necessary computational resources is cloud computing. A common pattern is to have bulk data that needs to be transformed, in which the processing of each data item is essentially independent of other data items; that is, by using a single-instruction, multiple-data (SIMD) scheme. Hadoop provides an open-source framework for cloud computing, as well as a distributed file system.

    This book is designed to be a practical guide to developing and running software using Hadoop, a project hosted by the Apache Software Foundation. This chapter introduces you to the core Hadoop concepts. It is meant to prepare you for the next chapter, in which you will get Hadoop installed and running.

    Introducing Hadoop

    Hadoop is based on the Google paper on MapReduce published in 2004, and its development started in 2005. At the time, Hadoop was developed to support the open-source web search engine project called Nutch. Eventually, Hadoop separated from Nutch and became its own project under the Apache Foundation.

    Today Hadoop is the best–known MapReduce framework in the market. Currently, several companies have grown around Hadoop to provide support, consulting, and training services for the Hadoop software.

    At its core, Hadoop is a Java–based MapReduce framework. However, due to the rapid adoption of the Hadoop platform, there was a need to support the non–Java user community. Hadoop evolved into having the following enhancements and subprojects to support this community and expand its reach into the Enterprise:

    Hadoop Streaming: Enables using MapReduce with any command-line script. This makes MapReduce usable by UNIX script programmers, Python programmers, and so on for development of ad hoc jobs.

    Hadoop Hive: Users of MapReduce quickly realized that developing MapReduce programs is a very programming-intensive task, which makes it error-prone and hard to test. There was a need for more expressive languages such as SQL to enable users to focus on the problem instead of low-level implementations of typical SQL artifacts (for example, the WHERE clause, GROUP BY clause, JOIN clause, etc.). Apache Hive evolved to provide a data warehouse (DW) capability to large datasets. Users can express their queries in Hive Query Language, which is very similar to SQL. The Hive engine converts these queries to low-level MapReduce jobs transparently. More advanced users can develop user-defined functions (UDFs) in Java. Hive also supports standard drivers such as ODBC and JDBC. Hive is also an appropriate platform to use when developing Business Intelligence (BI) types of applications for data stored in Hadoop.

    Hadoop Pig: Although the motivation for Pig was similar to Hive, Hive is a SQL-like language, which is declarative. On the other hand, Pig is a procedural language that works well in data-pipeline scenarios. Pig will appeal to programmers who develop data-processing pipelines (for example, SAS programmers). It is also an appropriate platform to use for extract, load, and transform (ELT) types of applications.

    Hadoop HBase: All the preceding projects, including MapReduce, are batch processes. However, there is a strong need for real–time data lookup in Hadoop. Hadoop did not have a native key/value store. For example, consider a Social Media site such as Facebook. If you want to look up a friend’s profile, you expect to get an answer immediately (not after a long batch job runs). Such use-cases were the motivation for developing the HBase platform.

    We have only just scratched the surface of what Hadoop and its subprojects allow us to do. However, the previous examples should provide perspective on why Hadoop evolved the way it did. Hadoop started out as a MapReduce engine developed for the purpose of indexing massive amounts of text data. It slowly evolved into a general-purpose model to support standard Enterprise use-cases such as DW, BI, ELT, and real-time lookup caches. Although MapReduce is a very useful model, it was the adaptation to standard Enterprise use-cases of the type just described (ETL, DW) that enabled it to penetrate the mainstream computing market. Also important is that organizations are now grappling with processing massive amounts of data.

    For a very long time, Hadoop remained a system in which users submitted jobs that ran on the entire cluster. Jobs were executed in First In, First Out (FIFO) order. However, this led to situations in which a long-running, less-important job would hog resources and not allow a smaller yet more important job to execute. To solve this problem, more complex job schedulers, such as the Fair Scheduler and the Capacity Scheduler, were created for Hadoop. But Hadoop 1.x (prior to version 0.23) still had scalability limitations that were a result of some deeply entrenched design decisions.

    Yahoo engineers found that Hadoop had scalability problems when the number of nodes increased to the order of a few thousand ( http://developer.yahoo.com/blogs/hadoop/scaling-hadoop-4000-nodes-yahoo-410.html ). As these problems became better understood, the Hadoop engineers went back to the drawing board and reassessed some of the core assumptions underlying the original design; eventually this led to a major overhaul of the core Hadoop platform. Hadoop 2.x (from version 0.23 of Hadoop onward) is the result of this overhaul.

    This book will cover version 2.x with appropriate references to 1.x, so you can appreciate the motivation for the changes in 2.x.

    Introducing the MapReduce Model

    Hadoop supports the MapReduce model, which was introduced by Google as a method of solving a class of petascale problems with large clusters of commodity machines. The model is based on two distinct steps, both of which are custom and user-defined for an application:

    Map: An initial ingestion and transformation step in which individual input records can be processed in parallel

    Reduce: An aggregation or summarization step in which all associated records must be processed together by a single entity

    The core concept of MapReduce in Hadoop is that input can be split into logical chunks, and each chunk can be initially processed independently by a map task. The results of these individual processing chunks can be physically partitioned into distinct sets, which are then sorted. Each sorted chunk is passed to a reduce task. Figure 2-1 illustrates how the MapReduce model works.

    Figure 2-1. MapReduce model
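
    To show how the two user-defined steps are wired into a runnable Hadoop job, here is a minimal driver sketch in Hadoop's Java API. It reuses the illustrative SalesMapper and SalesReducer classes sketched in Chapter 1; the class name, output types, and command-line path arguments are assumptions for the example, not code from the book:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Minimal MapReduce driver: declares the map and reduce classes, the output
    // key/value types, and the input/output paths, then submits the job and waits.
    public class SalesByStateDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "sales by state");
            job.setJarByClass(SalesByStateDriver.class);
            job.setMapperClass(SalesByState.SalesMapper.class);   // user-defined map step
            job.setReducerClass(SalesByState.SalesReducer.class); // user-defined reduce step
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(DoubleWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }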

    A map task can run on any compute node in the cluster, and multiple map tasks can run in parallel across the cluster. The map task is responsible for transforming the input records into key/value pairs. The output
