Data Processing and Modeling with Hadoop: Mastering Hadoop Ecosystem Including ETL, Data Vault, DMBok, GDPR, and Various Data-Centric Tools
Ebook, 317 pages, 6 hours

About this ebook

The book 'Data Processing and Modeling with Hadoop' explains how a distributed system works and its benefits in the big data era in a straightforward and clear manner. After reading the book, you will be able to plan and organize projects involving a massive amount of data.

The book describes the standards and technologies that aid in data management and compares them to other technology business standards. The reader receives practical guidance on how to separate data into zones, as well as how to develop a model that can aid in data evolution. It discusses security and the measures used to reduce the impact of security incidents. Self-service analytics, Data Lake, Data Vault 2.0, and Data Mesh are also discussed in the book.

After reading this book, the reader will have a thorough understanding of how to structure a data lake, as well as the ability to plan, organize, and carry out the implementation of a data-driven business with full governance and security.
Language: English
Release date: Nov 10, 2021
ISBN: 9789391392369
    Book preview

    Data Processing and Modeling with Hadoop - Vinicius Aquino do Vale

    CHAPTER 1

    Understanding the Current Moment

    Introduction

    In this chapter, we give a brief introduction to the current moment. Let's understand exactly where we are and what path we should take to move the company towards a data-driven culture.

    Structure

    In this chapter, we will discuss the following topics:

    A little context

    Why use it?

    Solving problems

    Hadoop ecosystem

    Mounting a data lake

    What does the data tell us?

    Objectives

    After studying this chapter, you should be able to:

    Learn a little about the history of data

    Understand the potential of data and its importance for the future

    Solve everyday problems with Hadoop

    Learn how to start building a data lake

    Extract business ideas with data

    A little context

    We start this book by putting our entry into the Information Age in context. It occurred so abruptly that we didn't even notice when it started. Suddenly, we woke up bombarded with information that did not exist before. We receive and send thousands of pieces of information every day using apps like WhatsApp, TikTok, WeChat, Facebook, Instagram, Twitter, LinkedIn, Waze, and others.

    Although this is obvious today, 20 years ago, in the early 2000s, we didn't even know what it meant, and it was only 10 years ago that we started to hear a little about subjects such as big data, the Internet of Things, and Artificial Intelligence. These subjects had been studied for a long time, but there was not enough processing capacity to put them into practice. Today, with current technologies, it is possible to put into practice and observe things that were previously impossible to see, much less understand.

    In this context, companies realized the need for tools capable of processing immense volumes of data while still delivering results in acceptable times, so that data-based decision-making became possible. We entered the Information Age, where the large volume of data brought information that was not previously known or was not possible to see, taking humanity out of ignorance and leading it towards wisdom.

    Figure 1.1: Data cycle

    The beginning

    In the beginning, data was just data: it did not represent anything tangible, it had no value, and companies had no reason to invest in it. Over time, data became information, and this transformation brought intrinsic value to the business. Some companies knew how to take advantage of it, while others ignored it and succumbed to the new times.

    Companies that understood the importance of data began collecting it on a massive scale. They indiscriminately collected all kinds of information, from any type of source, anywhere, that could be used to understand the profile of a group or of users. This was information that had previously been ignored or had not generated usable value. Today, this information makes sense for many companies, especially those that want to understand how a certain group or profile interacts in the online and even offline world. For this reason, information has undergone another transformation: it has become knowledge.

    The companies that had access to this knowledge took advantage of this "Open Sea" to understand the habits of groups and people, and that knowledge gave new impetus to their business. Facebook is a classic example: realizing that many users were migrating to other applications, and that this trend was growing, Facebook acquired WhatsApp and Instagram before these applications could gain global scale. It combined knowledge of its users' habits with market knowledge to realize that it needed to adapt so as not to be left behind.

    Once the knowledge drawn from the data was consolidated, ideas began to emerge on how to use it in the best possible way; at this point, we migrate to wisdom. The migration to wisdom is still happening, and few companies really know where it will lead. The use of artificial intelligence is still in its initial stages, but it is the path that all companies will have to follow if they want to continue to exist.

    Nowadays

    Reaching the stage of wisdom is not simple and requires knowledge in several areas that were not previously discussed. To understand how to extract the right information about the business, it is necessary, first of all, to know how the business works, how the business rules are applied, and how users interact with the business. From there, it is possible to know the target audience and how to get the attention of this group. But to understand the target audience, it is necessary to know how to use mathematical and statistical models in order to find the patterns that are sought, and to apply them in the generation of insights that will bring value to the business. All of this should happen as quickly as possible because, in the Information Age, every minute can represent profit or loss, so environments need to be prepared to receive and understand this infinity of data. In this sense, the merger of three important areas—Computer Science (IT), Mathematics/Statistics, and Business Knowledge—occurs and causes a new area to emerge: Data Science. It appears exactly to help transform ideas into wisdom, coming with almost divine purposes.

    Figure 1.2: Skills

    The migration to wisdom requires, above all, experienced and trained professionals, because working with this amount of data is not a simple task. Most of the time, it requires changes in the structure of the company, new adaptations in the processes, and, in many cases, new leadership. Not to mention that data governance and security controls with authentication and authorization are necessary to keep control of your data without turning this asset into a cost.

    Like any system, the data needs to be somewhere, so the most suitable location for it is known as the Single Source of Truth (SSOT) and can involve several approaches that are used in each type of situation. They are environments such as Enterprise Service Bus (ESB), Master Data Management (MDM), and Data Warehouse (DW) or data lake, each with its advantages and disadvantages.

    When we look at IT today, we cannot imagine a company without it, but in the past, IT was only one arm of the company, which often did not participate in the business or in decision-making. It was just there to support the company's daily needs. Today, all that has changed: the IT area has become the essence of the business, and it is often the business itself. The biggest companies in the world today are technology companies, which do not offer physical products but only data, or rather, information about the interactions of a certain group or of people.

    Most businesses today depend on IT. What is data in this context?

    Data is the company's asset; a new data-based economy is being born, or has already been born. It is simply a new way of doing business, and whoever has more wisdom will win this marathon.

    In this context, we will talk about tools that help companies with these actions, how to prepare the environment for infinite interactions with data, and how to extract value from this information by converting it into knowledge and wisdom, bringing valuable insights to the business. Is your company prepared for this new era?

    Why use it?

    Having understood the context in which we are involved, we realized that the world was undergoing changes, and these changes gave us new business opportunities, new ways of working, and a new way of seeing the world. Who would have thought that companies like Uber, Lyft, Airbnb, Didi Chuxing, Netflix, SpaceX, Tesla, and WeWork, among others, would one day exist and change the way of doing business, change the way people interact with each other, and change the way we see the world?

    Had it ever crossed your mind that one day you would rent a place to stay other than a hotel or inn, call a car to pick you up and take you wherever you wanted to go, stream a movie at home, or maybe even go to space for a sightseeing tour? Do you remember how it all started? We don't, and you probably don't either.

    Today, we use these services as if they had always been part of our lives. But what else can we learn from analyzing these companies in depth? Let's look at SpaceX's example: when Elon Musk created it, before putting his first Falcon 1 rocket into operation, his goal was to land a miniature experimental greenhouse on Mars and grow plants there. There were so many failures that he almost gave up on the project, but these errors helped him move forward because, through the collected data, he could improve and perfect his rockets to do what many thought would be impossible.

    Today, SpaceX's achievements include the first privately funded liquid fuel rocket to reach Earth orbit (Falcon 1 in 2008); the first privately funded company to launch, orbit, and recover a spaceship (Dragon in 2010); the first private company to send a spacecraft to the International Space Station (ISS) (Dragon in 2012); the first propulsive landing of an orbital rocket (Falcon 9 in 2015); and the first reuse of an orbital rocket (Falcon 9 in 2017).

    By March 2017, SpaceX had already flown ten missions to the ISS under a cargo resupply contract. NASA also awarded SpaceX a further development contract in 2011 to demonstrate a Dragon that would be used to transport astronauts to the ISS and safely return them to Earth.

    Let's look at another example: Tesla, the company of which Elon Musk is the CEO, produces electric cars, but what stands out the most is these vehicles' ability to drive autonomously. In October 2016, Tesla announced that all the cars it was producing (the Model S, the Model X, and the new Model 3, which was launched in mid-2017) would have the hardware needed for drivers not to have to touch the steering wheel. The models have eight cameras providing 360° visibility at a distance of up to 250 meters around the car, and twelve updated ultrasonic sensors to "complement the view". In the front, a radar with advanced processing capabilities provides additional data about the car's surroundings on a redundant frequency, capable of feeding the system with information even if it is raining, foggy, or there is dust in the air.

    How important is the data in these two examples? It is clear that the data drove the business; it was the data that drove these companies to success. Of course, data is not miraculous, but it helped a great deal in pointing them in the right direction.

    Let's look at another example: Netflix. Basically, it offers movies and TV series in a streaming model, but that is not all it does. Each user who interacts with the application generates events, and these events are processed and analyzed so that they can guide the company in the acquisition of new titles or, more recently, in the creation of its own content. With that, Netflix managed to have 8 of its own productions nominated at the 2020 Oscars, radically changing the way we see films and TV shows. However, in order to decide which productions to invest in, which actors to hire, and which plots to pursue, Netflix analyzed a great many user-generated events to find the patterns that would please its audience, and it managed this with mastery.

    Since data is the protagonist of the 21st century, how much information is generated every minute?

    Figure 1.3: Data in a minute

    We have never generated as much data as we do now; this is the power of data. Earlier, what we saw was just the tip of the iceberg, but today we can already see the complete iceberg. With this amount of information, we can learn how society interacts, or rather, society shows us what its desires are, and from there we can build businesses that fit the user's daily needs.

    Understanding this power and the dynamics of humanity, companies are able to offer individualized products and services, thus providing a unique experience for each user.

    Obviously, we know that this data can be used for good or evil, and that privacy ends up being sacrificed in favor of collective knowledge. Privacy is one of the biggest obstacles to the massification of data, since the growing storage and integration of personally identifiable information is itself a threat to privacy. The world and its current legislation are not prepared for the possibilities that big data offers to aggregate, analyze, and draw conclusions from data that was hitherto scattered.

    Big data has already been cited as an essential tool in manipulating elections and spreading fake news. This is due to the technology's inherent ability to gather and segment a specific target audience, which makes big data an ethically questionable methodology, since it can be used to manipulate the masses and obtain biased results according to the motivation of the experts involved.

    Solving problems

    From the moment a company starts to adopt data in decision-making, the objective becomes finding answers to problems that did not exist before; not that the problems didn't exist, but they just weren't considered problems earlier. Today, several sectors of society are investing in big data, says a study by the International Data Corporation (IDC). The objective behind big data is to improve the provision of information to managers, ensuring that there is support in decision-making with real and accurate data.

    The following are some of the big data applications in different sectors:

    The movie Moneyball is based on Michael Lewis' book Moneyball: The Art of Winning an Unfair Game, which tells the story of Billy Beane, the general manager of the Oakland Athletics baseball team, who uses big data to assemble a top team without spending a lot. The film focuses on Beane's attempts to create a competitive team for the 2002 Oakland season, despite the team's unfavorable financial situation, using a sophisticated statistical analysis of the players.

    After analyzing the routes of its drivers, UPS stopped them from turning left wherever possible. According to the company, this has saved about 38 million liters of fuel per year and avoided the emission of 20 thousand tons of carbon dioxide. In addition, it delivers 350 thousand more packages.

    After the Haiti earthquake, American researchers used the geolocation of 2 million SIM cards to assist in humanitarian missions.

    To advance research in particle physics, CERN (the European Organization for Nuclear Research) built the largest particle accelerator in the world, called the Large Hadron Collider. With it, a huge amount of data
