Data Processing and Modeling with Hadoop: Mastering Hadoop Ecosystem Including ETL, Data Vault, DMBok, GDPR, and Various Data-Centric Tools
About this ebook
The book describes the standards and technologies that aid in data management and compares them to other technology business standards. The reader receives practical guidance on how to separate data into zones, as well as how to develop a model that can aid in data evolution. It discusses security and the measures that are used to reduce the impact of security risks. Self-service analytics, Data Lake, Data Vault 2.0, and Data Mesh are also discussed in the book.
After reading this book, the reader will have a thorough understanding of how to structure a data lake, as well as the ability to plan, organize, and carry out the implementation of a data-driven business with full governance and security.
Data Processing and Modeling with Hadoop - Vinicius Aquino do Vale
CHAPTER 1
Understanding the Current Moment
Introduction
In this chapter, we will give a brief introduction to the current moment. Let's understand exactly where we are and what path we should take to move the company towards a data-driven culture.
Structure
In this chapter, we will discuss the following topics:
A little context
Why use it?
Solving problems
Hadoop ecosystem
Mounting a data lake
What does the data tell us?
Objectives
After studying this chapter, you should be able to:
Learn a little about the history of data
Understand the potential of data and its importance for the future
Solve everyday problems with Hadoop
Learn how to start building a data lake
Extract business ideas with data
A little context
We start this book by contextualizing our entry into the Information Age, which occurred so abruptly that we didn't even notice when it started. Suddenly, we woke up and were bombarded with information that did not exist before. We receive and send thousands of pieces of information every day using apps like WhatsApp, TikTok, WeChat, Facebook, Instagram, Twitter, LinkedIn, Waze, and others.
Although this is obvious today, 20 years ago, in the early 2000s, we didn't even know what it meant, and it was only 10 years ago that we started to hear a little about subjects such as big data, the Internet of Things, and Artificial Intelligence. Although these subjects had been studied for a long time, there was not enough processing capacity to put them into practice. But today, with current technologies, it is possible to put them into practice and to see things that were previously impossible to observe, much less understand.
In this context, companies realized the need to have tools that were capable of processing immense volumes, and that could still offer results in acceptable times so that data-based decision-making was possible. We entered the Information Age, where the large volume of data brought information that was not previously known or was not possible to see, taking humanity out of ignorance and leading it towards wisdom.
Figure 1.1: Data cycle
The beginning
In the beginning, data was just data; it did not represent anything tangible, it had no value, and companies had no reason to invest in it. Over time, data became information, and this transformation brought intrinsic value to the business. Some companies knew how to take advantage of it, while others ignored it and succumbed to the new times.
Companies that understood the importance of data began collecting it on a massive scale and indiscriminately: all kinds of information, from any type of source, from anywhere, as long as it could be used to understand the profile of a group or of individual users. This was information that had previously been ignored or that did not generate usable value. Today, this information makes sense for many companies, especially those that want to understand how a certain group or profile interacts in the online and even the offline world. For this reason, information has undergone another transformation: it has become knowledge.
The companies that had access to this knowledge took advantage of this "Open Sea" to understand the habits of groups and individuals, and that knowledge gave new impetus to their business. Facebook is a classic example: realizing that many users were migrating to other applications and that this demand was increasing, Facebook acquired WhatsApp and Instagram before these applications could gain ground on a global scale. It combined knowledge of its users' habits with market knowledge to realize that it needed to adapt so as not to be left behind.
Once the knowledge derived from data was consolidated, ideas began to emerge on how to use it in the best possible way; at this moment, we migrate to wisdom. The migration to wisdom is still happening, and few companies really know where it will lead. The use of artificial intelligence is still in its initial stages, but it is the path that all companies will have to follow if they want to continue to exist.
Nowadays
Reaching the stage of wisdom is not simple and requires knowledge in several areas that were not previously discussed. To understand how to extract the right information about the business, it is necessary, first of all, to know how the business works, how the business rules are applied, and how users interact with the business. From there, it is possible to know the target audience and how to get the attention of this group. But to understand the target audience, it is necessary to know how to use mathematical and statistical models in order to find the patterns that are sought, and to apply them in the generation of insights that will bring value to the business. All of this should happen as quickly as possible because, in the Information Age, every minute can represent profit or loss, so environments need to be prepared to receive and understand this infinity of data. In this sense, the merger of three important areas—Computer Science (IT), Mathematics/Statistics, and Business Knowledge—occurs and causes a new area to emerge: Data Science. It appears exactly to help transform ideas into wisdom, coming with almost divine purposes.
Figure 1.2: Skills
The migration to wisdom requires, above all, experienced and trained professionals, because working with this amount of data is not a simple task. Most of the time, it requires changes in the structure of the company, adaptations in its processes, and, in many cases, new leadership. In addition, establishing data governance and security controls with authentication and authorization is necessary to keep control of your data without turning this asset into a cost.
Like any system, the data needs to be somewhere, so the most suitable location for it is known as the Single Source of Truth (SSOT) and can involve several approaches that are used in each type of situation. They are environments such as Enterprise Service Bus (ESB), Master Data Management (MDM), and Data Warehouse (DW) or data lake, each with its advantages and disadvantages.
When we look at IT today, we cannot imagine a company without it, but in the past, the IT area was only one arm of the company, which often did not participate in the business or in decision-making. It was just there to support the company's daily needs. Today, all that has changed: the IT area has become the essence of the business, and it is often the business itself. The biggest companies in the world today are technology companies: companies that do not offer physical products, but only offer data, or rather, information about the interactions of a certain group or of certain people.
Most businesses today depend on IT. What is data in this context?
Data is the company's asset; it is the basis of a new data-based economy that is being born or has already been born. It is simply a new way of doing business, and the one who has more wisdom will be the winner of this marathon.
In this context, we will talk about tools that help companies in these actions, how to prepare the environment for infinite interactions with data, and how to extract value from this information by converting it into knowledge and wisdom, bringing valuable insights to the business. Is your company prepared for this new era?
Why use it?
Having understood the context in which we are involved, we realize that the world has been undergoing changes, and these changes have given us new business opportunities, new ways of working, and a new way of seeing the world. Who would have thought that companies like Uber, Lyft, Airbnb, Didi Chuxing, Netflix, SpaceX, Tesla, and WeWork, among others, would one day exist and change the way of doing business, change the way people interact with each other, and change the way we see the world?
Had it ever crossed your mind that one day you would rent a place to stay other than a hotel or inn, call a car to pick you up and take you wherever you wanted to go, stream a movie at home, or maybe even go to space for a sightseeing tour? Do you remember how it all started? We don't, and you probably don't either.
Today, we use these services as if they had always been part of our lives. But what else can we learn from analyzing these companies in depth? Let's look at SpaceX's example: when Elon Musk created it, he had the goal of landing a miniature experimental greenhouse and growing plants on Mars, before putting his first Falcon 1 rocket into operation. There were so many failures that he almost gave up on the project, but these errors helped him move forward because, through the collected data, he could improve and perfect his rockets to do what many thought would be impossible.
Today, SpaceX's achievements include the first privately funded liquid fuel rocket to reach Earth orbit (Falcon 1 in 2008); the first privately funded company to launch, orbit, and recover a spaceship (Dragon in 2010); the first private company to send a spacecraft to the International Space Station (ISS) (Dragon in 2012); the first propulsive landing of an orbital rocket (Falcon 9 in 2015); and the first reuse of an orbital rocket (Falcon 9 in 2017).
By March 2017, SpaceX had already flown ten missions to the ISS under a cargo resupply contract. NASA also awarded SpaceX a new development contract in 2011 to demonstrate a Dragon that would be used to transport astronauts to the ISS and safely return them to Earth.
Let's look at another example: Tesla, the company of which Elon Musk is the CEO, produces electric cars, but what stands out the most is the ability of these vehicles to drive autonomously. In October 2016, Tesla announced that all the cars it was producing (the Model S, the Model X, and the new Model 3, which was launched in mid-2017) would have the hardware needed for drivers not to have to touch the steering wheel. The models have eight cameras providing 360º visibility at a distance of up to 250 meters around the car, and twelve updated ultrasonic sensors to "complement the view". In the front, a radar with advanced processing capabilities provides additional data about what surrounds the car on a redundant frequency, capable of feeding the system with information even if it is raining, foggy, or there is dust in the air.
How important is the data in these two examples? It is clear that the data drove the business—it was the data that drove these companies to success. Of course, data is not miraculous, but it helped a lot in targeting.
Let's look at another example: Netflix. Basically, it offers movies and TV series in a streaming model, but that is not all it does. Each user who interacts with the application generates events, and these events are processed and analyzed so that they can guide the company in the acquisition of new titles or, more recently, in the creation of its own content. With that, Netflix had 8 of its own productions nominated at the 2020 Oscars, radically changing the way we see films and TV shows. To decide which productions to invest in, which actors to hire, and which plots to pursue, Netflix analyzed a large number of user-generated events to find the patterns that would please its audience, and it managed this with mastery.
Since data is the protagonist of the 21st century, how much information is generated every minute?
Figure 1.3: Data in a minute
We have never generated as much data as we do now; this is the power of data. Earlier, what we saw was just the tip of the iceberg, but today we can already see the complete iceberg. With this amount of information, we can learn how society interacts, or rather, society shows us what its desires are, and from there we can build businesses that fit users' daily needs.
Understanding this power and the dynamics of humanity, companies are able to offer individualized products and services, thus providing a unique experience for each user.
Obviously, we know that this data can be used for good or evil, and that privacy ends up being sacrificed in favor of collective knowledge. Privacy is one of the biggest obstacles faced by the massification of data, since the threat to privacy grows with the storage and integration of personally identifiable information. The world and its current legislation are not prepared for the possibilities that big data offers to aggregate, analyze, and draw conclusions from data that was previously scattered.
Big data has already been cited as an essential tool in manipulating elections and spreading fake news. This is due to the technology's inherent ability to identify and segment a specific target audience, which makes big data ethically questionable, since it can be used to manipulate the masses and to obtain biased results according to the motivation of those who use it.
Solving problems
From the moment a company starts to adopt data in decision-making, the objective becomes finding answers to problems that did not exist before; not that the problems didn't exist, but they simply weren't considered problems earlier. Today, several sectors of society are investing in big data, according to a study by the International Data Corporation (IDC). The objective behind big data is to improve the provision of information to managers, ensuring that decision-making is supported by real and accurate data.
The following are some of the big data applications in different sectors:
The movie Moneyball is based on Michael Lewis’ book Moneyball: The Art of Winning an Unfair Game, which tells the story of Billy Beane, the general manager of the Oakland Athletics baseball team, who used big data to assemble a top team without spending a lot. The film focuses on Beane's attempts to create a competitive team for the 2002 Oakland season, despite the team's unfavorable financial situation, using a sophisticated statistical analysis of the players.
UPS, after analyzing the routes of its drivers, prohibited them from turning left. According to the company, this has saved about 38 million liters of fuel per year, avoiding the emission of 20 thousand tons of carbon dioxide. In addition, the company delivers 350 thousand more packages.
After the Haiti earthquake, American researchers used the geolocation of 2 million SIM cards to assist in humanitarian missions.
To advance research in nuclear and particle physics, CERN (the European Organization for Nuclear Research) created the largest particle accelerator in the world, called the Large Hadron Collider. With it, a huge amount of data