Introduction to Data Platforms: How to leverage data fabric concepts to engineer your organization's data for today's cloud-based digital world
About this ebook
Digital, cloud, and artificial intelligence (AI) have disrupted how we use data. This disruption has changed the way we need to provision, curate, and publish data for the multiple use cases in today's technology-driven environment. This text will cover how to design, develop, and evolve a data platform for all the uses of enterprise data needed in today's digital organization.
This book focuses on explaining what a data platform is, what value it provides, how it is engineered, and how to deploy a data platform and a support organization. In this context, Introduction to Data Platforms
reviews the current requirements for data in the digital age and quantifies the use cases;
discusses the evolution of data over the past twenty years, which is a core driver of the modern data platform;
defines what a data platform is and defines the architectural components and layers of a data platform;
provides the architectural layers or capabilities of a data platform;
reviews cloud- and commercial-software vendors that populate the data-platform space;
provides a step-by-step approach to engineering, deploying, supporting, and evolving a data-platform environment;
provides a step-by-step approach to migrating legacy data warehouses, data marts, and data lakes/sandboxes to a data platform; and
reviews organizational structures for managing data platform environments.
Introduction to Data Platforms - Anthony David Giordano
Copyright © 2022 Anthony David Giordano
All rights reserved
First Edition
Fulton Books
Meadville, PA
Published by Fulton Books 2022
ISBN 979-8-88505-386-0 (paperback)
ISBN 979-8-88505-387-7 (digital)
Printed in the United States of America
I would like to dedicate this book to my daughters—Katie and Kelsie; they teach me something new and wonderful every day.
CONTENTS
Preface
Acknowledgments
Introduction: The Rise of the Data Platform
The need for a data platform
The driver of a new approach: strategic inflexibility
The purpose of a book on data platforms
A blueprint for a data platform
Part 1: The Evolution of the Data Platform
Chapter 1: What Is a Data Platform?
The business drivers for a data platform
Definitions of data platform
Reasons why organizations have not built a data platform
Chapter 2: The Evolution of Use Cases for Data
The first evolution: the transactional data era
The second evolution: the data-warehousing era
The third evolution: the anti-data-warehouse era—the data lake era
Three-one evolution: digital data—the API-friendly data hub
Operational data: infusing AI into operational processes
A comprehensive, integrated approach: the data platform
Part 2: Capabilities of a Data Platform
Chapter 3: An Approach for a Data Platform
Overview of a reference architecture
Data platform architectures today
Data Fabric
Data Mesh
Detailed view of the data-platform reference architecture
Intelligent-Integration Capabilities
Data-Marketplace Capabilities
Insights Capabilities
Digital-Orchestration Capabilities
Experience Capabilities
Scaling the data platform horizontally and vertically
Horizontal Use Cases
Vertical Use Cases
Scaling a data platform across a multicloud environment
Chapter 4: The Intelligent-Integration Capability
An intelligent-integration processing framework
The intelligence in the integration process
Ingestion services
Batch Ingestion Services
Real-Time Ingestion Services
Profiling and Metadata-Capture Services
Data-quality services
Master-data management services
Curation services
Publish services
Management support services
Configuring an intelligent-integration environment
Chapter 5: The Data Marketplace
Raw layer
Conform layer
Consumption layer
Physical vs. Virtual Layer
Chapter 6: The Other Data-Platform Components
Insights Components
Data-Visualization Layer
Data-Science Predictive-Modeling Layer
Detailed view of the digital orchestration capabilities
Detailed view of the experience component
Part 3: Implementing a Data Platform
Chapter 7: How to Build a Data Platform
The need for a new approach for data platforms
Configure and evolve versus waterfall
Overview of a data-platform methodology
Evolution vs. Manage: Data Ops
Insights migration approaches
Chapter 8: Data-Platform Use Cases
Case study 1: a digital transformation of a retail bank
Case study 2: a data science and data governance transformation of a pharmaceutical company
Pharmaceutical Company
Chapter 9: Data-Platform Cloud Implementations
Detailed review of AWS data-platform technologies
Detailed review of Microsoft Azure’s data-platform technologies
Detailed review of Google Cloud data-platform technologies
Detailed review of IBM’s data-platform technologies
Commercial data platforms
C3.ai
Palantir
Other notable cloud-based data technologies
Snowflake
Databricks
Best practices on data-platform cloud implementations
Chapter 10: An Operating Model for a Modern Data Platform
The impact of the enterprise organizational structure
Primary Functions of a Data Organization
Data-Platform-Architecture Management Services
Information Governance Services
Data Development and Evolution Services
The Need for Change Management
Afterword
PREFACE
This text gives information technology executives, chief data officers, and data practitioners a detailed review of what a data platform is, along with the benefits and the reasons why they should seriously consider migrating their current data estate to one. Throughout the text, case studies accompany each of the topics on designing, implementing, managing, and evolving a data platform.
The text starts with an explanation of how the use cases for data have evolved over the past twenty years, moving from transactional data design to simple business-intelligence (BI) reporting and eventually to today’s multipurpose, multi-use-case, real-time data environments instantiated as data platforms. These use cases include traditional reporting (it’s not going away), data visualization, data science, digital, and operational (integrated with ML/AI capabilities). The text illustrates how the architectures for data have evolved over the same period into next-generation concepts that allow a greater, more strategic use of data, one that is integral to the digital world in which we now do business.
The text covers how information architecture has evolved from its early days of simple transactional concepts to the current focus of data fabric and data mesh.
The text covers in detail the core layers or components of a data platform and how data is ingested, qualified, curated, and conformed into both enterprise and application layers, which create multiuse data environments that reduce redundancy and cost while ensuring flexibility. The text brokers a pragmatic conversation on when to use enterprise versus application data layers in a data-platform environment. In covering data-fabric concepts, it weighs the benefits and costs of when to physicalize and when to virtualize data. The book also covers the essential nondata layers or capabilities of a data platform, which illustrate how to integrate a data platform into a broader digital ecosystem and how to engineer it to drive value for each of the multiple use cases.
It will review commercial data technologies, including the cloud vendors’ native technological approaches for a data platform, and will include a conversation on how best to migrate your current data estate to a data platform. Finally, it covers how to create a data organization to deploy, sustain, and evolve a modern data platform.
Intended audience
This text serves many different audiences. Experienced information management executives and chief data officers can use it to better understand the business case for a data platform, or simply as a source of best practices for blueprinting, engineering, implementing, populating, and operating a data platform. The intended audiences include the following:
chief information and technology officers
chief data officers
data and analytic consultants
data solution architects and data engineers
program/project managers
other information management practitioners
Scope of the text
This book focuses on explaining what a data platform is, what value it provides, how it is engineered, and how to deploy a data platform and a support organization.
With that goal in mind, Introduction to Data Platforms
reviews the current requirements for data in the digital age and quantifies the use cases;
discusses the evolution of data over the past twenty years, which is a core driver of the modern data platform;
defines what a data platform is and the architectural components and layers of a data platform;
provides the architectural layers or capabilities of a data platform;
reviews cloud and commercial software vendors that populate the data-platform space;
provides a step-by-step approach to engineering, deploying, supporting, and evolving a data-platform environment;
provides a step-by-step approach to migrating legacy data warehouses, data marts, and data lakes/sandboxes to a data platform; and
reviews organizational structures for managing data-platform environments.
ACKNOWLEDGMENTS
The art and science required for a data platform in the digital age requires a significant amount of experience in the field and countless hours of configuring data technologies on multiple clouds into easy-to-use capabilities. The architectural principles and data-management processes defined in this book are a result of actual project work that is a product of those countless hours of implementing architectures and hardening those architectural concepts and processes that, today, run in all evolved data platforms in our organizations. These efforts can only be performed in collaboration with knowledgeable, dedicated, and experienced practitioners. In particular, I would like to acknowledge Mehdi Charafeddine, Glenn Finch, Jay Houghton, Ron Koch, and Ron Shelby—all of whom played an integral part in the development of this book.
INTRODUCTION:
The Rise of the Data Platform
The need for a data platform
They say that change is inevitable, and it is. Some changes are visceral and so revolutionary that everyone instantly sees, recognizes, and embraces them. Others are so subtle that when they occur, only the savvy see them and exploit them to take the competitive advantage that the change provides. Data platforms are that next quiet evolution in technology, one that will provide greater strategic flexibility with your data and better data governance and quality in a more cost-effective manner. A data platform is a common data environment that provisions multiple business use cases.
The driver of a new approach: strategic inflexibility
The era of the data platform has started in an already very mature data-management world. There are very few organizations today that do not have a host of data technologies in their environments. In fact, that is the problem: the era of the greenfield data environment passed twenty years ago, if not longer. Today, most organizations have multiple nonintegrated legacy data warehouses and marts, Hadoop clusters / data lakes, or NoSQL stores, all performing some function in their environment but at a significant cost in terms of data integration and duplication. While each of these technologies serves its specific purpose, in aggregate they form an inflexible, expensive infrastructure that tends to be difficult to extend and is poorly understood. The specter of data-quality issues always hangs over these environments, because such issues inevitably arise when multiple data stores are fed by multiple data-integration environments. This proliferation of data technologies and approaches has created significant challenges beyond just the data-quality issue. These symptoms include the following:
Long, costly data science modeling timelines. Finding the right training data and then crafting it into a usable data set takes up 75–80 percent of the time for a data science experiment.
Lack of trusted data and metrics. Organizations are often paralyzed by having multiple reports that use the same data but show different totals, a result of the data-quality issues mentioned earlier.
Lack of consistent metadata and reusable components. The good news is that many organizations now have a rudimentary metadata catalog. The bad news is that it often covers perhaps one part of one data warehouse in their data portfolio. Very few organizations have metadata-cataloging capabilities for all the data technologies in their portfolio. Capturing all metadata on ingestion, then maintaining it and, most importantly, reusing it is a function that most organizations have not implemented and matured. Organizations need to understand what data they have cataloged, where it is, and how it is defined in the increasingly heterogeneous, hybrid, cloud-based data landscape. Metadata and model-management reusability concepts are particularly relevant in the data science space. Data science is a capability that has matured beyond the artisan phase, where every model needs to be developed from the ground up. Organizations that have built processes to develop an asset-based approach with reusable components are winning in the field. Prebuilt data science blueprints and in-house and commercial algorithm libraries are giving many organizations the ability to shorten their time-to-value and are providing them with a competitive advantage.
Inability to integrate with digital channels. As many organizations continue their digital transformation, they are finding that their legacy data environments are not agile and flexible enough to enable their organizational data for digital channels. Digital channels require real-time provisioning, decision-making, and action. Old batch-data warehouses that produce daily reports and query environments are simply not engineered for the flexibility, speed, and throughput necessary for today’s digital environments. Most digital architectures today portray data hubs with both batch and real-time ingestion and API (application programming interface) layers that can orchestrate data in the digital channels.
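The idea of capturing metadata on ingestion so that it can be maintained and reused, described in the list above, can be sketched in a few lines of Python. This is a minimal illustration only, not a product design: the names MetadataCatalog, ColumnProfile, and ingest are hypothetical, the catalog lives in memory, and the sketch assumes rows arrive as simple dictionaries with hashable values.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ColumnProfile:
    """Basic technical metadata captured for one column at ingestion time."""
    name: str
    inferred_type: str
    null_count: int
    distinct_count: int


@dataclass
class MetadataCatalog:
    """Hypothetical in-memory catalog keyed by data set name."""
    entries: dict[str, list[ColumnProfile]] = field(default_factory=dict)

    def register(self, dataset: str, rows: list[dict[str, Any]]) -> list[ColumnProfile]:
        # Collect every column name that appears in any row.
        columns = sorted({key for row in rows for key in row})
        profiles = []
        for col in columns:
            values = [row.get(col) for row in rows]  # missing keys count as nulls
            present = [v for v in values if v is not None]
            types = {type(v).__name__ for v in present}
            if len(types) == 1:
                inferred = types.pop()
            else:
                inferred = "mixed" if types else "unknown"
            profiles.append(
                ColumnProfile(col, inferred, len(values) - len(present), len(set(present)))
            )
        self.entries[dataset] = profiles
        return profiles


def ingest(catalog: MetadataCatalog, dataset: str, rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Profile the data as it is ingested, then pass the rows on unchanged."""
    catalog.register(dataset, rows)
    return rows
```

Because the profile is recorded as a side effect of ingestion, every data set that enters through ingest is cataloged in one place, which is the precondition for the reuse the text calls for.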
With all these challenges, many organizations are turning to the cloud and expecting it to be their silver bullet: just move all these environments to the public cloud, and their cost, quality, and management issues will all be solved. The fact is that moving all these different environments to the cloud will not reduce their cost and will very likely triple it. The reason is that migrating a Teradata data warehouse and a Hadoop data lake means moving all the data structures, data, and data-integration processes to the target cloud environment. Moving data to the cloud is not cheap. Unlike most organizations, which do not truly manage network traffic from a cost perspective, cloud vendors do. An organization whose on-premises environment sources data from the same customer source system into three data warehouses, a Hadoop data lake for data science, and a Cassandra-based digital environment has to pay for every one of those data movements in the cloud.
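The arithmetic behind the duplicated data movements can be made concrete with a small sketch. The function name and the daily volume used in the usage note are hypothetical, chosen only for illustration; the sketch counts data moved per day and deliberately ignores storage, compute, and vendor pricing, so it is not a cost forecast.

```python
def daily_volume_moved(source_gb_per_day: float, target_count: int) -> float:
    """Volume moved per day when one source feeds each target separately.

    Assumes, for illustration, that every target receives a full copy of
    the source feed through its own data-integration process.
    """
    if target_count < 0:
        raise ValueError("target_count must not be negative")
    return source_gb_per_day * target_count


def movement_ratio(legacy_targets: int, platform_targets: int = 1) -> float:
    """How many times more data moves in the lifted-and-shifted estate."""
    if platform_targets <= 0:
        raise ValueError("platform_targets must be positive")
    return legacy_targets / platform_targets
```

With the five targets named in the paragraph (three data warehouses, one Hadoop data lake, and one Cassandra-based environment) and a hypothetical 10 GB daily customer feed, daily_volume_moved(10.0, 5) moves five times the volume of a single shared ingestion, daily_volume_moved(10.0, 1).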
The purpose of a book on data platforms
The purpose of this book is to define what a data platform is, what the components or layers in a data platform are, what the technologies and processes are for each layer, and the supporting organizational structure needed to sustain and evolve a data platform. It will cover the value a data platform will provide in comparison to a collection of data warehouses, data marts, and data lakes. It will start with a section on the evolution of the data platform and on how the different use cases for data have driven transactional and analytics architectures to evolve, through disruptive changes, into the modern data platform. This includes a review of the influences of early transactional and analytic processing, which are still critical use cases and design patterns in the data platform. It reviews the anti-data-warehouse era, where organizations used Hadoop clusters to build data lakes along with data science sandboxes. The rise of digital processing created a whole new use case of sending events (both transactional and nontransactional) bidirectionally on digital channels using AI-embedded models to predict or recommend next-step activities. These use cases required data technologies engineered for stateless interactions, which are easily enabled with stateless APIs such as REST.
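What stateless means for such an event-driven interaction can be shown with a short sketch. The handler below is an assumption made for illustration: the event fields, the HIGH_VALUE_CART threshold, and the simple rule standing in for an AI-embedded model are all hypothetical, and no web framework is involved.

```python
import json

# Hypothetical threshold standing in for a trained predictive model.
HIGH_VALUE_CART = 100.0


def recommend_next_step(event: dict) -> str:
    """Toy rule used in place of an AI-embedded model."""
    is_cart_event = event.get("type") == "cart_updated"
    if is_cart_event and event.get("cart_value", 0.0) >= HIGH_VALUE_CART:
        return "offer_free_shipping"
    return "show_related_items"


def handle_event(request_body: str) -> str:
    """Stateless handler: the response depends only on the request body.

    Nothing is stored between calls, so any server instance can answer
    any request, which is what makes REST-style APIs a natural fit.
    """
    event = json.loads(request_body)
    response = {
        "event_id": event.get("event_id"),
        "next_step": recommend_next_step(event),
    }
    return json.dumps(response)
```

Because the handler keeps no session state, sending the same event twice yields the same response, and capacity can be added simply by running more copies of it.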
A blueprint for a data platform
The growing set of use cases for data and its increased importance in digital channels have generated the need for an architectural approach that provides commonality and consistency at an enterprise level but with the flexibility to easily enable components of data and analytics into digital channels via APIs. This need has generated multilayered blueprints, or, as they are known in the information technology community, reference architectures. Many reference architectures for data are being discussed in the industry, and they are often referred to as a data fabric or data mesh. In this book, the blueprint will be referred to as the architecture for a data platform. This reference architecture is designed to address the multiple use cases for data in the digital age, including digital, operational, and analytic data use cases, where each use case can stand independently or be integrated into a broader data framework. It will cover the following component layers:
Intelligent-integration capabilities. This covers the types of data ingested in a modern data platform, including batch and real-time technologies with automated AI-infused profiling capabilities. This includes a review on the expanded need for curated data beyond the traditional transformations in the traditional process of ETL (extract, transform, and load). Integration is now intelligent with AI (artificial intelligence) capabilities assisting in the curation processes to conform, calculate, and aggregate data based on use cases. It also covers AI-infused data quality, master-data management, data science sandbox engineering, and bidirectional digital interactions.
The data marketplace. This section will address the different data designs and technological approaches needed to meet the multiple use cases for data in the digital environment. It will also address the recent trend toward data virtualization, covering both its opportunity and its reality.
Insights. The insights capability derives business value from the data marketplace. It develops different types of insights based on need to guide business decisions, through data visualization and standard reporting as well as through data science modeling. This is the interface where data is transformed into usable information. It takes a pragmatic look at the shift from thousands of BI (business intelligence) reports to modern data visualization tools for the digital age, and at how the shift to embedding predictive models into digital channels, which creates intelligent workflows, is the next evolution of insights.
Digital orchestration component. This digital integration capability includes topics such as APIs that connect the data platform into digital channels and applications. It includes a review of integrating AI and ML applications in open-source capabilities such as Kubeflow as well as event-based interactions with nontraditional data sources such as IoT (Internet of Things) edge-based devices.
Experience component. This component combines insights, data, and orchestration capabilities into an organization’s digital channels. Examples of the experience layer are programmatic marketing (inbound and outbound) and e-commerce interactions.
This book also covers commercial data-platform technologies and the data-platform offerings of cloud vendors such as AWS, Microsoft Azure, Google Cloud, and IBM.
It provides approaches and techniques on how to build out a data-platform environment for both greenfield and legacy data environments. Since most organizations today have an existing analytics data environment, it provides a point of view on how to migrate a legacy data environment into a modern data platform.
Finally, the book covers the types of data-management operating models and organizational roles that are needed to build, sustain, and evolve a modern data-platform environment that addresses the many use cases of data needed in a digital organization.
PART 1
The Evolution of the Data Platform
What Is a Data Platform?
The Evolution of the Use Cases for Data
CHAPTER 1
What Is a Data Platform?
The first section of this text, The Evolution of the Data Platform, sets the stage by reviewing the evolution of data usage, which is driving the need for a new way to provision and store data, such as the data platform, for today’s digital environment. It covers how the industry has progressed in its use of data from static reports to real-time decision-making. It analyzes why those organizations that have chosen not to take advantage of this new capability will be at a competitive disadvantage in terms of strategic flexibility, digital enablement, and cost management. Next, it builds the technical case for a data platform by delving into earlier versions of data environments such as the data warehouse, data mart, data lake, and data science sandboxes. The book then reviews the evolution of data architectures. It covers the business and technical problems they solved and those they created that drove the need for the next evolution. This evolution of capabilities and constraints has led to the concept of the data platform.
Chapter 1, What Is a Data Platform?, provides the technical and business case for a data platform, based on the evolving needs and use cases for data that are driven by disruptive forces such as digital and artificial intelligence (AI). It will define what a data platform is and the risks of not having one. It will also cover reasons why organizations have not built a data platform.
The business drivers for a data platform
The business need for a data platform is based on the new uses for data, centered on three main factors: digital transformation, the advent of artificial intelligence, and the mass migration to the cloud. The discussion on data today starts with a conversation on digital transformation. Digital transformation is not new. In fact, it can be accurately stated that it is at least twenty-five years old. Early-adopter organizations that started as or moved to digital have gained a significant competitive advantage in their industries. Meanwhile, the rest of the world has recognized the imperative of going digital in the past ten years and has started transformation programs of some sort, trying to catch up. Social media, digital marketing, and e-commerce are all visceral aspects of the world’s pivot to digital. The COVID-19 pandemic has accelerated the world’s economy into those digital channels of working and purchasing as the stay-at-home orders descended from national to local governments. The fuel for this digital revolution is data. Every event on a digital channel is an opportunity to quantify and analyze behaviors that will drive usage, cost savings, or additional revenue.
Digital is not the only driver for a data platform; artificial intelligence (AI) / machine learning is every bit as disruptive as digital in its use of data and is a key enabler for real-time decision-making in digital channels. To develop these AI processes, the data scientist requires data science sandboxes for training and test data.
The final driver for a data platform is the cloud. The promise of lower cost and less management is driving organizations to plan and move their data estates to the cloud with legacy- and digitally driven use cases.
Figure 1.1. The multiple use cases for data.
A data platform provides a multipurpose environment to provision and provide data for all these use cases in a common, cost-effective manner that does not require massive duplication, ensures higher quality data, and reduces operational costs. The need for a common data environment to meet these use cases becomes readily apparent when one considers the technical and business drivers. One of the many (and maybe not the best) reasons