Introduction to Data Platforms: How to leverage data fabric concepts to engineer your organization's data for today's cloud-based digital world
Ebook, 382 pages, 4 hours

About this ebook

Digital, cloud, and artificial intelligence (AI) have disrupted how we use data. This disruption has changed the way we need to provision, curate, and publish data for the multiple use cases in today's technology-driven environment. This text will cover how to design, develop, and evolve a data platform for all the uses of enterprise data needed in today's digital organization.

This book focuses on explaining what a data platform is, what value it provides, how it is engineered, and how to deploy a data platform and a support organization. In this context, Introduction to Data Platforms

reviews the current requirements for data in the digital age and quantifies the use cases;

discusses the evolution of data over the past twenty years, which is a core driver of the modern data platform;

defines what a data platform is and defines the architectural components and layers of a data platform;

provides the architectural layers or capabilities of a data platform;

reviews cloud- and commercial-software vendors that populate the data-platform space;

provides a step-by-step approach to engineering, deploying, supporting, and evolving a data-platform environment;

provides a step-by-step approach to migrating legacy data warehouses, data marts, and data lakes/sandboxes to a data platform; and

reviews organizational structures for managing data platform environments.

Language: English
Release date: Nov 3, 2022
ISBN: 9798885053877

    Book preview

    Introduction to Data Platforms - Anthony David Giordano

    Title Page

    Copyright © 2022 Anthony David Giordano

    All rights reserved

    First Edition

    Fulton Books

    Meadville, PA

    Published by Fulton Books 2022

    ISBN 979-8-88505-386-0 (paperback)

    ISBN 979-8-88505-387-7 (digital)

    Printed in the United States of America

    I would like to dedicate this book to my daughters—Katie and Kelsie; they teach me something new and wonderful every day.

    CONTENTS

    Preface

    Acknowledgments

    Introduction: The Rise of the Data Platform

    The need for a data platform

    The driver of a new approach: strategic inflexibility

    The purpose of a book on data platforms

    A blueprint for a data platform

    Part 1: The Evolution of the Data Platform

    Chapter 1: What Is a Data Platform?

    The business drivers for a data platform

    Definitions of data platform

    Reasons why organizations have not built a data platform

    Chapter 2: The Evolution of Use Cases for Data

    The first evolution: the transactional data era

    The second evolution: the data-warehousing era

    The third evolution: the anti-data-warehouse era—the data lake era

    Three-one evolution: digital data—the API-friendly data hub

    Operational data: infusing AI into operational processes

    A comprehensive, integrated approach: the data platform

    Part 2: Capabilities of a Data Platform

    Chapter 3: An Approach for a Data Platform

    Overview of a reference architecture

    Data platform architectures today

    Data Fabric

    Data Mesh

    Detailed view of the data-platform reference architecture

    Intelligent-Integration Capabilities

    Data-Marketplace Capabilities

    Insights Capabilities

    Digital-Orchestration Capabilities

    Experience Capabilities

    Scaling the data platform horizontally and vertically

    Horizontal Use Cases

    Vertical Use Cases

    Scaling a data platform across a multicloud environment

    Chapter 4: The Intelligent-Integration Capability

    An intelligent-integration processing framework

    The intelligence in the integration process

    Ingestion services

    Batch Ingestion Services

    Real-Time Ingestion Services

    Profiling and Metadata-Capture Services

    Data-quality services

    Master-data management services

    Curation services

    Publish services

    Management support services

    Configuring an intelligent-integration environment

    Chapter 5: The Data Marketplace

    Raw layer

    Conform layer

    Consumption layer

    Physical vs. Virtual Layer

    Chapter 6: The Other Data-Platform Components

    Insights Components

    Data-Visualization Layer

    Data-Science Predictive-Modeling Layer

    Detailed view of the digital orchestration capabilities

    Detailed view of the experience component

    Part 3: Implementing a Data Platform

    Chapter 7: How to Build a Data Platform

    The need for a new approach for data platforms

    Configure and evolve versus waterfall

    Overview of a data-platform methodology

    Evolution vs. Manage: Data Ops

    Insights migration approaches

    Chapter 8: Data-Platform Use Cases

    Case study 1: a digital transformation of a retail bank

    Case study 2: a data science and data governance transformation of a pharmaceutical company

    Pharmaceutical Company

    Chapter 9: Data-Platform Cloud Implementations

    Detailed review of AWS data-platform technologies

    Detailed review of Microsoft Azure’s data-platform technologies

    Detailed review of Google Cloud data-platform technologies

    Detailed review of IBM’s data-platform technologies

    Commercial data platforms

    C3.ai

    Palantir

    Other notable cloud-based data technologies

    Snowflake

    Databricks

    Best practices on data-platform cloud implementations

    Chapter 10: An Operating Model for a Modern Data Platform

    The impact of the enterprise organizational structure

    Primary Functions of a Data Organization

    Data-Platform-Architecture Management Services

    Information Governance Services

    Data Development and Evolution Services

    The Need for Change Management

    Afterword

    PREFACE

    This text gives information technology executives, chief data officers, and data practitioners a detailed review of what a data platform is, along with its benefits and the reasons they should seriously consider migrating their current data estate to one. Throughout the text, case studies accompany each of the topics on designing, implementing, managing, and evolving a data platform.

    The text starts with an explanation of how the use cases for data have evolved over the past twenty years, starting with transactional data design to simple business-intelligence (BI) reporting and eventually evolving into today’s multipurpose, multi-use-case, real-time data environments instantiated as data platforms. These use cases include traditional reporting (it’s not going away), data visualization, data science, digital, and operational (integrated with ML/AI capabilities). It illustrates how the architectures for data have evolved over the past twenty years into next-generation concepts that have allowed a greater use of data that is more strategic and integral in the digital world, in which we are now doing business.

    The text covers how information architecture has evolved from its early days of simple transactional concepts to the current focus on data fabric and data mesh. It covers in detail the core layers or components of a data platform and how data is ingested, qualified, curated, and conformed into both enterprise and application layers, which create multiuse data environments that reduce redundancy and cost while ensuring flexibility. The text brokers a pragmatic conversation on when to use enterprise versus application data layers in a data-platform environment. In covering data-fabric concepts, it weighs the benefits and costs of physicalizing versus virtualizing data. The book also covers the essential nondata layers or capabilities of a data platform, which illustrate how to integrate a data platform into a broader digital ecosystem and how to engineer it to drive value for each of the multiple use cases.

    It reviews commercial data technologies, including the cloud vendors’ native technological approaches to a data platform, along with a discussion of how best to migrate your current data estate to a data platform. Finally, it covers how to create a data organization to deploy, sustain, and evolve a modern data platform.

    Intended audience

    This text serves many different audiences. Experienced information management executives and chief data officers can use it to better understand the business case for a data platform, or simply as a source of best practices for blueprinting, engineering, implementing, populating, and operating one. The intended audiences include the following:

    chief information and technology officers

    chief data officers

    data and analytic consultants

    data solution architects and data engineers

    program/project managers

    other information management practitioners

    Scope of the text

    This book focuses on explaining what a data platform is, what value it provides, how it is engineered, and how to deploy a data platform and a support organization.

    With that goal in mind, Introduction to Data Platforms

    reviews the current requirements for data in the digital age and quantifies the use cases;

    discusses the evolution of data over the past twenty years, which is a core driver of the modern data platform;

    defines what a data platform is and the architectural components and layers of a data platform;

    provides the architectural layers or capabilities of a data platform;

    reviews cloud and commercial software vendors that populate the data-platform space;

    provides a step-by-step approach to engineering, deploying, supporting, and evolving a data-platform environment;

    provides a step-by-step approach to migrating legacy data warehouses, data marts, and data lakes/sandboxes to a data platform; and

    reviews organizational structures for managing data-platform environments.

    ACKNOWLEDGMENTS

    The art and science behind a data platform in the digital age require a significant amount of experience in the field and countless hours of configuring data technologies on multiple clouds into easy-to-use capabilities. The architectural principles and data-management processes defined in this book are the result of actual project work: countless hours of implementing architectures and hardening the architectural concepts and processes that, today, run in all evolved data platforms in our organizations. These efforts can only be performed in collaboration with knowledgeable, dedicated, and experienced practitioners. In particular, I would like to acknowledge Mehdi Charafeddine, Glenn Finch, Jay Houghton, Ron Koch, and Ron Shelby, all of whom played an integral part in the development of this book.

    INTRODUCTION:

    The Rise of the Data Platform

    The need for a data platform

    They say that change is inevitable, and it is. Some changes are visceral and so revolutionary that everyone instantly sees, recognizes, and embraces them. Others are so subtle that when they occur, only the savvy see them and exploit them to take the competitive advantage they provide. Data platforms are that next quiet evolution in technology, one that will provide greater strategic flexibility with your data and better data governance and quality in a more cost-effective manner. A data platform is a common data environment that provisions multiple business use cases.

    The driver of a new approach: strategic inflexibility

    The era of the data platform has started in an already very mature data-management world. There are very few organizations today that do not have a host of data technologies in their environments. In fact, that is the problem: the era of the greenfield data environment passed twenty years ago, if not longer. Today, most organizations have multiple nonintegrated legacy data warehouses and marts, Hadoop clusters / data lakes, or NoSQL stores, all performing some function in their environment but at a significant cost in terms of data integration and duplication. While each of these technologies serves its specific purpose, in aggregate they form an inflexible, expensive infrastructure that tends to be difficult to extend and is poorly understood. The specter of data-quality issues inevitably hangs over these environments, as it does whenever there are multiple data stores fed by multiple data-integration environments. This proliferation of data technologies and approaches has created significant challenges beyond just the data-quality issue. These symptoms include the following:

    Long, costly data science modeling timelines. Finding the right training data then crafting it into a usable data set takes up 75–80 percent of the time for a data science experiment.

    Lack of trusted data and metrics. Organizations often are paralyzed with the issue of having multiple reports with the same data and different totals, resulting in the data-quality issues mentioned earlier.

    Lack of consistent metadata and reusable components. The good news is that many organizations now have a rudimentary metadata catalog. The bad news is that often, it covers perhaps one part of one data warehouse in their data portfolio. Very few organizations have metadata-cataloging capabilities for all the data technologies in their portfolio. The ability to capture all metadata on ingestion, maintain it, and, most importantly, reuse it is a function that most organizations have not implemented and matured. Organizations need to understand what data they have cataloged, where it is, and how it is defined in an increasingly heterogeneous, hybrid, cloud-based data landscape. Metadata and model-management reusability concepts are particularly relevant in the data science space. Data science is a capability that has matured beyond the artisan phase, where every model needs to be developed from the ground up. Organizations that have built processes to develop an assets-based approach with reusable components are winning in the field. Having prebuilt data science blueprints and in-house and commercial algorithm libraries allows many organizations to shorten their time-to-value and gives them a competitive advantage.

    Inability to integrate with digital channels. As many organizations continue their digital transformation, they are finding that their legacy data environments are not agile and flexible enough to enable their organizational data for digital channels. Digital channels require real-time provisioning, decision-making, and action. Old batch-data warehouses that produce daily reports and query environments are simply not engineered for the flexibility, speed, and throughput necessary for today’s digital environments. Most digital architectures today portray data hubs with both batch and real-time ingestion and API (application programming interface) layers that can orchestrate data in the digital channels.

    With all these challenges, many organizations are turning to the cloud and expecting it to be their silver bullet: just move all these environments to the public cloud, and their cost, quality, and management issues will all be solved. The fact is that moving all these different environments to the cloud will not reduce their cost but will very likely triple it. The reason is that migrating a Teradata data warehouse and a Hadoop data lake means moving all the data structures, data, and data-integration processes to the target cloud environment. Moving data to the cloud is not cheap. Unlike most organizations, which do not truly manage network traffic from a cost perspective, cloud vendors do. An organization whose on-premises environment feeds data from the same customer source system into three data warehouses, a Hadoop data lake for data science, and a Cassandra-based digital environment has to pay for every one of those data movements once they run in the cloud.
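
    The data-movement point can be made concrete with a little arithmetic. The sketch below is illustrative only: the five targets mirror the example above, while the daily volume and the per-gigabyte charge are hypothetical placeholders, not any vendor's pricing. It counts data-movement charges alone, so it does not attempt to reproduce the overall cost estimate given above.

```python
# Hypothetical illustration: every figure below is an assumption, not vendor pricing.
DAILY_SOURCE_GB = 50        # assumed size of the daily customer extract
PRICE_PER_GB_MOVED = 0.10   # assumed flat charge per GB moved (placeholder value)

# Targets fed from the same customer source system, as in the example above.
legacy_targets = [
    "data warehouse 1",
    "data warehouse 2",
    "data warehouse 3",
    "Hadoop data lake",
    "Cassandra digital store",
]


def monthly_movement_cost(num_feeds: int, days: int = 30) -> float:
    """Cost of moving the same extract to `num_feeds` targets every day."""
    return num_feeds * DAILY_SOURCE_GB * PRICE_PER_GB_MOVED * days


# Lift-and-shift keeps one feed per legacy target.
lift_and_shift = monthly_movement_cost(len(legacy_targets))
# A common data environment ingests the source once and reuses it.
single_ingest = monthly_movement_cost(1)
ratio = lift_and_shift / single_ingest

print(f"lift-and-shift: {lift_and_shift:.2f} per month")
print(f"single ingest:  {single_ingest:.2f} per month")
print(f"ratio:          {ratio:.1f}x")
```

    Under these assumptions, the movement charge scales linearly with the number of redundant feeds, which is the duplication a common data environment is meant to remove.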

    The purpose of a book on data platforms

    The purpose of this book is to define what a data platform is, what the components or layers in a data platform are, what the technologies and processes are for each layer, and what supporting organizational structure is needed to sustain and evolve a data platform. It will cover the value a data platform provides in comparison to a collection of data warehouses, data marts, and data lakes. It will start with a section on the evolution of the data platform and on how the different use cases for data have reshaped transactional and analytics architectures over time, through disruptive changes, into the modern data platform. This includes a review of the influences of early transactional and analytic processing, which are still critical use cases and design patterns in the data platform. It reviews the anti-data-warehouse era, where organizations used Hadoop clusters to build data lakes along with data science sandboxes. The rise of digital processing created a whole new use case of sending events (both transactional and nontransactional) bidirectionally on digital channels using AI-embedded models to predict or recommend next-step activities. These use cases required data technologies engineered for stateless interactions, which are easily enabled with stateless APIs such as REST.

    A blueprint for a data platform

    The growing set of use cases for data and its increased importance in digital channels have generated the need for an architectural approach that provides commonality and consistency at an enterprise level but with the flexibility to easily enable components of data and analytics into digital channels via APIs. This need has generated multilayered blueprints, or, as they are known in the information technology community, reference architectures. Many reference architectures for data are being discussed in the industry, and they are often referred to as a data fabric or data mesh. In this book, the reference architecture is referred to as the architecture for a data platform. It is designed to address the multiple use cases for data in the digital age, including digital, operational, and analytic data use cases, where each use case can stand independently or be integrated into a broader data framework. It will cover the following component layers:

    Intelligent-integration capabilities. This covers the types of data ingested in a modern data platform, including batch and real-time technologies with automated AI-infused profiling capabilities. This includes a review of the expanded need for curated data beyond the transformations of the traditional ETL (extract, transform, and load) process. Integration is now intelligent, with AI (artificial intelligence) capabilities assisting in the curation processes to conform, calculate, and aggregate data based on use cases. It also covers AI-infused data quality, master-data management, data science sandbox engineering, and bidirectional digital interactions.

    The data marketplace. This section will address the different data designs and technological approaches needed to meet the multiple use cases for data in the digital environment. It will also address the recent trend toward data virtualization, discussing both its opportunity and its reality.

    Insights. The insights capability derives business value from the data marketplace. It develops different types of insights based on need to guide business decisions, using data visualization and standard reporting as well as data science modeling. This is the interface where data is transformed into usable information. It takes a pragmatic look at the shift from thousands of BI (business intelligence) reports to modern data visualization tools for the digital age, and at how embedding predictive models into digital channels, which creates intelligent workflow, is the next evolution of insights.

    Digital orchestration component. This digital integration capability includes topics such as APIs that connect the data platform into digital channels and applications. It includes a review of integrating AI and ML applications in open-source capabilities such as Kubeflow as well as event-based interactions with nontraditional data sources such as IoT (Internet of Things) edge-based devices.

    Experience component. This component combines insights, data, and orchestration capabilities into an organization’s digital channels. Examples of the experience layer are programmatic marketing (inbound and outbound) and e-commerce interactions.
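
    One way to picture how these five layers relate is as a small dependency structure. The sketch below is a reading aid under stated assumptions, not the book's formal model: the class name, role summaries, and `consumes_from` links are hypothetical, and only the links from insights to the data marketplace and from the experience component to insights, data, and orchestration come directly from the descriptions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    """One layer of the reference architecture (hypothetical model)."""
    name: str
    role: str
    consumes_from: tuple  # names of upstream layers; partly an assumed reading


REFERENCE_ARCHITECTURE = (
    Capability("intelligent integration",
               "ingest, profile, qualify, and curate data", ()),
    # Assumption: the marketplace is populated by the integration layer.
    Capability("data marketplace",
               "hold data designs for multiple use cases",
               ("intelligent integration",)),
    Capability("insights",
               "visualization, reporting, and data science modeling",
               ("data marketplace",)),
    # Assumption: orchestration exposes both data and insights via APIs.
    Capability("digital orchestration",
               "APIs and events that connect to digital channels",
               ("data marketplace", "insights")),
    Capability("experience",
               "combine insights, data, and orchestration in digital channels",
               ("insights", "data marketplace", "digital orchestration")),
)


def dependencies_are_ordered(layers) -> bool:
    """Return True if every layer only consumes from layers listed before it."""
    seen = set()
    for layer in layers:
        if any(dep not in seen for dep in layer.consumes_from):
            return False
        seen.add(layer.name)
    return True


for layer in REFERENCE_ARCHITECTURE:
    upstream = ", ".join(layer.consumes_from) or "source systems"
    print(f"{layer.name}: {layer.role} (fed by: {upstream})")
```

    Listing the layers in dependency order makes it easy to check that no layer relies on one defined after it, which is what `dependencies_are_ordered` verifies.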

    This book also covers commercial data-platform technologies and cloud vendors such as AWS, Microsoft Azure, Google Cloud, and IBM’s data-platform offerings.

    It provides approaches and techniques for building out a data-platform environment, for both greenfield and legacy data environments. Since most organizations today have an existing analytics data environment, it provides a point of view on how to migrate a legacy data environment into a modern data platform.

    Finally, the book covers the types of data-management operating models and organizational roles that are needed to build, sustain, and evolve a modern data-platform environment that addresses the many use cases for data in a digital organization.

    PART 1

    The Evolution of the Data Platform

    What Is a Data Platform?

    The Evolution of the Use Cases for Data

    CHAPTER 1

    What Is a Data Platform?

    The first section of this text, The Evolution of the Data Platform, sets the stage by reviewing the evolution of data usage, which is driving the need for a new way to provision and store data, such as the data platform, for today’s digital environment. It covers how the industry has progressed in its use of data from static reports to real-time decision-making. It analyzes why organizations that have chosen not to take advantage of this new capability will be at a competitive disadvantage in terms of strategic flexibility, digital enablement, and cost management. Next, it builds the technical case for a data platform by delving into earlier versions of data environments such as the data warehouse, data mart, data lake, and data science sandboxes. The book then reviews the evolution of data architectures. It covers the specific business and technical problems they solved and the problems they created, which drove the need for the next evolution. This evolution of capabilities and constraints has led to the concept of the data platform.

    Chapter 1, What Is a Data Platform?, provides the technical and business case for a data platform, based on the evolving needs and use cases for data that are driven by disruptive forces such as digital and artificial intelligence (AI). It will define what a data platform is and the risks of not having one. It will also cover reasons why organizations have not built a data platform.

    The business drivers for a data platform

    The business need for a data platform is based on the new uses for data, centered on three main factors: digital transformation, the advent of artificial intelligence, and the mass migration to the cloud. The discussion on data today first starts with a conversation on the digital transformation. Digital transformation is not new. In fact, it can be accurately stated that it is at least twenty-five years old. Early-adopter organizations that started as or moved to digital have gained a significant competitive advantage in their industries. Meanwhile, the rest of the world has recognized the imperative of going digital in the past ten years and has started transformation programs of some sort, trying to catch up. Social media, digital marketing, and e-commerce are all visceral aspects of the world’s pivot to digital. The COVID-19 pandemic has accelerated the world’s economy into those digital channels of working and purchasing as the stay-at-home orders descended from national to local governments. The fuel for this digital revolution is data. Every event on a digital channel is an opportunity to quantify and analyze behaviors that will drive usage, cost savings, or additional revenue.

    Digital is not the only driver for a data platform; artificial intelligence (AI) / machine learning is every bit as disruptive as digital in its use of data and is a key enabler for real-time decision-making in digital channels. To develop these AI processes, the data scientist requires data science sandboxes for training and test data.

    The final driver for a data platform is the cloud. The promise of lower cost and less management is driving organizations to plan and move their data estates to the cloud with legacy- and digitally driven use cases.

    Figure 1.1. The multiple use cases for data.

    A data platform provides a multipurpose environment to provision and provide data for all these use cases in a common, cost-effective manner that does not require massive duplication, ensures higher quality data, and reduces operational costs. The need for a common data environment to meet these use cases becomes readily apparent when one considers the technical and business drivers. One of the many (and maybe not the best) reasons
