Practical DataOps: Delivering Agile Data Science at Scale
Ebook · 470 pages · 5 hours


About this ebook

Gain a practical introduction to DataOps, a new discipline for delivering data science at scale inspired by practices at companies such as Facebook, Uber, LinkedIn, Twitter, and eBay. Organizations need more than the latest AI algorithms, hottest tools, and best people to turn data into insight-driven action and useful analytical data products. Processes and thinking employed to manage and use data in the 20th century are a bottleneck for working effectively with the variety of data and advanced analytical use cases that organizations have today. This book provides the approach and methods to ensure continuous rapid use of data to create analytical data products and steer decision making.
Practical DataOps shows you how to optimize the data supply chain from diverse raw data sources to the final data product, whether the goal is a machine learning model or other data-orientated output. The book provides an approach to eliminate wasted effort and improve collaboration between data producers, data consumers, and the rest of the organization through the adoption of lean thinking and agile software development principles.
This book helps you to improve the speed and accuracy of analytical application development through data management and DevOps practices that securely expand data access, and rapidly increase the number of reproducible data products through automation, testing, and integration. The book also shows how to collect feedback and monitor performance to manage and continuously improve your processes and output. 

What You Will Learn
  • Develop a data strategy for your organization to help it reach its long-term goals
  • Recognize and eliminate barriers to delivering data to users at scale
  • Work on the right things for the right stakeholders through agile collaboration
  • Create trust in data via rigorous testing and effective data management
  • Build a culture of learning and continuous improvement through monitoring deployments and measuring outcomes
  • Create cross-functional self-organizing teams focused on goals, not reporting lines
  • Build robust, trustworthy data pipelines in support of AI, machine learning, and other analytical data products

Who This Book Is For
Data science and advanced analytics experts, CIOs, CDOs (chief data officers), chief analytics officers, business analysts, business team leaders, and IT professionals (data engineers, developers, architects, and DBAs) supporting data teams who want to dramatically increase the value their organization derives from data. The book is ideal for data professionals who want to overcome the challenges of long delivery times, poor data quality, high maintenance costs, and scaling difficulties in getting data science output and machine learning into customer-facing production.
Language: English
Publisher: Apress
Release date: Dec 9, 2019
ISBN: 9781484251041

    Book preview

    Practical DataOps - Harvinder Atwal

    Part I: Getting Started

    © Harvinder Atwal 2020

    H. Atwal, Practical DataOps, https://doi.org/10.1007/978-1-4842-5104-1_1

    1. The Problem with Data Science

    Harvinder Atwal, Isleworth, UK

    Before adopting DataOps as a solution, it's important to understand the problem we're trying to solve. When you view articles online, hear presentations at conferences, or read of the success of leading data-driven organizations like Facebook, Amazon, Netflix, and Google (FANG), delivering successful data science seems like a simple process. The reality is very different.

    While there are undoubtedly success stories, there is also plenty of evidence that substantial investment in data science is not generating the expected returns for the majority of organizations. There are multiple causes, but they stem from two root causes: first, a 20th-century information architecture approach to handling data and analytics in the 21st century; second, a lack of knowledge and organizational support for data science and analytics. The common (20th-century) mantras espoused in the industry to overcome these problems make matters worse, not better.

    Is There a Problem?

    It is possible to create competitive advantage and solve worthy problems using data. Many organizations are managing to generate legitimate success stories from their investments in data science and data analytics:

    Netflix's VP of Product Innovation Carlos Uribe-Gomez and Chief Product Officer Neil Hunt published a paper stating that some of the company's recommendation algorithms save Netflix $1 billion each year in reduced churn.¹

    One of Monsanto's data science initiatives, to improve global transportation and logistics, delivers annual savings and cost avoidance of nearly $14 million, while simultaneously reducing CO2 emissions by 350 metric tons (MT).²

    Alphabet's DeepMind, better known for its AlphaGo program, has developed an artificial intelligence (AI) system in partnership with London's Moorfields Eye Hospital that recommends treatment referrals for over 50 sight-threatening diseases as accurately as world-leading expert doctors.³

    Not wanting to be left behind, most organizations are now spending heavily on expensive technology and hiring costly teams of data scientists, data engineers, and data analysts to make sense of their data and drive decisions. What was once a niche activity in even the largest organizations is now seen as a core competency. The investment and job position growth rates are staggering considering global GDP is only growing at 3.5% annually:

    International Data Corp. (IDC) expects worldwide revenue for big data and business analytics solutions to reach $260 billion in 2022, a compound annual growth rate of 11.9% over the 2017–2022 period.

    LinkedIn’s emerging jobs reports rank machine learning engineers, data scientists, and big data engineers as three of the top four fastest-growing jobs in the United States between 2012 and 2017. Data scientist roles grew over 650% over that period!

    The Reality

    Despite massive monetary outlay, only a minority of organizations achieve meaningful results. Case studies demonstrating quantifiable outcomes are isolated exceptions, even allowing for reluctance to disclose competitive advantages. Exponential growth in the volume of data, rapid increases in solutions spending, and improvements in technology and algorithms have not led to an increase in data analytics productivity.

    There is some indication that the success rate of data analytics projects is declining. In 2016, Forrester concluded that only 22% of companies saw high revenue growth and profit from their investments in data science.⁶ Also in 2016, Gartner estimated that 60% of big data projects fail, but it gets worse. In 2017, Gartner's Nick Heudecker issued a correction: the 60% estimate was too conservative, and the real failure rate was closer to 85%.⁷ Although much of the survey data relates to big data, I still think Nick's results are relevant. Outside the data science field, most people mistakenly treat big data, data science, and data analytics as interchangeable terms and will respond to surveys as such.

    There may be multiple reasons for the meager rate of return and failure to improve productivity despite serious investment in data science and analytics. Explosive growth in data capture may result in the acquisition of ever-lower marginal value data. Technology, software libraries, and algorithms may not be keeping pace with the volume and complexity of data captured. The skill levels of data scientists could be insufficient. Processes might not be evolving to take advantage of data-driven opportunities. Finally, organizational and cultural barriers could be preventing data exploitation.

    Data Value

    There is no indication that the marginal value of data collected has declined. Much of the additional data captured comes from new sources, such as Internet of Things (IoT) device sensors and mobile devices, and is increasingly nonstructured (text and images) or semi-structured (documents generated by event logs). The higher volume and variety of data acquired expands the opportunity for data scientists to extract knowledge and drive decisions.

    However, there is evidence that poor data quality remains a serious challenge. In Figure Eight's 2018 Data Scientist Report, 55% of data scientists cited quality/quantity of training data as their biggest challenge.⁸ The rate had changed little since the inaugural 2015 report, when 52.3% of data scientists cited poor-quality data as their biggest daily obstacle.⁹ Dirty data was also cited as the number one barrier in Kaggle's 2017 The State of Data & Machine Learning survey of 16,000 respondents, while data being unavailable or difficult to access was the fifth most significant barrier, mentioned by 30.2% of respondents.¹⁰

    Technology, Software, and Algorithms

    There is no indication that technology, software libraries, and algorithms are failing to keep up with the volume and complexity of data captured. Technology and software libraries continue to evolve to handle increasingly challenging problems while adding simplified interfaces to hide complexity from users or increase automation. Where once running a complex on-premises Hadoop cluster was the only choice for working with multiple terabytes of data, the same workloads can now be run on managed Spark or SQL query engines as a service in the cloud with no infrastructure engineering requirement.

    Software libraries like Keras make working with deep learning libraries such as Google’s popular TensorFlow much easier. Vendors like DataRobot have automated the production of machine learning models. Advances in deep learning algorithms and architectures, and large neural networks with many layers, such as convolutional neural networks (CNNs) and long short-term memory networks (LSTM networks), have enabled a step-change in natural language processing (NLP), machine translation, image recognition, voice processing, and real-time video analysis. In theory, all these developments should be improving productivity and return on investment (ROI) of data science investment. Maybe organizations are using outdated or wrong technology.
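
    To illustrate how simple these interfaces have become, here is a minimal sketch of a Keras model definition. The layer sizes and synthetic data are my own illustrative assumptions, not an example from a real application:

        import numpy as np
        from tensorflow import keras

        # Synthetic data: 1,000 examples with 20 features and a binary label.
        X = np.random.rand(1000, 20)
        y = np.random.randint(0, 2, size=1000)

        # A few lines of Keras hide the computational graph TensorFlow builds underneath.
        model = keras.Sequential([
            keras.layers.Dense(64, activation="relu", input_shape=(20,)),
            keras.layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
        model.fit(X, y, epochs=5, batch_size=32)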

    Data Scientists

    Because data science is a relatively new field, the inexperience of data scientists may be a problem. In Kaggle's The State of Data & Machine Learning survey, the modal age range of data scientists was just 24–26 years old, and the median age was 30. Median age varied by country; for the United States, it was 32. However, this is still far lower than the median age of the American worker at 41 years old. Educational attainment was not a problem, though: 15.6% had a doctorate, 42% held a master's, and 32% a bachelor's degree.¹⁰ Since all forms of advanced analytics were marginal before 2010, there is also a deficiency of experienced managers. As a result, we have many extremely bright data scientists short of experience in dealing with organizational culture and lacking senior analytical leadership.

    Data Science Processes

    It is challenging to find survey data on the processes and methodologies used to deliver data science. KDnuggets' 2014 survey showed the cross-industry standard process for data mining (CRISP-DM) as the top methodology for analytics, data mining, and data science projects, used by 43% of respondents.¹¹ The next most popular approach was not a method at all: respondents following their own homegrown process. The SAS Institute's Sample, Explore, Modify, Model, and Assess (SEMMA) model was third, but in rapid decline because its use is tightly coupled to SAS products.

    The challenge with CRISP-DM and other data mining methodologies, like Knowledge Discovery in Databases (KDD), is that they treat data science as a much more linear process than it is. They encourage data scientists to spend significant time planning and analyzing for a single near-perfect delivery, which may not be what the customer ultimately wants. No attention is focused on a minimum viable product, feedback from customers, or iteration to ensure that you're spending time wisely working on the right thing. They also treat deployment and monitoring as a "throw it over the fence" problem, where work is passed to other teams for completion with little communication or collaboration, reducing the chances of successful delivery.

    In response, many groups have proposed new methodologies, including Microsoft with its Team Data Science Process (TDSP).¹² TDSP is a significant improvement over previous approaches and recognizes that data science delivery needs to be agile, iterative, standardized, and collaborative. Unfortunately, TDSP does not seem to be gaining much traction. TDSP and similar methodologies are also restricted to the data science lifecycle. There is an opportunity for a methodology that encompasses the end-to-end data lifecycle, from acquisition to retirement.

    Organizational Culture

    Emotional, situational, and cultural factors heavily influence business decisions. FORTUNE Knowledge Group's survey of more than 700 high-level executives from a variety of disciplines across nine industries demonstrates the barriers to data-driven decision-making. A majority (61%) of executives agree that when making decisions, human insights must precede hard analytics. Sixty-two percent of respondents contend that it's often necessary to rely on gut feelings and that soft factors should be given the same weight as hard factors. Worryingly, two-thirds (66%) of IT executives say decisions are often made out of a desire to conform to the way things have always been done.¹³ These are not isolated findings. NewVantage Partners' Big Data Executive Survey 2017 found cultural challenges remain an impediment to successful business adoption:

    More than 85% of respondents report that their firms have started programs to create data-driven cultures, but only 37% report success thus far. Big data technology is not the problem; management understanding, organizational alignment, and general organizational resistance are the culprits. If only people were as malleable as data.¹⁴

    It is no surprise that very few companies have followed the example of Amazon and replaced highly paid white-collar decision-makers with algorithms, despite the enormous success Amazon has achieved.¹⁵

    The challenge in delivering successful data science has much less to do with technology than with cultural attitudes: many organizations alternately treat data science as a box-ticking exercise or as part of the never-ending pursuit of a perfect solution to all their challenges. Nor is the problem the effectiveness of algorithms. Algorithms and technology are well ahead of our ability to feed them high-quality data, overcome people barriers (skills, culture, and organization), and implement data-centric processes. However, these symptoms are themselves the result of deeper root causes, such as a lack of knowledge of the best way to use data to make decisions, legacy perceptions from last century's approach to handling data and delivering analytics, and a shortage of support for data analytics.

    The Knowledge Gap

    Multiple knowledge gaps make it hard to embed data science in organizations, starting at the very top of an organization. Nevertheless, it is too easy to blame business leaders and IT professionals for the failure to deliver results. The knowledge gap is a two-way street, because data scientists must share the blame.

    The Data Scientist Knowledge Gap

    Data science aims to facilitate better decisions, leading to beneficial actions, by extracting knowledge from data. To enable better decisions, data scientists need a good understanding of the business domain so they can understand the business problem, identify the right data and prepare it (often detecting quality issues for the first time), employ the right algorithms on the data, validate their approaches, convince stakeholders to act, operationalize their output, and measure results. This breadth of scope necessitates an extensive range of skills: the ability to collaborate and coordinate with multiple functions within the organization in addition to their own job area, critical and scientific thinking, coding and software development skills, and knowledge of a wide range of machine learning and statistical algorithms. Moreover, the ability to communicate complex ideas to a nontechnical audience is crucial, as is business acumen in a commercial setting. In the data science profession, someone with the combination of all these skills is known as a unicorn.

    Since finding a unicorn is rare if not impossible (they don't sign up to LinkedIn or attend meetups), organizations try to find the next best thing. They hire people with programming (Python or R), analysis, machine learning, statistics, and computer science skills, which happen to be the five skills most sought after by employers.¹⁶ These skills should be the differentiator between data scientists and everyone else. Unfortunately, this belief reinforces the mistaken conviction among junior data scientists that specialist technical skills should be the focus, and it tends to create dangerously homogeneous teams.

    "Create me a machine translation attention model using a bi-directional long short-term memory (LSTM) with an attention layer outputting to a stacked post-attention LSTM feeding a softmax layer for predictions," said no CEO, ever.

    When interviewing, I'm always staggered by the number of candidates who say their primary objective is a role that allows them to create deep learning, reinforcement learning, and [insert your algorithm here] models. Their aim is not to solve a real problem or help customers, but to apply today's hottest technique. Data science is too valuable to treat as a paid hobby.

    There is a disconnect between the skills data scientists think they need and what they really need. Unfortunately, technical skills are nowhere near enough to drive real success and beneficial actions from data-driven decisions. Faced with hard-to-access poor-quality data, lack of management support, no clear questions to answer, or results ignored by decision-makers, data scientists without senior data science leadership aren’t equipped to change the culture. Some look for greener pastures and get a new job, only to realize that similar challenges exist in most organizations. Others focus on the part of the process they can control, the modeling.

    In data science, there is an overemphasis on machine learning and deep learning and, especially among junior data scientists, a belief that working in isolation to maximize model accuracy scores on a test dataset is the definition of success. This behavior is encouraged by training courses, online articles, and especially Kaggle. High test set prediction accuracy seems a bizarre interpretation of success to me. In my experience, it is usually better to try ten solution scenarios than to spend weeks on premature optimization of a single solution, because you don't know in advance what is going to work. It is only when you get feedback from consumers and measure results that you see whether you have driven beneficial action or even useful learning. At that point, you can decide the value of expending further effort to optimize.

    The aim must be to get a minimum viable product into production. A perfect model on a laptop that never goes into production wastes effort, so it is worse than a model that does not exist. There are domains where model accuracy is paramount, such as medical diagnosis, fraud detection, and AdTech, but these are a minority compared to applications where doing anything is a significant improvement over doing nothing. Even in domains benefiting disproportionately from optimizing model accuracy, quantifying real-world impact is still more important.

    Getting a model into production requires different technical skills than creating the model, the most important of which are relevant software development skills. For many data scientists, who mainly come from non-software development backgrounds, coding is just a means to an end. They are unaware that coding and software engineering are disciplines with their own sets of best practices. Alternatively, if they are aware, they tend to see writing reusable code, version control, and testing or documentation as obstacles to be avoided.
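
    As a minimal sketch of what those practices look like in a data science context, here is a hypothetical pandas cleaning function with a pytest-style unit test; both the function and the data are invented for illustration:

        import pandas as pd

        def clean_prices(df: pd.DataFrame) -> pd.DataFrame:
            """Return a copy with non-positive prices removed and currency codes stripped of whitespace."""
            out = df.copy()
            out["currency"] = out["currency"].str.strip()
            return out[out["price"] > 0].reset_index(drop=True)

        def test_clean_prices():
            # Hypothetical fixture: one valid row, one invalid row.
            df = pd.DataFrame({"price": [10.0, -1.0], "currency": [" GBP", "GBP "]})
            cleaned = clean_prices(df)
            assert len(cleaned) == 1
            assert cleaned.loc[0, "currency"] == "GBP"

    Small, reusable, tested functions like this are what make analytical code reproducible once it moves beyond a single laptop.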

    Weak development skills cause difficulties, especially for reproducibility, performance, and quality of work. The barriers to getting models into production are not the sole responsibility of data scientists. Often, they do not have access to the tools and systems required to get their models into production and must therefore rely on other teams to facilitate implementation. Naïve data scientists ignore the gulf between local development and server-based production and treat it as a "throw it over the fence" problem, not thinking through the implications of their choice of programming language. This inexperience causes avoidable friction and failure.

    IT Knowledge Gap

    Superficially, data science and software development share similarities. Both involve code, data, databases, and computing environments. So, data scientists require some software development skills. However, there is a crucial distinction, demonstrated in Figure 1-1, between machine learning and regular programming.

    Figure 1-1. Difference between regular programming and machine learning

    In regular programming, rules or logic are applied to input data to generate an output (Output = f(Inputs), such as Z = X + Y) based on well-understood requirements. In machine learning, examples of outputs and their input data, along with the data's individual properties, known as features, feed a machine learning algorithm. The algorithm attempts to learn the rules that generate outputs from inputs by minimizing a cost function via a training process, and it will never achieve perfect accuracy on real-life data. Once suitably trained, the machine learning model's rules can be used like regular program logic to make predictions for new input data. The difference between regular programming and machine learning has profound implications for data quality, data access, testing, development processes, and even computing requirements.
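
    As a hedged sketch of this contrast, consider learning the trivial rule Z = X + Y from examples using scikit-learn; the addition example is invented for illustration:

        import numpy as np
        from sklearn.linear_model import LinearRegression

        # Regular programming: the rule Z = X + Y is written down explicitly.
        def add(x, y):
            return x + y

        # Machine learning: the same rule is learned from examples of inputs and outputs.
        inputs = np.random.rand(1000, 2)   # columns play the role of X and Y
        outputs = inputs.sum(axis=1)       # observed Z for each example
        model = LinearRegression().fit(inputs, outputs)

        print(add(2.0, 3.0))               # exactly 5.0
        print(model.predict([[2.0, 3.0]])) # approximately 5.0, learned from data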

    "Garbage in, garbage out" applies to regular programming, but high-quality data is even more essential for machine learning because the algorithm depends on good data to learn its rules. Poor-quality data will lead to inferior training and predictions. Generally, more data allows a machine learning algorithm to decipher more complexity and generate more accurate predictions. Moreover, more features inherent in the data enable an algorithm to improve predictive accuracy. Data scientists can also engineer additional features from existing data based on domain knowledge and experience.
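
    For instance, here is a brief sketch of feature engineering with pandas, using an invented transactions table; the derived features are illustrative assumptions about what domain knowledge might suggest:

        import pandas as pd

        # Hypothetical raw transactions data.
        df = pd.DataFrame({
            "amount": [120.0, 40.0, 300.0],
            "n_items": [3, 1, 6],
            "timestamp": pd.to_datetime(["2019-03-01 09:30", "2019-03-02 23:10", "2019-03-03 14:05"]),
        })

        # Engineered features: spend per item and time-of-day signals a model might use.
        df["amount_per_item"] = df["amount"] / df["n_items"]
        df["hour"] = df["timestamp"].dt.hour
        df["is_late_night"] = (df["hour"] < 6) | (df["hour"] >= 22)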

    Model training is iterative and computationally expensive. High-capacity memory and more powerful CPUs allow you to use more data and more sophisticated algorithms. The languages and libraries used to create machine learning models are specialized for data analytics, typically the R and Python programming languages with their packages and libraries, respectively. However, once a model has been created, deployment processes are much more familiar to a software developer.

    In regular programming, logic is the most important part; that is, ensuring that the code is correct is critical. Development and testing environments often do not need high-performance computers, and sample data with sufficient coverage is enough to complete testing. In data science, both the data and the code are critical. There is no correct answer to test for, only an acceptable level of accuracy. Often, minimal code (compared to regular programming) is required to fit and validate a model to high test accuracy (e.g., 95%). The complexity lies in ensuring that data is available, understood, and correct.
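
    As a rough illustration of how little code that step can take, here is a sketch using scikit-learn and one of its bundled demonstration datasets; the dataset and model choice are assumptions for illustration:

        from sklearn.datasets import load_breast_cancer
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split

        # A handful of lines fits and validates a model; the hard part is the data, not this code.
        X, y = load_breast_cancer(return_X_y=True)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
        model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
        print(f"Test accuracy: {model.score(X_test, y_test):.2f}")  # typically around 0.95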

    Even for training, data must be production data, or the model will not predict well on new unseen data with a different distribution. Sample data is not useful unless data scientists select test data as part of a deliberate strategy (e.g., random sampling, cross-validation, stratified sampling) specific to the problem. Although these requirements relate to machine learning, they apply to all data science work in general.
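
    A short sketch of two such deliberate strategies, stratified splitting and cross-validation, again using scikit-learn; the imbalanced synthetic data is an assumption for illustration:

        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

        # Synthetic imbalanced data: roughly 10% positive class.
        X = np.random.rand(1000, 5)
        y = (np.random.rand(1000) < 0.1).astype(int)

        # Stratified sampling preserves the class balance in both train and test splits.
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

        # Cross-validation scores the model on several held-out folds rather than a single split.
        scores = cross_val_score(LogisticRegression(), X, y, cv=StratifiedKFold(n_splits=5))
        print(scores.mean())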

    The needs of data scientists are often misinterpreted by IT, even by those who are supportive, as nice-to-haves. Data scientists are frequently asked to justify why they need access to multiple data sources, complete production data, specific software, and powerful computers when other developers don't need them and reporting analysts have mined data for years by just running SQL queries on relational databases. IT is frustrated that data scientists don't understand the reasons behind IT practices. Data scientists are frustrated because it's not easy to justify the value of what they consider necessities upfront. More than once, the question "But why do you need this data?" has made my heart sink.

    It is rare to see IT processes designed to support advanced analytics. It starts with the way data is captured. Many developers see capturing new data items as a burden with an associated cost in planning, analysis, design, implementation, and maintenance time. Most organizations, therefore, collect data foremost to support operational processes like customer relationship management (CRM), financial management, supply chain management, e-commerce, and marketing. Frequently, this data resides in separate silos, each with its own strict data governance strategy.

    Often, data will go through an ETL (Extract, Transform, Load) process to transform it into structured data (typically a tabular format) before loading it into a data warehouse to make it accessible for analytics. There are drawbacks to this approach for data science. Only a subset of data makes its way through the ETL process, and that subset is typically prioritized for reporting. Adding new data items can take months of development. As a result, raw data is unavailable to data scientists. Raw data is what they need!

    Traditional data warehouses usually only handle structured data in relational schemas (with data split across multiple joinable tables to avoid duplication and improve performance) and can struggle to manage the scale of data we have available today. They also don’t handle modern use cases that require unstructured data like text or sometimes even machine-generated semi-structured data formats like JSON (JavaScript Object Notation). One solution is the creation of data lakes where data is stored in raw native format and goes through an ELT (Extract, Load, Transform) process when needed, with the transformation being dependent on the use case.
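
    To make the ETL/ELT distinction concrete, here is a small sketch in Python; the event records and field names are invented for illustration:

        import json
        import pandas as pd

        # Stand-in for raw, semi-structured event logs.
        raw_events = [json.dumps({"user": i, "action": "click", "meta": {"page": "home"}})
                      for i in range(3)]

        # ETL: transform to a fixed tabular schema first, then load; unselected detail is lost.
        rows = [{"user": json.loads(e)["user"], "action": json.loads(e)["action"]}
                for e in raw_events]
        warehouse_table = pd.DataFrame(rows)  # the nested meta field never reaches the warehouse

        # ELT: load raw JSON into the lake as-is; transform later, per use case.
        lake = list(raw_events)
        per_use_case = pd.json_normalize([json.loads(e) for e in lake])  # keeps meta.page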

    When a data lake is not available, data scientists must extract the data themselves and combine it on a local machine, or work with data engineers to build pipelines into an environment with the tools, computing resources, and storage they need. Requests to access data, provision environments, and install tools and software are often the responsibility of separate teams with varying concerns about security, costs, and governance. As such, data scientists need to work with different groups to deploy and schedule their models, dashboards, and APIs. With processes in place that are incongruent with data science needs, costs are greatly elevated.

    The entire data lifecycle splits across many IT teams, each of which, in isolation, makes rational decisions based on its functional silo objectives. Such silo objectives do not serve data scientists. For data scientists, who need data pipelines from raw data to final data product, significant challenges arise. They need to justify their requirements and negotiate with multiple stakeholders to complete a delivery. Even if they are successful, they will still be dependent on other teams for many tasks and at the mercy of backlogs and prioritization. No one person or function is responsible for the entire pipeline, leading to delays, bottlenecks, and operational risk.

    Data security and privacy are occasionally cited as obstacles to accessing and processing data. There are genuine concerns to ensure compliance with regulations, respect user privacy, protect reputation, defend competitive advantage, and prevent malicious damage. However, such concerns can also be used to take a risk-averse route and not implement solutions that allow for the safe, legitimate, and ethical use of data. More typically, problems occur when data security and privacy policies are implemented without undertaking a thorough cost-benefit analysis and fully understanding the impact on data analytics.

    Technology Knowledge Gap

    Although technology is not the only barrier to the successful implementation of data science, it is still crucial to get tooling right. Figure 1-2 shows the typical hardware and software layers in a data lifecycle from raw data to useful business applications.

    Figure 1-2. Typical hardware and software layers in the data lifecycle

    Many software and hardware requirements need to come together to create data products. There must be a holistic understanding of requirements and a balance of investment across all the lifecycle layers. Unfortunately, it is easy to focus on one part of the jigsaw to the detriment of others. Large enterprises tend to concentrate on the big data technologies used to build applications. They obsess over Kafka, Spark, and Kubernetes, but fail to provide their data scientists with sufficient access to the data, software libraries, and tools they need. Smaller organizations are more likely to provide their data scientists with the software tools they need, but may fail to invest in storage and processing technologies, leaving analytics processing isolated on laptops.

    Even if they do get the investment in tools right, organizations can still underestimate the supporting resources needed to build, maintain, and optimize the stack. Without sufficient talent in data engineering, data governance, DevOps, database administration, solutions architecture, and infrastructure engineering, it is next to impossible to utilize the tools.
