Practical DataOps: Delivering Agile Data Science at Scale
Ebook · 470 pages · 5 hours


About this ebook

Gain a practical introduction to DataOps, a new discipline for delivering data science at scale inspired by practices at companies such as Facebook, Uber, LinkedIn, Twitter, and eBay. Organizations need more than the latest AI algorithms, hottest tools, and best people to turn data into insight-driven action and useful analytical data products. Processes and thinking employed to manage and use data in the 20th century are a bottleneck for working effectively with the variety of data and advanced analytical use cases that organizations have today. This book provides the approach and methods to ensure continuous rapid use of data to create analytical data products and steer decision making.
Practical DataOps shows you how to optimize the data supply chain from diverse raw data sources to the final data product, whether the goal is a machine learning model or other data-orientated output. The book provides an approach to eliminate wasted effort and improve collaboration between data producers, data consumers, and the rest of the organization through the adoption of lean thinking and agile software development principles.
This book helps you to improve the speed and accuracy of analytical application development through data management and DevOps practices that securely expand data access, and rapidly increase the number of reproducible data products through automation, testing, and integration. The book also shows how to collect feedback and monitor performance to manage and continuously improve your processes and output. 

What You Will Learn
  • Develop a data strategy for your organization to help it reach its long-term goals
  • Recognize and eliminate barriers to delivering data to users at scale
  • Work on the right things for the right stakeholders through agile collaboration
  • Create trust in data via rigorous testing and effective data management
  • Build a culture of learning and continuous improvement through monitoring deployments and measuring outcomes
  • Create cross-functional self-organizing teams focused on goals, not reporting lines
  • Build robust, trustworthy data pipelines in support of AI, machine learning, and other analytical data products

Who This Book Is For
Data science and advanced analytics experts, CIOs, CDOs (chief data officers), chief analytics officers, business analysts, business team leaders, and IT professionals (data engineers, developers, architects, and DBAs) supporting data teams who want to dramatically increase the value their organization derives from data. The book is ideal for data professionals who want to overcome the challenges of long delivery times, poor data quality, high maintenance costs, and scaling difficulties in getting data science output and machine learning into customer-facing production.
Language: English
Publisher: Apress
Release date: Dec 9, 2019
ISBN: 9781484251041

    Book preview

    Practical DataOps - Harvinder Atwal

    Part I: Getting Started

    © Harvinder Atwal 2020

    H. Atwal, Practical DataOps, https://doi.org/10.1007/978-1-4842-5104-1_1

    1. The Problem with Data Science

    Harvinder Atwal, Isleworth, UK

    Before adopting DataOps as a solution, it's important to understand the problem we're trying to solve. When you view articles online, hear presentations at conferences, or read of the success of leading data-driven organizations like Facebook, Amazon, Netflix, and Google (FANG), delivering successful data science seems like a simple process. The reality is very different.

    While there are undoubtedly success stories, there is also plenty of evidence that substantial investment in data science is not generating the expected returns for the majority of organizations. There are multiple causes, but they stem from two root causes: first, a 20th-century information architecture approach to handling data and analytics in the 21st century; second, a lack of knowledge and organizational support for data science and analytics. The common (20th-century) mantras espoused in the industry to overcome these problems make matters worse, not better.

    Is There a Problem?

    It is possible to create competitive advantage and solve worthy problems using data. Many organizations are managing to generate legitimate success stories from their investments in data science and data analytics:

    Netflix's VP of Product Innovation Carlos Uribe-Gomez and Chief Product Officer Neil Hunt published a paper stating that some of the company's recommendation algorithms save Netflix $1 billion each year in reduced churn.¹

    One of Monsanto's data science initiatives, to improve global transportation and logistics, delivers annual savings and cost avoidance of nearly $14 million, while simultaneously reducing CO2 emissions by 350 metric tons (MT).²

    Alphabet's DeepMind, better known for its AlphaGo program, has developed an artificial intelligence (AI) system in partnership with London's Moorfields Eye Hospital that recommends treatment referrals for over 50 sight-threatening diseases as accurately as world-leading expert doctors.³

    Not wanting to be left behind, most organizations are now spending heavily on expensive technology and hiring costly teams of data scientists, data engineers, and data analysts to make sense of their data and drive decisions. What was once a niche activity in even the largest organizations is now seen as a core competency. The investment and job position growth rates are staggering considering global GDP is only growing at 3.5% annually:

    International Data Corp. (IDC) expects worldwide revenue for big data and business analytics solutions to reach $260 billion in 2022, a compound annual growth rate of 11.9% over the 2017–2022 period.

    LinkedIn’s emerging jobs reports rank machine learning engineers, data scientists, and big data engineers as three of the top four fastest-growing jobs in the United States between 2012 and 2017. Data scientist roles grew over 650% over that period!

    The Reality

    Despite massive monetary outlay, only a minority of organizations achieve meaningful results. Case studies demonstrating quantifiable outcomes are isolated exceptions, even allowing for reluctance to disclose competitive advantages. Exponential growth in the volume of data, rapid increases in solutions spending, and improvements in technology and algorithms have not led to an increase in data analytics productivity.

    There is some indication that the success rate of data analytics projects is declining. In 2016, Forrester concluded that only 22% of companies saw high revenue growth and profit from their investments in data science.⁶ Also in 2016, Gartner estimated that 60% of big data projects fail, but it gets worse. In 2017, Gartner's Nick Heudecker issued a correction: the 60% estimate was too conservative, and the real failure rate was closer to 85%.⁷ Although much of the survey data relates to big data, I still think Nick's results are relevant. Outside the data science field, most people mistakenly treat big data, data science, and data analytics as interchangeable terms and will respond to surveys as such.

    There may be multiple reasons for the meager rate of return and failure to improve productivity despite serious investment in data science and analytics. Explosive growth in data capture may result in the acquisition of ever-lower marginal value data. Technology, software libraries, and algorithms may not be keeping pace with the volume and complexity of data captured. The skill levels of data scientists could be insufficient. Processes might not be evolving to take advantage of data-driven opportunities. Finally, organizational and cultural barriers could be preventing data exploitation.

    Data Value

    There is no indication that the marginal value of data collected has declined. Much of the additional data captured comes from new sources, such as Internet of Things (IoT) device sensors and mobile devices, and is increasingly nonstructured (text and images) or semi-structured (documents generated by event logs). The higher volume and variety of data acquired expands the opportunity for data scientists to extract knowledge and drive decisions.

    However, there is evidence that poor data quality remains a serious challenge. In Figure Eight's 2018 Data Scientist Report, 55% of data scientists cited quality/quantity of training data as their biggest challenge.⁸ The rate had changed little since the inaugural 2015 report, when 52.3% of data scientists cited poor-quality data as their biggest daily obstacle.⁹ Dirty data was also cited as the number one barrier in Kaggle's 2017 The State of Data & Machine Learning survey of 16,000 respondents, while data being unavailable or difficult to access was the fifth most significant barrier, mentioned by 30.2% of respondents.¹⁰

    Technology, Software, and Algorithms

    There is no indication that technology, software libraries, and algorithms are failing to keep up with the volume and complexity of data captured. Technology and software libraries continue to evolve to handle increasingly challenging problems while adding simplified interfaces to hide complexity from users or increase automation. Where once running a complex on-premises Hadoop cluster was the only choice for working with multiple terabytes of data, the same workloads can now be run on managed Spark or SQL query engines as a service in the cloud with no infrastructure engineering requirement.

    Software libraries like Keras make working with deep learning libraries such as Google’s popular TensorFlow much easier. Vendors like DataRobot have automated the production of machine learning models. Advances in deep learning algorithms and architectures, and large neural networks with many layers, such as convolutional neural networks (CNNs) and long short-term memory networks (LSTM networks), have enabled a step-change in natural language processing (NLP), machine translation, image recognition, voice processing, and real-time video analysis. In theory, all these developments should be improving productivity and return on investment (ROI) of data science investment. Maybe organizations are using outdated or wrong technology.
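
    To illustrate how simple these interfaces have become, here is a minimal sketch of a Keras model definition. The layer sizes and synthetic data are my own illustrative assumptions, not an example from a real application:

        import numpy as np
        from tensorflow import keras

        # Synthetic data: 1,000 examples with 20 features and a binary label.
        X = np.random.rand(1000, 20)
        y = np.random.randint(0, 2, size=1000)

        # A few lines of Keras hide the computational graph TensorFlow builds underneath.
        model = keras.Sequential([
            keras.layers.Dense(64, activation="relu", input_shape=(20,)),
            keras.layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
        model.fit(X, y, epochs=5, batch_size=32)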

    Data Scientists

    Because data science is a relatively new field, the inexperience of data scientists may be a problem. In Kaggle's The State of Data & Machine Learning survey, the modal age range of data scientists was just 24–26 years old, and the median age was 30. Median age varied by country; for the United States, it was 32. However, this is still far lower than the median age of the American worker at 41 years old. Educational attainment was not a problem, though: 15.6% had a doctorate, 42% held a master's, and 32% a bachelor's degree.¹⁰ Since all forms of advanced analytics were marginal before 2010, there is also a deficiency of experienced managers. As a result, we have many extremely bright data scientists short of experience in dealing with organizational culture and lacking senior analytical leadership.

    Data Science Processes

    It is challenging to find survey data on the processes and methodologies used to deliver data science. KDnuggets' 2014 survey showed the cross-industry standard process for data mining (CRISP-DM) as the top methodology for analytics, data mining, and data science projects, used by 43% of respondents.¹¹ The next most popular approach was not a method at all: respondents following their own homegrown process. The SAS Institute's Sample, Explore, Modify, Model, and Assess (SEMMA) model was third, but in rapid decline because its use is tightly coupled to SAS products.

    The challenge with CRISP-DM and other data mining methodologies, like Knowledge Discovery in Databases (KDD), is that they treat data science as a much more linear process than it is. They encourage data scientists to spend significant time planning and analyzing for a single near-perfect delivery, which may not be what the customer ultimately wants. No attention is focused on a minimum viable product, feedback from customers, or iteration to ensure that you're spending time wisely working on the right thing. They also treat deployment and monitoring as a "throw it over the fence" problem, where work is passed to other teams for completion with little communication or collaboration, reducing the chances of successful delivery.

    In response, many groups have proposed new methodologies, including Microsoft with its Team Data Science Process (TDSP).¹² TDSP is a significant improvement over previous approaches and recognizes that data science delivery needs to be agile, iterative, standardized, and collaborative. Unfortunately, TDSP does not seem to be gaining much traction. TDSP and similar methodologies are also restricted to the data science lifecycle. There is an opportunity for a methodology that encompasses the end-to-end data lifecycle, from acquisition to retirement.

    Organizational Culture

    Emotional, situational, and cultural factors heavily influence business decisions. FORTUNE Knowledge Group's survey of more than 700 high-level executives from a variety of disciplines across nine industries demonstrates the barriers to data-driven decision-making. A majority (61%) of executives agree that when making decisions, human insights must precede hard analytics. Sixty-two percent of respondents contend that it's often necessary to rely on gut feelings and that soft factors should be given the same weight as hard factors. Worryingly, two-thirds (66%) of IT executives say decisions are often made out of a desire to conform to the way things have always been done.¹³ These are not isolated findings. NewVantage Partners' Big Data Executive Survey 2017 found cultural challenges remain an impediment to successful business adoption:

    More than 85% of respondents report that their firms have started programs to create data-driven cultures, but only 37% report success thus far. Big data technology is not the problem; management understanding, organizational alignment, and general organizational resistance are the culprits. If only people were as malleable as data.¹⁴

    It is no surprise that very few companies have followed the example of Amazon and replaced highly paid white-collar decision-makers with algorithms, despite the enormous success Amazon has achieved.¹⁵

    The challenge in delivering successful data science has much less to do with technology than with cultural attitudes: many organizations alternately treat data science as a box-ticking exercise or as part of the never-ending pursuit of a perfect solution to all their challenges. Nor is the problem the effectiveness of algorithms. Algorithms and technology are well ahead of our ability to feed them high-quality data, overcome people barriers (skills, culture, and organization), and implement data-centric processes. However, these symptoms are themselves the result of deeper root causes, such as a lack of knowledge of the best way to use data to make decisions, legacy perceptions from last century's approach to handling data and delivering analytics, and a shortage of support for data analytics.

    The Knowledge Gap

    Multiple knowledge gaps make it hard to embed data science in organizations, starting at the very top of an organization. Nevertheless, it is too easy to blame business leaders and IT professionals for the failure to deliver results. The knowledge gap is a two-way street, because data scientists must share the blame.

    The Data Scientist Knowledge Gap

    Data science aims to facilitate better decisions, leading to beneficial actions, by extracting knowledge from data. To enable better decisions, data scientists need a good understanding of the business domain so they can understand the business problem, identify the right data and prepare it (often detecting quality issues for the first time), employ the right algorithms on the data, validate their approaches, convince stakeholders to act, operationalize their output, and measure results. This breadth of scope necessitates an extensive range of skills: the ability to collaborate and coordinate with multiple functions within the organization in addition to their own job area, critical and scientific thinking, coding and software development skills, and knowledge of a wide range of machine learning and statistical algorithms. Moreover, the ability to communicate complex ideas to a nontechnical audience is crucial, as is business acumen in a commercial setting. In the data science profession, someone with the combination of all these skills is known as a unicorn.

    Since finding a unicorn is rare if not impossible (they don't sign up to LinkedIn or attend meetups), organizations try to find the next best thing. They hire people with programming (Python or R), analysis, machine learning, statistics, and computer science skills, which happen to be the five skills most sought after by employers.¹⁶ These skills should be the differentiator between data scientists and everyone else. Unfortunately, this belief reinforces the mistaken conviction among junior data scientists that specialist technical skills should be the focus, and it tends to create dangerously homogeneous teams.

    "Create me a machine translation attention model using a bi-directional long short-term memory (LSTM) with an attention layer outputting to a stacked post-attention LSTM feeding a softmax layer for predictions," said no CEO, ever.

    When interviewing, I'm always staggered by the number of candidates who say their primary objective is a role that allows them to create deep learning, reinforcement learning, and [insert your algorithm here] models. Their aim is not to solve a real problem or help customers, but to apply today's hottest technique. Data science is too valuable to treat as a paid hobby.

    There is a disconnect between the skills data scientists think they need and what they really need. Unfortunately, technical skills are nowhere near enough to drive real success and beneficial actions from data-driven decisions. Faced with hard-to-access poor-quality data, lack of management support, no clear questions to answer, or results ignored by decision-makers, data scientists without senior data science leadership aren’t equipped to change the culture. Some look for greener pastures and get a new job, only to realize that similar challenges exist in most organizations. Others focus on the part of the process they can control, the modeling.

    In data science, there is an overemphasis on machine learning and deep learning and, especially among junior data scientists, a belief that working in isolation to maximize model accuracy scores on a test dataset is the definition of success. This behavior is encouraged by training courses, online articles, and especially Kaggle. High test set prediction accuracy seems a bizarre interpretation of success to me. In my experience, it is usually better to try ten solution scenarios than to spend weeks on premature optimization of a single solution, because you don't know in advance what is going to work. It is only when you get feedback from consumers and measure results that you see whether you have driven beneficial action or even useful learning. At that point, you can decide the value of expending further effort to optimize.

    The aim must be to get a minimum viable product into production. A perfect model on a laptop that never goes into production wastes effort, so it is worse than a model that does not exist. There are domains where model accuracy is paramount, such as medical diagnosis, fraud detection, and AdTech, but these are a minority compared to applications where doing anything is a significant improvement over doing nothing. Even in domains benefiting disproportionately from optimizing model accuracy, quantifying real-world impact is still more important.

    Getting a model into production requires different technical skills than creating the model, the most important of which are relevant software development skills. For many data scientists, who mainly come from non-software development backgrounds, coding is just a means to an end. They are unaware that coding and software engineering are disciplines with their own sets of best practices. Alternatively, if they are aware, they tend to see writing reusable code, version control, and testing or documentation as obstacles to be avoided.
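
    As a minimal sketch of what those practices look like in a data science context, here is a hypothetical pandas cleaning function with a pytest-style unit test; both the function and the data are invented for illustration:

        import pandas as pd

        def clean_prices(df: pd.DataFrame) -> pd.DataFrame:
            """Return a copy with non-positive prices removed and currency codes stripped of whitespace."""
            out = df.copy()
            out["currency"] = out["currency"].str.strip()
            return out[out["price"] > 0].reset_index(drop=True)

        def test_clean_prices():
            # Hypothetical fixture: one valid row, one invalid row.
            df = pd.DataFrame({"price": [10.0, -1.0], "currency": [" GBP", "GBP "]})
            cleaned = clean_prices(df)
            assert len(cleaned) == 1
            assert cleaned.loc[0, "currency"] == "GBP"

    Small, reusable, tested functions like this are what make analytical code reproducible once it moves beyond a single laptop.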

    Weak development skills cause difficulties, especially for reproducibility, performance, and quality of work. The barriers to getting models into production are not the sole responsibility of data scientists. Often, they do not have access to the tools and systems required to get their models into production and must therefore rely on other teams to facilitate implementation. Naïve data scientists ignore the gulf between local development and server-based production and treat it as a "throw it over the fence" problem, not thinking through the implications of their choice of programming language. This inexperience causes avoidable friction and failure.

    IT Knowledge Gap

    Superficially, data science and software development share similarities. Both involve code, data, databases, and computing environments. So, data scientists require some software development skills. However, there is a crucial distinction, demonstrated in Figure 1-1, between machine learning and regular programming.

    Figure 1-1. Difference between regular programming and machine learning

    In regular programming, rules or logic are applied to input data to generate an output (Output = f(Inputs), such as Z = X + Y) based on well-understood requirements. In machine learning, examples of outputs and their input data, along with the data's individual properties, known as features, feed a machine learning algorithm. The algorithm attempts to learn the rules that generate outputs from inputs by minimizing a cost function via a training process, and it will never achieve perfect accuracy on real-life data. Once suitably trained, the machine learning model's rules can be used like regular program logic to make predictions for new input data. The difference between regular programming and machine learning has profound implications for data quality, data access, testing, development processes, and even computing requirements.
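
    As a hedged sketch of this contrast, consider learning the trivial rule Z = X + Y from examples using scikit-learn; the addition example is invented for illustration:

        import numpy as np
        from sklearn.linear_model import LinearRegression

        # Regular programming: the rule Z = X + Y is written down explicitly.
        def add(x, y):
            return x + y

        # Machine learning: the same rule is learned from examples of inputs and outputs.
        inputs = np.random.rand(1000, 2)   # columns play the role of X and Y
        outputs = inputs.sum(axis=1)       # observed Z for each example
        model = LinearRegression().fit(inputs, outputs)

        print(add(2.0, 3.0))               # exactly 5.0
        print(model.predict([[2.0, 3.0]])) # approximately 5.0, learned from data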

    "Garbage in, garbage out" applies to regular programming, but high-quality data is even more essential for machine learning because the algorithm depends on good data to learn its rules. Poor-quality data will lead to inferior training and predictions. Generally, more data allows a machine learning algorithm to decipher more complexity and generate more accurate predictions. Moreover, more features inherent in the data enable an algorithm to improve predictive accuracy. Data scientists can also engineer additional features from existing data based on domain knowledge and experience.
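
    For instance, here is a brief sketch of feature engineering with pandas, using an invented transactions table; the derived features are illustrative assumptions about what domain knowledge might suggest:

        import pandas as pd

        # Hypothetical raw transactions data.
        df = pd.DataFrame({
            "amount": [120.0, 40.0, 300.0],
            "n_items": [3, 1, 6],
            "timestamp": pd.to_datetime(["2019-03-01 09:30", "2019-03-02 23:10", "2019-03-03 14:05"]),
        })

        # Engineered features: spend per item and time-of-day signals a model might use.
        df["amount_per_item"] = df["amount"] / df["n_items"]
        df["hour"] = df["timestamp"].dt.hour
        df["is_late_night"] = (df["hour"] < 6) | (df["hour"] >= 22)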

    Model training is iterative and computationally expensive. High-capacity memory and more powerful CPUs allow you to use more data and more sophisticated algorithms. The languages and libraries used to create machine learning models are specialized for data analytics, typically the R and Python programming languages with their packages and libraries, respectively. However, once a model has been created, deployment processes are much more familiar to a software developer.

    In regular programming, logic is the most important part; that is, ensuring that the code is correct is critical. Development and testing environments often do not need high-performance computers, and sample data with sufficient coverage is enough to complete testing. In data science, both the data and the code are critical. There is no correct answer to test for, only an acceptable level of accuracy. Often, minimal code (compared to regular programming) is required to fit and validate a model to high test accuracy (e.g., 95%). The complexity lies in ensuring that data is available, understood, and correct.
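
    As a rough illustration of how little code that step can take, here is a sketch using scikit-learn and one of its bundled demonstration datasets; the dataset and model choice are assumptions for illustration:

        from sklearn.datasets import load_breast_cancer
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split

        # A handful of lines fits and validates a model; the hard part is the data, not this code.
        X, y = load_breast_cancer(return_X_y=True)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
        model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
        print(f"Test accuracy: {model.score(X_test, y_test):.2f}")  # typically around 0.95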

    Even for training, data must be production data, or the model will not predict well on new unseen data with a different distribution. Sample data is not useful unless data scientists select test data as part of a deliberate strategy (e.g., random sampling, cross-validation, stratified sampling) specific to the problem. Although these requirements relate to machine learning, they apply to all data science work in general.
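
    A short sketch of two such deliberate strategies, stratified splitting and cross-validation, again using scikit-learn; the imbalanced synthetic data is an assumption for illustration:

        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

        # Synthetic imbalanced data: roughly 10% positive class.
        X = np.random.rand(1000, 5)
        y = (np.random.rand(1000) < 0.1).astype(int)

        # Stratified sampling preserves the class balance in both train and test splits.
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

        # Cross-validation scores the model on several held-out folds rather than a single split.
        scores = cross_val_score(LogisticRegression(), X, y, cv=StratifiedKFold(n_splits=5))
        print(scores.mean())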

    The needs of data scientists are often misinterpreted by IT, even by those who are supportive, as nice-to-haves. Data scientists are frequently asked to justify why they need access to multiple data sources, complete production data, specific software, and powerful computers when other developers don't need them and reporting analysts have mined data for years by just running SQL queries on relational databases. IT is frustrated that data scientists don't understand the reasons behind IT practices. Data scientists are frustrated because it's not easy to justify the value of what they consider necessities upfront. More than once, the question "But why do you need this data?" has made my heart sink.

    It is rare to see IT processes designed to support advanced analytics. It starts with the way data is captured. Many developers see capturing new data items as a burden with an associated cost in planning, analysis, design, implementation, and maintenance time. Most organizations, therefore, collect data foremost to support operational processes like customer relationship management (CRM), financial management, supply chain management, e-commerce, and marketing. Frequently, this data resides in separate silos, each with its own strict data governance strategy.

    Often, data will go through an ETL (Extract, Transform, Load) process to transform it into structured data (typically a tabular format) before loading it into a data warehouse to make it accessible for analytics. There are drawbacks to this approach for data science. Only a subset of data makes its way through the ETL process, and that subset is typically prioritized for reporting. Adding new data items can take months of development. As a result, raw data is unavailable to data scientists. Raw data is what they need!

    Traditional data warehouses usually only handle structured data in relational schemas (with data split across multiple joinable tables to avoid duplication and improve performance) and can struggle to manage the scale of data we have available today. They also don’t handle modern use cases that require unstructured data like text or sometimes even machine-generated semi-structured data formats like JSON (JavaScript Object Notation). One solution is the creation of data lakes where data is stored in raw native format and goes through an ELT (Extract, Load, Transform) process when needed, with the transformation being dependent on the use case.
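
    To make the ETL/ELT distinction concrete, here is a small sketch in Python; the event records and field names are invented for illustration:

        import json
        import pandas as pd

        # Stand-in for raw, semi-structured event logs.
        raw_events = [json.dumps({"user": i, "action": "click", "meta": {"page": "home"}})
                      for i in range(3)]

        # ETL: transform to a fixed tabular schema first, then load; unselected detail is lost.
        rows = [{"user": json.loads(e)["user"], "action": json.loads(e)["action"]}
                for e in raw_events]
        warehouse_table = pd.DataFrame(rows)  # the nested meta field never reaches the warehouse

        # ELT: load raw JSON into the lake as-is; transform later, per use case.
        lake = list(raw_events)
        per_use_case = pd.json_normalize([json.loads(e) for e in lake])  # keeps meta.page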

    When a data lake is not available, data scientists must extract the data themselves and combine it on a local machine, or work with data engineers to build pipelines into an environment with the tools, computing resources, and storage they need. Requests to access data, provision environments, and install tools and software are often the responsibility of separate teams with varying concerns about security, costs, and governance. As such, data scientists need to work with different groups to deploy and schedule their models, dashboards, and APIs. With processes in place that are incongruent with data science needs, costs are greatly elevated.

    The entire data lifecycle splits across many IT teams, each of which, in isolation, makes rational decisions based on its functional silo objectives. Such silo objectives do not serve data scientists. For data scientists, who need data pipelines from raw data to final data product, significant challenges arise. They need to justify their requirements and negotiate with multiple stakeholders to complete a delivery. Even if they are successful, they will still be dependent on other teams for many tasks and at the mercy of backlogs and prioritization. No one person or function is responsible for the entire pipeline, leading to delays, bottlenecks, and operational risk.

    Data security and privacy are occasionally cited as obstacles to accessing and processing data. There are genuine concerns to ensure compliance with regulations, respect user privacy, protect reputation, defend competitive advantage, and prevent malicious damage. However, such concerns can also be used to take a risk-averse route and not implement solutions that allow for the safe, legitimate, and ethical use of data. More typically, problems occur when data security and privacy policies are implemented without undertaking a thorough cost-benefit analysis and fully understanding the impact on data analytics.

    Technology Knowledge Gap

    Although technology is not the only barrier to the successful implementation of data science, it is still crucial to get tooling right. Figure 1-2 shows the typical hardware and software layers in a data lifecycle from raw data to useful business applications.

    Figure 1-2. Typical hardware and software layers in the data lifecycle

    Many software and hardware requirements need to come together to create data products. There must be a holistic understanding of requirements and a balance of investment across all the lifecycle layers. Unfortunately, it is easy to focus on one part of the jigsaw to the detriment of others. Large enterprises tend to concentrate on the big data technologies used to build applications. They obsess over Kafka, Spark, and Kubernetes, but fail to provide their data scientists with sufficient access to the data, software libraries, and tools they need. Smaller organizations are more likely to provide their data scientists with the software tools they need, but may fail to invest in storage and processing technologies, leaving analytics processing isolated on laptops.

    Even if they do get the investment in tools right, organizations can still underestimate the supporting resources needed to build, maintain, and optimize the stack. Without sufficient talent in data engineering, data governance, DevOps, database administration, solutions architecture, and infrastructure engineering, it is next to impossible to utilize the tools.
