Capitalizing Data Science: A Guide to Unlocking the Power of Data for Your Business and Products (English Edition)
Ebook · 505 pages · 4 hours

About this ebook

Can you foresee how your company and its products will benefit from data science? How can the results of using AI and ML in business be tracked and questioned? Do questions like ‘how do you build a data science team?’ keep popping into your head?
All these strategic concerns and challenges are addressed in this book.

Firstly, the book explores the evolution of decision-making based on empirical evidence. It then compares the data-supported era with the current data-led era. It also discusses how to successfully run a data science project and what its lifecycle looks like. The book dives fairly deep into many of today's data-led applications, highlights example datasets, discusses obstacles, and explains machine learning models and algorithms intuitively.

The book also covers structural and organizational considerations for building a data science team, and recommends an optimal data science organization structure based on the company's stage of development. Finally, it helps technology leaders understand data science's effects on their businesses.
Language: English
Release date: Dec 3, 2022
ISBN: 9789355511591

    Book preview

    Capitalizing Data Science - Mathangi Sri Ramachandran

    CHAPTER 1

    Data-Driven Decisions from Beginning to Now

    Introduction

    Data and data science are ubiquitous. From driving cars to finding the nearest restaurant, data science is the backbone of technology today. In order to appreciate the current state better, we need to turn to the pages of history. This chapter traces the genesis of data science and its evolution by highlighting some of the key applications over a period of time. Broadly, there are two phases in the history of data-driven decisioning in organizations: one in which data plays a supporting role and another in which data plays the central and pivotal role. In this chapter, we will discuss the use cases in both phases and also what led to the current boom in the adoption of data science across different organizations. Toward the end of the chapter, we discuss the current challenges that could be addressed to further improve the impact that AI and data science can together create.

    Data-driven decisions and their phases

    If I were to ask you whether you would like to have coffee "with sugar or without", your response would be a function of your preferences, the numerous articles you have read on the impact of sugar, your exercise routine that week, existing health conditions, and so forth. All these data are processed in your brain before you provide an answer. Human beings seem, by nature, to be data-driven. Using data to make decisions has been prevalent for centuries now; it is not a recent concept. For example, there is an early mention of data in ancient India, in both the Rig Veda and the Arthashastra, where it referred to governance with the help of data (https://www.drishtiias.com/to-the-points/Paper2/census-in-india).

    However, what we have witnessed in the recent past has been an increase in the intensity and penetration of data in decision-making processes for commercial purposes. We could possibly trace the history of using data for day-to-day commercial decisioning to a company called the Manchester Guardian Society in 1826. This company used to publish the creditworthiness of customers as a newsletter every week. Banks could then use these newsletters to make decisions about their customers. This company later became Experian, one of the pioneering credit bureau companies in the world. Credit bureaus provide financial risk information about consumers to financial institutions. As we can see, banks have been the forerunners in using data for decisions and have been making large-scale data-driven decisions for more than a century now. Once banks started using data for decisioning, other industries soon followed suit.

    We can divide data-driven decisioning in organizations into two distinct phases: one being "human-led and data-supported" decisions, the other being "data-led and human-guided" decisions. (For ease of reading, let us refer to data-driven decisioning as D3 henceforth.) From the 1950s till about the early 2000s, the former type of decisioning was the most prevalent in industry, and in the last 10+ years, we have seen the prevalence of the latter.

    Human-led and data-supported decisions

    In the first phase of D3, data was restricted to a support function. Data was used for testing hypotheses and validating human decisions. Analysis was guided more by domain expertise and understanding of business processes. This was also a phase of instrumenting and storing large-scale data. There were advances in business intelligence and in tools that support dashboarding and reporting. Toward the beginning of 2000, there was a push toward deriving actionable results from this large store of data. However, data remained a back-office function and did not come into mainstream decision-making. This was an era of statistical analysis. Organizations were hiring statisticians who could help them with experimenting on and analyzing data. Some of the key industries that benefited from the use of statistics were banking, health care (especially clinical trials), manufacturing, and retail. Let us discuss some of the use cases in detail here:

    Risk scores in banking: Predictive models like regression, CHAID, and CART were used to understand the risk profile of customers. Given the transaction history of the user and the data from the credit bureau, the problem is to predict the default probability of the user. Statisticians built decision trees that could predict the risk score of a user, which were then used to approve or decline loans. A decision tree is a popular tool that tries to classify the target/dependent variable (in this case, default probability) by using the relationship of independent variables (such as age, income, and so on) to the dependent variable. An example could be to explain the default rate with age and income ranges. A decision tree is created with default rate as the dependent variable and age and income as the independent variables. In this case, shown in figure 1.1, the default probability is higher in the lower age bucket than in the higher age bucket. After the split on the age bucket, the default probability is explained by income buckets. Please refer to the following figure:

    Figure 1.1: Decision tree for default rates

    If the bank wants to reduce its default rate to 2%, then it needs to target users with age > 30 and "medium" and "high" income ranges. This tree could be arrived at using algorithms that find the best split at each node, or it could be built using what a risk manager thinks is the right split to go with. As the default rate gets better, the set of eligible users also decreases. The risk manager arrives at a trade-off between risk and user coverage to come up with a suitable set of rules. These trees are very easy to explain and to use for making a decision. Such techniques were widely used to make credit approval decisions in the banking industry.
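    The tree in figure 1.1 is hand-built, but such a tree can also be learned from data. The following is a minimal sketch using scikit-learn on synthetic age/income data; the feature names, default rates, and tree settings are made up for illustration and are not from the book.

```python
# A minimal sketch: learning a small default-risk decision tree on synthetic data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 5000
age = rng.integers(21, 65, size=n)
income = rng.choice(["low", "medium", "high"], size=n, p=[0.4, 0.4, 0.2])

# Illustrative assumption: younger, lower-income users default more often.
default_prob = 0.02 + 0.06 * (age < 30) + 0.04 * (income == "low")
defaulted = (rng.random(n) < default_prob).astype(int)

income_level = np.select([income == "low", income == "medium"], [0, 1], default=2)
X = np.column_stack([age, income_level])

tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=200, random_state=0)
tree.fit(X, defaulted)

# Print the learned splits; a risk manager could instead hand-pick them.
print(export_text(tree, feature_names=["age", "income_level"]))
```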

    Clinical trials in health care: Before a new drug is launched in the market, it is tested on a set of patients to determine whether it acts on the underlying condition and produces the desired result. Patients are divided into two groups: test and control. The test set of patients is administered the medicine in question, and the control set of patients is administered a placebo. A placebo is an interesting idea where you give a drug with no therapeutic value to patients. The results are then compared between these two groups to know whether the drug produces a significant difference in the test group.
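    As a rough illustration of how the two groups could be compared, the sketch below runs a chi-squared test on made-up recovery counts; the book does not prescribe a specific test, and all the numbers here are hypothetical.

```python
# A minimal sketch: comparing recovery rates between a test group (drug)
# and a control group (placebo) with a chi-squared test on made-up counts.
from scipy.stats import chi2_contingency

# Rows are groups, columns are (recovered, not recovered).
counts = [
    [120, 80],   # test group
    [90, 110],   # control group
]

chi2, p_value, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.4f}")
# A small p-value suggests the difference between the groups is unlikely
# to be due to chance alone.
```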

    Interestingly, the idea of a "placebo" as a control-group treatment originated as early as 1800, when a ship captain administered different treatments to his crew to cure them of a disease they caught on board. Formal experimentation in trials started taking a definitive shape in the 1940s.

    The 1962 drug amendment act in the US (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5299804/) made it mandatory for all drugs to show "substantial evidence" of drug safety using clinical trials. This gave rise to a new field called biostatistics: the application of quantitative methods in the field of biology. Some common areas of application are clinical trials, genomics, epidemiology, and so on.

    Survival analysis, another area that developed as part of biostatistics, is used in the medical field to predict the mortality rate of subjects under treatment. Given a set of factors and historic data, the statistical tool can provide the "duration of survival". This technique then also started to be used in fields other than medicine, for instance, to predict how long users will take to unsubscribe in a subscription-based business like telecom.
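    A minimal sketch of the subscription example follows, using the third-party lifelines library (not mentioned in the book) to fit a Kaplan-Meier survival curve on hypothetical churn data.

```python
# A minimal sketch: estimating subscriber "survival" (time until cancellation).
import numpy as np
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(0)
# Hypothetical data: months until a subscriber cancels, and whether the
# cancellation was observed (True) or the subscriber is still active (False).
durations = rng.exponential(scale=12, size=500)
observed = rng.random(500) < 0.7

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=observed)

print(kmf.median_survival_time_)  # typical months until cancellation
print(float(kmf.predict(6)))      # probability of still being subscribed at month 6
```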

    Design of experiments in manufacturing: Design of experiments (DOE) is a branch of statistics that helps in setting up and studying unbiased experiments. Manufacturing (where this field primarily evolved) has a lot of process parameters that need to be tuned to get the best outcome. Process parameters could be the setting of temperature, pressure, or any other variable that controls the quality of the output. DOE helps in designing such experiments and studying the impact of input parameters on the output. Let us say that process A is dependent on pressure and temperature. Say each of these factors operates at two settings: T1 and T2 for temperature, and P1 and P2 for pressure. We want to understand the quality of the output based on these settings and choose the best setting. Quality could be measured in terms of acceptance rates of the product after a quality check. The design of experiments table would look as shown in Table 1.1:

    Table 1.1: Design of experiments

    The experiment trials are randomized and repeated for replication and to bring down the errors in the experimentation. Once the results are measured, we analyze them to understand the best combination of temperature and pressure. This may sound trivial when the factors (for example, temperature and pressure) and their levels (T1, T2, P1, and P2) are limited. The table could become very large as we increase the number of factors and their levels. In such scenarios, there are experimental designs called "fractional factorial" designs that help us test a subset of combinations, which in turn reduces the overall cost. DOE is extensively used in industrial processes and has now started being used in consumer studies as well. Some e-commerce organizations use DOE to design and test banner ads.
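    A minimal sketch of the 2x2 full factorial design behind Table 1.1 is shown below; the acceptance rates are made up for illustration, and in practice each run would be randomized and replicated before comparing settings.

```python
# A minimal sketch: enumerating a 2x2 full factorial design for temperature
# and pressure and picking the best setting by (hypothetical) acceptance rate.
from itertools import product

temperatures = ["T1", "T2"]
pressures = ["P1", "P2"]

# Hypothetical acceptance rates measured after each combination of settings.
acceptance = {
    ("T1", "P1"): 0.91,
    ("T1", "P2"): 0.86,
    ("T2", "P1"): 0.95,
    ("T2", "P2"): 0.88,
}

for t, p in product(temperatures, pressures):
    print(f"temperature={t}, pressure={p}, acceptance={acceptance[(t, p)]:.2f}")

best_setting = max(acceptance, key=acceptance.get)
print("Best setting:", best_setting)
```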

    Providing actionable insights: One of the key use cases of data in phase 1 of D3 was to provide actionable insights to businesses. Examples would be to provide answers to questions like "Why are sales lower this month?", "What are the estimated sales of the intended product?", "Why are customers attriting?", "Are customers liking our ads on TV?", and so on. The responses to these questions would involve generating a set of hypotheses and validating whether they are statistically sound. Such questions are critical to many businesses, and they continue to be solved in today's businesses as well. In the earlier era, insights were produced by generating a set of hypotheses and validating each of those. Take, for example, the question "Why are customers attriting?". A list of hypotheses would be generated between the analyst/statistician and the domain expert. The domain expert could be a business manager, marketing manager, or customer support head. These hypotheses would typically involve two variables: the attrition rate of the customer, and the variable in question (for example, increase in price, tenure of the customer, % customer complaints, and so on). Hypotheses are investigated one at a time to conclude which could be causing customer attrition. Such techniques are nowadays driven by multivariate machine learning models, which help us investigate the impact of many different variables in a single mathematical formulation. However, understanding the cause of an event is not straightforward. Correlation does not imply causation, and this is an important principle to keep in mind while mining data for causal impacts. One way to understand causation is to simulate the causal effect (suggested by correlations) through experiments rather than by using inferential methods on observational data. In our example, to understand the reasons for attrition, the company could decrease/increase the price (keeping other variables constant) or other such correlated variables and study whether it impacts attrition rates.
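    To illustrate the shift from one-variable-at-a-time hypotheses to a single multivariate formulation, here is a minimal sketch that fits a logistic regression relating attrition to several variables at once. The variable names and effect sizes are invented, and, as noted above, the fitted coefficients capture correlation, not causation.

```python
# A minimal sketch: a multivariate model of attrition on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000
price_increase = rng.random(n)            # relative price increase seen by the user
tenure_months = rng.integers(1, 60, n)    # how long the user has been a customer
complaint_rate = rng.random(n) * 0.2      # share of orders with a complaint

# Invented relationship: price increases and complaints raise attrition,
# longer tenure lowers it.
logit = -2 + 3 * price_increase + 5 * complaint_rate - 0.03 * tenure_months
attrited = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([price_increase, tenure_months, complaint_rate])
model = LogisticRegression(max_iter=1000).fit(X, attrited)
print(dict(zip(["price_increase", "tenure_months", "complaint_rate"],
               model.coef_[0].round(2))))
```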

    The preceding use cases played a vital role during phase 1 of D3. As you can see, the use cases are more static and focused on providing insights to decision-makers. Human intuition and domain knowledge played a greater role in shaping the data strategy of organizations. Decision-makers consumed these insights and acted on them as they deemed fit. This phase is, hence, insight-driven rather than impact-driven. It was difficult to quantify the dollars impacted by these insights or analyses. Organizations could not pinpoint the value the analytics teams were driving. Any analytical project that only provides insights suffers from one of two possible states: either the provided insights are very intuitive and hence already known to the decision-makers, or they are too unexpected and hence difficult to believe and act upon. Also, generating new insights on a periodic basis is not possible if the trends do not change significantly. Hence, data could not add incremental value.

    Data-led and human-guided

    This phase marks the beginning of the widespread use of "data science". The term data science can be traced to Peter Naur, who used it freely in his 1974 work Concise Survey of Computer Methods (https://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/?sh=2df2f2b055cf). Three things needed to happen for D3 to shift from human-led to data-led: the immense growth in data, accelerated by advances in computing power; the availability of machine learning algorithms; and suitable use cases that prove the value of a data-first approach. Let us see each of these factors in detail:

    Growth of data: The growth of the internet in the 1990s resulted in a tremendous growth in the volume of data collected. The chart from Michael Lesk's article on the growth of data from 1995 to 1998 (http://www.lesk.com/mlesk/ksg97/ksg.html), shown in figure 1.2, illustrates the point of "data explosion". Please refer to the following figure:

    Figure 1.2: Growth of data

    We can also see the explosion of growth in data in the 2000s. The chart in figure 1.3 is from IDC’s study (https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf). Please refer to the following figure:

    Figure 1.3: Growth of data from 2000

    With this scale of growth in data, organizations started placing data at the forefront. The data growth was also accelerated by an increase in computing power. Cheaper computing resources enabled data ingestion, storage, and retrieval at scale. The high volume and variety of data (text, audio, and so on) enabled predictive models to make better decisions that had a higher business impact.

    Availability of machine learning algorithms: To churn this ocean of data, you need more efficient tools than a purely statistical approach. It is humanly impossible to make sense of this data by analyzing one dimension at a time. When we have just 10 variables to predict with, it is easy to come up with and guide the hypotheses. When there are thousands of variables, we need sophisticated techniques. Hence, some of the hypothesis-led approaches of the earlier era got replaced by large-scale data mining techniques. Interestingly, machine learning algorithms had themselves been around since the 1960s, and the data explosion of the 1990s and early 2000s provided the right canvas and the right use cases. Machine learning is a stochastic process that learns from the errors of one iteration of prediction and improves the prediction of the next iteration based on those errors. A large number of instances or data points works in its favor to increase prediction accuracy. Heuristic processes, in contrast, are static and cannot learn as dynamically as machine learning models do.
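    As a toy illustration of this error-driven, iterative learning, the sketch below fits a one-variable linear model by gradient descent; the data, learning rate, and number of steps are made up for illustration.

```python
# A minimal sketch: each iteration's errors drive the next iteration's update.
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1_000)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=1_000)  # true slope 3, intercept 1

w, b, lr = 0.0, 0.0, 0.5
for step in range(200):
    pred = w * x + b
    error = pred - y                    # errors of this iteration...
    w -= lr * 2 * (error * x).mean()    # ...drive the update for the next one
    b -= lr * 2 * error.mean()

print(round(w, 2), round(b, 2))  # approaches the true slope and intercept
```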

    The need to solve the right use cases: The earliest adopters of machine learning algorithms happened to be the "digital first" organizations like Google, Amazon, Netflix, and so on. Take the example of the search and recommendations problem in an eCommerce company: it is impossible to solve with uni-dimensional or bi-dimensional analysis and hypotheses. The Netflix challenge of 2006 (https://en.wikipedia.org/wiki/Netflix_Prize) is an example of the initial use cases that needed to be solved by machine learning models. The dataset itself was fairly huge even by today's standards: over 100 million ratings of 17,770 movies from 480,189 customers (https://www.thrillist.com/entertainment/nation/the-netflix-prize). Solving the Netflix challenge involved applying plenty of machine learning methods. In a lot of these use cases, a 1% improvement in accuracy could improve the top line by multiple millions of dollars. Once the dollar impact got established by the early adopters, machine learning was soon picked up for traditional use cases as well, which were earlier solved using statistical techniques.
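    The following is a minimal sketch of one family of techniques popularized by the Netflix Prize, matrix factorization trained by stochastic gradient descent on a tiny, made-up ratings set; it is only an illustration, not the prize-winning solution.

```python
# A minimal sketch: matrix factorization for recommendations on toy ratings.
import numpy as np

rng = np.random.default_rng(0)
# Observed ratings as (user, movie, rating) triples for 5 users and 5 movies.
ratings = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 2, 1), (2, 2, 5),
           (2, 3, 4), (3, 1, 2), (3, 3, 5), (4, 0, 3), (4, 4, 4)]

k = 2                                    # number of latent factors
U = rng.normal(scale=0.1, size=(5, k))   # user factors
V = rng.normal(scale=0.1, size=(5, k))   # movie factors
lr, reg = 0.05, 0.02

for epoch in range(200):
    for u, m, r in ratings:
        err = r - U[u] @ V[m]                   # prediction error for this rating
        U[u] += lr * (err * V[m] - reg * U[u])  # nudge factors to reduce the error
        V[m] += lr * (err * U[u] - reg * V[m])

# Predict an unseen rating: how user 0 might rate movie 2.
print(round(float(U[0] @ V[2]), 2))
```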

    "Data Science" became a separate department in many organizations starting in the late 2000s. Data science teams started building real-world applications that impacted the top-line and bottom-line of various organizations. Today, every industry uses data science in various ways—right from credit decisioning of banks to recommending products in an eCommerce website to inventory management in manufacturing industries. The uptake of data science solutions across all industries is phenomenal.

    Applications of data science

    Let us see the applications of data science across the customer lifecycle for the eCommerce industry. The customer lifecycle is divided into seven stages (https://www.sciencedirect.com/science/article/pii/S2212567115000313?via%3Dihub): Initiation, Acquisition, Regain, Maintenance, Expansion, Retention, and Exit. Following are some of the key use cases for the eCommerce industry in each of these phases. We have combined the maintenance and expansion phases into one. So, we have listed six phases here:

    Initiation

    This is a phase of seeking new users for the product. Digital ads optimization for new user acquisition, as well as ads budget optimization, could be examples of cases where data science is very useful in the eCommerce industry. The focus is more on awareness about the product or the company.

    Acquisition

    In this stage, the user is targeted through various channels and offers. The targeted customers are on-boarded with the right messaging and value proposition. In a transaction platform like eCommerce, this phase extends till the first transaction. Some organizations consider customers as "new users" till the first 30 days from on-boarding or the first transaction. Some of the relevant data science use cases here could be as follows:

    Optimizing targeting campaigns to provide a better return on investment.

    Once the user onboards, understanding the drivers of early drop-offs. Early drop-offs are customers who drop off after on-boarding without making their first transaction.

    Predicting early drop-offs using signals like the channel of acquisition, device details, temporal variables (time of day, day of the week, and so on), and early browsing behavior. At this stage, we have very little data about the user, and this poses some challenges for the machine learning algorithms used for such problems (see the sketch after this list).

    Understanding early indicators of loyal users. This understanding is critical for retaining users at a later stage.
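    A minimal sketch of such an early drop-off model is shown below; the feature names, the synthetic labels, and the choice of a simple logistic regression are all illustrative assumptions, not the book's recipe.

```python
# A minimal sketch: predicting early drop-offs from sign-up signals.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5_000
df = pd.DataFrame({
    "channel": rng.choice(["paid_search", "social", "referral"], n),
    "device": rng.choice(["android", "ios", "web"], n),
    "signup_hour": rng.integers(0, 24, n),
    "pages_viewed_day1": rng.poisson(3, n),
})
# Illustrative label: users who barely browse on day one tend to drop off.
dropped_off = (df["pages_viewed_day1"] < 2) & (rng.random(n) < 0.8)

pre = ColumnTransformer(
    [("cat", OneHotEncoder(), ["channel", "device"])],
    remainder="passthrough",
)
model = make_pipeline(pre, LogisticRegression(max_iter=1000))
model.fit(df, dropped_off)
print(model.predict_proba(df.head(3))[:, 1].round(2))  # drop-off probabilities
```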

    Maintenance and expansion

    These two phases have a lot in common and are practically treated the same in eCommerce and similar industries. Hence, we will treat this as a single stage/phase of the customer lifecycle. This is an active customer management phase. It involves actively servicing customers, making sure the customer is active and transacting. Generally, the customers in this phase contribute to the profitability of the organization. Hence, growing customers involves providing the right user experience and also shaping some of their behavior toward purchasing profitable products. Key use cases of data science here could be as follows:

    Optimizing search and recommendations in eCommerce sites. This produces relevant results for the user and helps her to purchase faster. The search results today are powered by machine learning algorithms and are optimized based on user, transaction history, and query attributes rather than being restricted to the keyword the user searched.

    Identifying delivery or logistics-related problems so that they can be solved to improve customer satisfaction. Higher satisfaction drives better repeat behavior from the user. Examples here include predicting the "Expected time of arrival" of the product based on product attributes and delivery location, assigning the right logistics partner depending on serviceable areas, penalizing or closing suppliers based on poor serviceability in the past, and so on.

    Targeting existing customers with better offers so that they engage better and transact more.

    Summarizing product reviews so that users can easily make a purchase decision.

    Cross-sell and upsell using machine learning models.

    Retention

    Before customers decide to exit the product or service, companies try to retain them. Hence, predicting the customers who would potentially attrite and intervening with the right mechanism are the key use cases that data science solves in this phase. If a company chooses to use discount offers to retain existing customers, one of the key things that machine learning models need to solve is improving the efficiency of the campaigns. A campaign is said to have high efficiency when the users targeted by the campaign purchase much more when they are given the offer than when they are not.
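    As a small worked example of this efficiency measure, the sketch below computes the uplift of a hypothetical retention campaign: the purchase rate of targeted users minus that of a held-out control group. All the counts are made up.

```python
# A minimal sketch: campaign efficiency as uplift over a held-out control group.
targeted_purchases, targeted_users = 1_800, 10_000   # users who got the offer
control_purchases, control_users = 1_200, 10_000     # held out, no offer

targeted_rate = targeted_purchases / targeted_users
control_rate = control_purchases / control_users
uplift = targeted_rate - control_rate

print(f"targeted: {targeted_rate:.1%}, control: {control_rate:.1%}, "
      f"uplift: {uplift:.1%}")
```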

    Exit

    Growing and focusing on profitable "good" customers involves removing or penalizing bad customers. Customers are not always right. We need to be able to stop the transactions of the wrong customers to provide a better experience to the right customers. Take the eCommerce industry, for example, where fraudsters abuse the system and take undue advantage of returns and "cash on delivery" options. Penalizing fraudsters helps provide the right user experience to the right users. Some of the use cases here are listed as follows:

    Building machine learning models that identify fraudulent customers and fraudulent transactions.

    Identifying the right penalty for the right set of customers.

    Identifying "communities" or groups of customers who are related to each other and do fraudulent transactions that benefit each other.

    Identity fraud: customers provide a false identity to make use of offers meant for "first-time users".

    Identifying and removing fake reviews so that genuine users are not impacted by the fake reviews of the product.

    Regain

    Regain is the stage when the existing customers
