Meta-Analytics: Consensus Approaches and System Patterns for Data Analysis
Ebook · 657 pages · 7 hours


About this ebook

Meta-Analytics: Consensus Approaches and System Patterns for Data Analysis presents an exhaustive set of patterns for data scientists to use on any machine learning-based data analysis task. The book virtually ensures that at least one pattern will lead to better overall system behavior than the use of traditional analytics approaches alone. The book is ‘meta’ to analytics, covering general analytics in sufficient detail for readers to engage with, and understand, hybrid or meta-approaches. The book has relevance to machine translation, robotics, biological and social sciences, medical and healthcare informatics, economics, business, and finance.

In addition, the analytics within can be applied to predictive algorithms for everyone from police departments to sports analysts.

  • Provides comprehensive and systematic coverage of machine learning-based data analysis tasks
  • Enables rapid progress towards competency in data analysis techniques
  • Gives exhaustive and widely applicable patterns for use by data scientists
  • Covers hybrid or ‘meta’ approaches, along with general analytics
  • Lays out information and practical guidance on data analysis for practitioners working across all sectors
Language: English
Release date: Mar 10, 2019
ISBN: 9780128146248
Author

Steven Simske

Steven J. Simske is an HP Fellow and Director at Hewlett Packard Labs and has worked in machine intelligence and analytics for the past 25 years, with domains extending from medical image analytics to text summarization. He has performed research relevant to meta-analytics for over 20 years at HP Labs and in collaboration with major universities in the US and Brazil.


    Book preview

    Meta-Analytics - Steven Simske

    2019

    Chapter 1

    Introduction, overview, and applications

    Abstract

    In this mammoth chapter, we cover the basic background material in statistics, machine learning, and artificial intelligence needed to understand the ever-broadening field of analytics. This chapter also introduces the software, data mining, and knowledge discovery skills necessary for the data scientist to proceed toward meta-analytics, that is, the next generation of analytics in which systems are hybrid by design and use multiple analytics to deduce valuable information about the data. Two longer sections at the end of the chapter show how to build a classifier from the ground up that incorporates many of the statistical approaches of the earlier sections.

    Keywords

    Algorithms; Analytics; Artificial intelligence; Deep learning; Deep unlearning; Classification; Data mining; Machine intelligence; Machine learning; Parallelism; Recognition; Statistics; System architecture; Systems

    It is a capital mistake to theorize before one has data

    Arthur Conan Doyle (1887)

    Numquam ponenda est pluralitas sine necessitate

    William of Ockham, Duns Scotus, et al. (c. 1300)

    E pluribus unum

    US Motto

    1.1 Introduction

    We live in a world in which more data have been collected in the past 2–3 years than were collected in the entire history of the world before then. Based on the trends of the past few years, we’ll be saying this for a while. Why is this the case? The confluence of nearly limitless storage and processing power has, quite simply, made it far easier to generate and preserve data. The most relevant question is, perhaps, not whether this will continue, but rather how much of the data will be used for anything more than filling up storage space.

    The machine intelligence community is, of course, interested in turning these data into information and has had tremendous success to date albeit in somewhat specific and/or constrained situations. Recent advancements in hardware—from raw processing power and nearly limitless storage capacity, to the architectural revolution that graphics processing units (GPUs) bring, to parallel and distributed computation—have allowed software developers and algorithm developers to encode processes that were unthinkable with the hardware of even a decade ago. Deep learning and in particular convolutional neural networks, together with dataflow programming, allow for an ease of rolling out sophisticated machine learning algorithms and processes that is unprecedented, with the entire field having by all means a bright future.

    Taking the power of hybrid architectures as a starting point, analytic approaches can be upgraded to benefit from all components when employing a plurality of analytics. This book is about how simple building blocks of analytics can be used in aggregate to provide systems that are readily optimized for accuracy, robustness, cost, scalability, modularity, reusability, and other design concerns. This book covers the basics of analytics; builds on them to create a set of meta-analytic approaches; and provides straightforward analytics algorithms, processes, and designs that will bring a neophyte up to speed while augmenting the arsenal of an analytics authority. The goal of the book is to make analytics enjoyable, efficient, and comprehensible to the entire gamut of data scientists—in what is surely an age of data science.

    1.2 Why is this book important?

    First and foremost, this book is meant to be accessible to anyone interested in data science. Data already permeate every science, technology, engineering, and mathematics (STEM) endeavor, and the expectations to generate relevant and copious data in any process, service, or product will only continue to grow in the years to come. A book helping a STEM professional pick up the art of data analysis from the ground up, providing both fundamentals and a roadmap for the future, is needed.

    The book is aimed at supplying an extensive set of patterns for data scientists to use to hit the ground running on any machine-learning-based data analysis task and virtually ensures that at least one approach will lead to better overall system behavior (accuracy, cost, robustness, performance, etc.) than using traditional analytic approaches only. Because the book is about meta-analytics, it must also cover general analytics well enough for the reader to engage with and comprehend the hybrid, or meta-, approaches. As such, the book aims to allow a relative novice to analytics to move to an elevated level of competency and fluency relatively quickly. It is also intended to challenge the data scientist to think more broadly and more thoroughly than they might otherwise be motivated to do.

    The target audience, therefore, consists of data scientists in all sectors—academia, industry, government, and NGOs. Because of the importance of statistical methods, data normalization, data visualization, and machine intelligence to the types of data science included in this book, the book has relevance to machine translation, robotics, biological and social sciences, medical and health-care informatics, economics, business, and finance. The analytic approaches covered herein can be applied to predictive algorithms for everyone from police departments (crime prediction) to sports analysts. The book is readily amenable to a graduate class on systems engineering, analytics, or data science, in addition to a course on machine intelligence. A subset of the book could be used for an advanced undergraduate class in intelligent systems.

    Predictive analytics have long held a fascination for people. Seeing the future has been associated with divinity, with magic, with the occult, or simply—and more in keeping with Occam’s razor—with enhanced intelligence. But is Occam’s razor, or the law of parsimony, applicable in the age of data science? It is no longer necessarily the best advice to say Numquam ponenda est pluralitas sine necessitate, or plurality is never to be posited without necessity, unless, of course, one uses goodness of fit to a model, output of sensitivity analysis, or least-squares estimation, among other quantitative artifacts, as proxies for necessity. The concept of predictive analytics, used at the galactic level and extending many thousands of years into the future, is the basis of the Foundation trilogy by Isaac Asimov, written in the middle of the 20th century. Futurist—or should we say mathematician?—Hari Seldon particularized the science of psychohistory, which presumably incorporated an extremely multivariate analysis intended to remove as much uncertainty from the future as possible for those privy to his output. Perhaps, the only prediction he was unable to make was the randomness of the personality of the Mule, an überintelligent, übermanipulative leader of the future. However, his ability to estimate the future in probabilistic terms led to the (correct) prediction of the collapse of the Galactic Empire and so included a manual to abbreviate the millennia of chaos expected to follow. In other words, he may have foreseen not the specific randomness of the Mule, but constructed his psychohistory to be optimally robust to the unforeseen. That is, Hari Seldon performed preflight sensitivity analysis of his predictive model. Kudos to Asimov for anticipating the value of analytics in the future. But even more so, kudos for anticipating that the law of parsimony would be insufficient to address the needs of a predictive analytic system to be insensitive to such unpredictable random artifacts (people, places, and things). The need to provide for the simplest model reasonable—that is, the law of model parsimony—remains. However, it is evident that hybrid systems, affording simplicity where possible but able to handle much more complexity where appropriate, are more robust than either extreme and ultimately will remain relevant longer in real-world applications.

    This book is, consequently, important precisely because of the value provided by both the Williams of Ockham and the Hari Seldons. The real world is dynamic and ever-changing, and predictive models must be preadapted to change in the assumptions that underpin them, including but not limited to the drift in data from that used to train the model; changes in the measurement system, including sampling, filtering, transduction, and compression; and changes in the interactions between the system being modeled and measured and the larger environment around it. I hope that the approaches revisited, introduced, and/or elaborated in this book will aid data scientists in their tasks while also bringing non-data scientists to sufficient data fluency to be able to interact intelligently with the world of data. One thing is certain—unlike Hari Seldon’s Galactic Empire, the world of data is not about to crumble. It is getting stronger—for good and for bad—every day.

    1.3 Organization of the book

    This first chapter is the critical chapter for the entire book and intentionally takes on a disproportionate length compared with the other chapters, as this book is meant to stand on its own, allowing the student, data enthusiast, and even data professional to use it as a single source to proceed from unstructured data to fully tagged, clustered, and classified data. This chapter also provides background on the statistics, machine learning, and artificial intelligence needed for analytics and meta-analytics.

    Additional chapters, then, elaborate further on what analytics provide. In Chapter 2, the value of training data is thoroughly investigated, and the assumptions around the long-standing training, validation, and testing process are revisited. In Chapter 3, experimental design—from bias and normalization to the treatment of data experiments as systems of data—is considered. In Chapter 4, meta-analytic approaches are introduced, with the primary focus being on cumulative gain, or lift, curves. Chapters 5–10 focus on other key aspects of systems around analytics, including the broad but very approachable field of sensitivity analysis (Chapter 5); the powerful family or platform of patterns for analytics loosely described as predictive selection (Chapter 6); a consideration of models, model fitting, and how to design models to be more robust to their environment (Chapter 7); additional analytic design patterns (Chapter 8); the recursive use of analytics to explore the efficacy of employed analytics (Chapter 9); and optimization of analytic system design (Chapter 10), which is a natural follow-on to Chapter 9. Chapter 11 is used to show how optimized system designs not only provide a better buffer to unanticipated random artifacts (these are called aleatory techniques here) but also do a better job of ingesting domain expertise from decidedly nonrandom artifacts, that is, from domain experts and requirements. In Chapters 12–13, the analytic approaches introduced in the preceding chapters are applied to specific technical fields (Chapter 12) and to some broader fields (Chapter 13). In Chapter 14, the contributions of this book are discussed in a larger context, and the future of data in the age of data is described.

    A note on what is meant by meta-analytics is worth providing. Essentially, meta-analysis has two broad fields of study/application:

    1. Meta- in the sense of meta-algorithmics, where we are combining two or more analytic techniques (algorithms, processes, services, systems, etc.) to obtain improved analytic output.

    2. Meta- in the sense of being outside, additional, and augmentative to pure analytics, which includes fields such as testing, ground truthing, training, and sensitivity analysis and optimization of system design.

    With this perspective, analytics is more than just simply machine learning: it is also learning in the correct order. It is not only knowledge extraction but also extraction of knowledge in the correct order. It is not only creating information but also creating information in the correct order. This means that analytics is more than simple descriptive or quantitative information. It is meant to extract and tell a story about the data that someone skilled in the field would be able to provide, including modifying the analysis in light of changing data and context for the data.
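    To make sense (1) above concrete, the short sketch below combines the outputs of two hypothetical classifiers by weighted voting; the classifiers, labels, and weights are invented for illustration and are not taken from the book.

```python
# Minimal sketch of sense (1): combining two analytic techniques by weighted voting.
# The two "classifiers" here are hypothetical stand-ins, not methods from the book.
import numpy as np

def weighted_vote(predictions_a, predictions_b, weight_a=0.6, weight_b=0.4, n_classes=3):
    """Combine two classifiers' hard predictions into a consensus prediction."""
    combined = []
    for pa, pb in zip(predictions_a, predictions_b):
        scores = np.zeros(n_classes)
        scores[pa] += weight_a          # each classifier votes with its weight
        scores[pb] += weight_b
        combined.append(int(np.argmax(scores)))
    return combined

# Hypothetical outputs of two classifiers on five samples (class labels 0-2)
clf_a = [0, 1, 2, 1, 0]
clf_b = [0, 2, 2, 1, 1]
print(weighted_vote(clf_a, clf_b))      # consensus labels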

    1.4 Informatics

    Occasionally, data science will be used interchangeably with the term informatics. Informatics, however, is a branch of information engineering/science/systems concerned with the impact of data on humans (and presumably the impact of humans on data!). Informatics is concerned with the interaction between humans and relevant information, particularly in how humans process information digitally. Thus, an important aspect of informatics is the study of the social implications of information technologies. From this broad perspective, then, analytics gathered to determine how digital technologies affect humans [Carr11] are an important part of informatics.

    In this book, informatics will only be addressed peripherally, that is, as an integrated part of the examples, which are instead focused on the algorithmic, process, or system approach to generating information from a data set. This does not mean we are allowed to operate in a vacuum as data scientists; rather, it simply means that this book will not have as a general concern the specific manner in which data are presented, nor the software with which the data are processed, etc.

    1.5 Statistics for analytics

    In this section, a quick summary (and, for many readers, a high-level recapitulation) of statistics relevant to data science is given. The main topics covered will be value (mean and estimate), variability, degrees of freedom, analysis of variance, and the relationship of these statistics to information and inferences that can be drawn from the data.

    1.5.1 Value and variance

    The value is an individual datum, typically binary, numerical, alphanumeric, or a word, depending on the data-type definition. The first-order descriptor of a plurality of values is the mean, μ, which is distinctly different from the average:

    \mu = \frac{1}{n}\sum_{i=1}^{n} x_i    (1.1)

    For example, the average income, house price, or cost of goods is generally given as the median, not the mean. The average day that the trash collector comes is usually the mode, not the mean. But in most analytics—that is, in parametric analytics—the mean is our average of choice. In nonparametric statistics, the median is often of concern, since the ranked order of values is important. On still other occasions, the mean does not need to be computed but is instead a specification that a system is required to meet, for example, miles per gallon, cycles before failure, or bends before fatigue. In these cases, a single type of event is monitored, its mean is calculated, and this mean is compared with the specification.
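    A minimal sketch of the three "averages" discussed above, using a small, hypothetical, right-skewed income sample; the numbers are invented for illustration.

```python
# Sketch: the three common "averages" on a small, skewed, hypothetical income sample.
from statistics import mean, median, mode

incomes = [32000, 38000, 41000, 41000, 45000, 52000, 250000]   # one large outlier

print("mean  :", mean(incomes))     # ~71,286 -- pulled upward by the outlier
print("median:", median(incomes))   # 41,000 -- the 'average income' usually reported
print("mode  :", mode(incomes))     # 41,000 -- most frequent value
```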

    Of course, two populations can share the same mean and still be quite different. This is because most populations (and all nontheoretical populations) have variability around the mean. The second moment of the distribution is the variance, usually denoted by σ², whose square root, the standard deviation σ, defined in Eq. (1.2), is an important characterizing datum of a distribution:

    \sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \mu)^2}    (1.2)

    For a Gaussian, or normal, distribution, roughly 68% of the samples fall within the range {μ − σ, μ + σ}. Note in Eq. (1.2) that the degrees of freedom, or df for short, are equal to (number of samples)-1. This is intuitive since you can only choose the first (number of samples)-1 samples and then the last one is already determined. Degrees of freedom are always important in statistical analyses, since confidence in the result is directly related to the number of times a result has been repeated. While confidence is not a quantitative statistical measure (though confidence intervals are!), generally, confidence increases with degrees of freedom and inversely with variability. The highest possible confidence, then, comes when you repeat the exact same result many, many times.
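    The following sketch, on simulated Gaussian data, computes the sample standard deviation of Eq. (1.2) with (n − 1) degrees of freedom and checks the roughly 68% coverage of μ ± σ.

```python
# Sketch of Eq. (1.2): sample standard deviation with (n - 1) degrees of freedom,
# and the ~68% coverage of mu +/- sigma for a Gaussian sample (values are simulated).
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=10.0, scale=2.0, size=10_000)   # hypothetical Gaussian data

mu = x.mean()
sigma = x.std(ddof=1)                              # ddof=1 -> divide by (n - 1)

within_one_sigma = np.mean((x > mu - sigma) & (x < mu + sigma))
print(f"mu = {mu:.3f}, sigma = {sigma:.3f}")
print(f"fraction within one sigma: {within_one_sigma:.3f}")   # roughly 0.68
```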

    It is usually quite important to distinguish between comparing means and comparing variances. For example, this distinguishes between weather and climate: if, in a locale, the mean temperature is the same but the variance increases significantly over time, then the mean weather does not change, but the climate does. Similarly, higher variability in a genome more likely leads to new speciation than lower variability.

    Another example comes from an engine used for transportation or for hauling materials. The modal and median engine revolutions per minute (RPM), when measured over a day or even over a driving/on-cycle session, may be well within the safety range. But this does not account for the variability. In some short driving sessions, the standard deviation may be as high as the mean, and so a more important measure might be the percent of time spent above a given value, which may be, for example, 1.2 standard deviations above the mean. Here, the nature of the distribution (the shape of the variance) is far more important than the mean. As a general rule, for nonnegative data sets, whenever μ ≤ σ, what you are measuring requires further elaboration to be useful from an analytic viewpoint.
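    The sketch below illustrates the engine-RPM point with a simulated (not measured) RPM trace: the fraction of samples above μ + 1.2σ, plus a flag for the μ ≤ σ rule of thumb.

```python
# Sketch of the engine-RPM point: when the distribution is wide, the share of time
# spent above a threshold (here mu + 1.2*sigma) is more informative than the mean.
# The RPM trace below is simulated, not measured data.
import numpy as np

rng = np.random.default_rng(0)
rpm = np.concatenate([
    rng.normal(2200, 300, 5000),      # ordinary driving
    rng.normal(4500, 600, 500),       # brief high-load bursts
])

mu, sigma = rpm.mean(), rpm.std(ddof=1)
threshold = mu + 1.2 * sigma
frac_above = np.mean(rpm > threshold)

print(f"mean = {mu:.0f} RPM, sigma = {sigma:.0f} RPM")
print(f"fraction of samples above mu + 1.2*sigma: {frac_above:.3f}")
if mu <= sigma:
    print("mu <= sigma: the raw measure likely needs further elaboration")
```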

    1.5.2 Sample and population tests

    This type of confidence directly factors in when we consider the first quantitative measurement for determining whether a sample belongs to a given population. This measure, the z-score, is given in Eq. (1.3), where we see that the numerator is the difference between the sample value, x, and the mean of the population, μ. The denominator is the standard deviation, σ, divided by the square root of the number of samples being compared with the population (which is effectively the degrees of freedom for comparing the sample x to the population having n samples):

    z = \frac{x - \mu}{\sigma / \sqrt{n}}    (1.3)

    Note that the value of z can be positive or negative depending on whether x is greater or less than the mean of the population. The z-score is used to decide with a given level of confidence that a sample does not come from a population. As such, the absolute value of the z-score in Eq. (1.3) is typically our concern. Table 1.1 provides a few of the most important probabilities and their corresponding z-scores. Two-tailed probability means that we do not know beforehand (a priori) whether a sample is being tested to be above or below the mean of the population; one-tailed probability means that we are a priori testing in a single direction from the mean. For example, a two-tailed test might be it’s not a normal temperature for this day of the year, while a one-tailed test might be it’s warmer than usual for this day of the year. In general, from a conservative statistical standpoint, it is better to use a two-tailed test than a one-tailed test unless you already have a hypothesis, model, or regulation guiding your comparison. You are less likely to have false positives for declaring a sample statistically significantly different from a population this way. Note that the probability of a one-tailed test is halfway to 100% from that of a two-tailed test. Thus, for z = 1.96, we are 95% certain that a sample did not come from a specific population, and we are 97.5% certain that it comes from a second population with a higher mean value if z = 1.96 (and not −1.96). This makes sense, because we are effectively getting another 50% probability correct if the sign of the calculated z-value is correct. In this case, had z been −1.96, we would not be able to support our hypothesis, since the direction from the mean of the population of size n to which we compare the sample contradicts our hypothesis. (See Table 1.1.)

    Table 1.1

    The probability is not used to establish whether a sample belongs to a population; rather, it provides the probability that a single sample was not drawn from the population having mean μ and standard deviation σ per Eq. (1.3)
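    A small sketch of Eq. (1.3) and the tailedness discussion above: it converts a z-score into one- and two-tailed probabilities using the standard normal survival function; the sample values are hypothetical.

```python
# Sketch of Eq. (1.3) and Table 1.1: convert a z-score into one- and two-tailed
# probabilities with the standard normal survival function.
from scipy.stats import norm

def z_score(x, mu, sigma, n=1):
    return (x - mu) / (sigma / n ** 0.5)

z = z_score(x=74.0, mu=70.0, sigma=8.0, n=16)     # hypothetical numbers
p_one_tailed = norm.sf(abs(z))                    # P(Z > |z|)
p_two_tailed = 2 * norm.sf(abs(z))                # both tails

print(f"z = {z:.2f}")                             # 2.00
print(f"one-tailed p = {p_one_tailed:.4f}")       # ~0.0228
print(f"two-tailed p = {p_two_tailed:.4f}")       # ~0.0455
```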

    Eq. (1.3) relies on some assumptions that are worth discussing, as there are several factors that affect the z-score in addition to the degrees of freedom. The first is the possibility of non-Gaussian (nonnormal) behavior of the population with which the sample is compared (and the population from which the sample actually comes, although we may have no way of knowing/estimating this population yet). When we consider third- and fourth-order moments such as skew and kurtosis, we may uncover non-Gaussian behavior such as left skew (long tail left), right skew (long tail right), bimodality (two clusters of data, implying that the population represents two subpopulations with different attributes), and other non-Gaussian behaviors (e.g., exponential, uniform, logistic, Poisson, and symmetrical distributions). These distribution deviations from assumed Gaussian behavior impact the interpretation of the z-score (generally undermining the p-value, or probability). Secondly, a temporal drift in the samples belonging to the population will undermine the z-score, since the sample may be compared with data that are no longer relevant. For this reason, the population and sample to compare should be time (and other experimental factor) matched whenever possible. Thirdly, an imbalanced training set or population sample bias will impact the z-score. If the population is meant to cover a specific range of input and does not, it can introduce distribution deviation and/or temporal drift or hide the same.
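    As a quick check on the Gaussian assumption discussed above, the sketch below computes skew and excess kurtosis for a simulated Gaussian sample and a simulated right-skewed (exponential) sample.

```python
# Sketch: checking the Gaussian assumption behind the z-score by looking at the third
# and fourth moments (skew and excess kurtosis) of simulated samples.
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(5)
gaussian_like = rng.normal(0.0, 1.0, 5000)
right_skewed = rng.exponential(1.0, 5000)        # long tail to the right

for name, data in [("gaussian-like", gaussian_like), ("right-skewed", right_skewed)]:
    print(f"{name:13s} skew = {skew(data):+.2f}, excess kurtosis = {kurtosis(data):+.2f}")
```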

    In practice, z-scores are very important for process control and for identifying outliers. A brief example is given here. Suppose we represent a surface-based forensic, such as you might get using a high-resolution imager [Sims10] and image analysis that compares the actual postprinting or postmanufacturing micron-scale surface texture with that of a model [Poll10]. The so-called forensic signature (derived from the variations in electromagnetic spectrum, ultrasound, or other salient physical property) of the surface is represented as a bitstream, with 1024 bits in the string. When a new image is captured, its binary surface detail string is compared with that of the candidate (matched) sample and with the population of (unmatched) samples. The Hamming distance to the population of unmatched samples has an expected value of 512 bits (i.e., with random guessing, precisely 50% of the bits should match, and the other 50% should be in error). In our test of binary string descriptors for a large set of surfaces, we obtained a mean Hamming distance to unmatched samples of 509.7 (very close to the expected value of 512), with a standard deviation of 31.6. The number of test samples in the population is 100. Next, we measure a value, 319.4, for the Hamming distance between the new image and a surface that we wish to prove is authentic with a forensically relevant probability (typically p = 10⁻⁹, meaning there is one chance in a billion of a false-positive match). Plugging into Eq. (1.3), we get Eq. (1.4):

    z = \frac{319.4 - 509.7}{31.6/\sqrt{1}} = \frac{-190.3}{31.6} \approx -6.02    (1.4)

    So, z = −6.02. Note that we use n = 1 (not n = 100, which is the number of samples used to determine the population mean and standard deviation) here, since it is the number of samples that we are comparing with the population. Since z = −5.997932 corresponds to p = 10⁻⁹, we have (just barely!) forensic authentication (p < 10⁻⁹).
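    The sketch below reproduces the arithmetic of Eq. (1.4) using the values quoted above (509.7, 31.6, 319.4, n = 1) and compares the resulting one-tailed probability with the forensic threshold of 10⁻⁹.

```python
# Sketch of the forensic example around Eq. (1.4): the measured Hamming distance to the
# candidate (319.4) versus the unmatched-population statistics (509.7 +/- 31.6), with
# n = 1 sample being compared and a forensic threshold of p < 1e-9.
from scipy.stats import norm

mu_unmatched = 509.7       # mean Hamming distance to unmatched surfaces
sigma_unmatched = 31.6     # standard deviation of that population
x_candidate = 319.4        # Hamming distance to the candidate surface
n = 1                      # one sample is compared with the population

z = (x_candidate - mu_unmatched) / (sigma_unmatched / n ** 0.5)
p = norm.sf(abs(z))        # one-tailed probability of such an extreme distance

print(f"z = {z:.2f}")                      # approximately -6.02
print(f"p = {p:.2e}")                      # just under 1e-9
print("forensic authentication" if p < 1e-9 else "not authenticated")
```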

    Even though there is a term for n, the number of samples, in the z-score, when the number of samples in a second population increases, we generally employ another statistical test for comparing two populations. This test, the t-test, is given by Eq. (1.5):

      

    t = \frac{\mu_1 - \mu_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}    (1.5)

    In the t-test statistic, the means of the two populations are denoted by the symbol μ, the standard deviations by the symbol σ, and the number in each sample by the symbol n (each with the appropriate numerical subscript). The overall degrees of freedom (df) for the comparison are n1 + n2 − 2 (this is needed when looking up the corresponding probability, or p, value from a t-table). The −2 reflects the one degree of freedom lost for each of the two populations. Statistical significance for one-tailed and two-tailed comparisons is determined as for z-values. Generally, t-tables, whether online or in a text, require three inputs: df, t-score, and tailedness (1 or 2). For example, for df = 11, a two-tailed p = 0.01 requires |t| > 3.106.
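    A brief sketch of Eq. (1.5) on two simulated samples, using df = n1 + n2 − 2 as described above; the data are invented for illustration.

```python
# Sketch of Eq. (1.5): two-population t-test with df = n1 + n2 - 2 as described above.
# The two samples are simulated stand-ins for real populations.
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(7)
pop1 = rng.normal(100.0, 10.0, 30)
pop2 = rng.normal(106.0, 12.0, 25)

mu1, mu2 = pop1.mean(), pop2.mean()
s1, s2 = pop1.std(ddof=1), pop2.std(ddof=1)
n1, n2 = len(pop1), len(pop2)

t_score = (mu1 - mu2) / np.sqrt(s1**2 / n1 + s2**2 / n2)
df = n1 + n2 - 2
p_two_tailed = 2 * t_dist.sf(abs(t_score), df)

print(f"t = {t_score:.3f}, df = {df}, two-tailed p = {p_two_tailed:.4f}")
```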

    Next, we consider what happens when there are several populations to compare simultaneously. In this case, we generally employ analysis of variance (or ANOVA), which is a collection of statistical models and their associated procedures (such as variation among and between groups) that are used to analyze the differences among group means. As with many other statistical approaches, ANOVA was originally developed for quantitative biological applications. A convenient means of calculating the necessary elements of an ANOVA is the tabular arrangement shown in Table 1.2. Here, a particular variable’s variance (sum squared variability about its mean) is partitioned into components attributed to the different sources of the variation (usually from within the groups or from between the groups). Groups can be clusters, classes, or other labeled sets. ANOVA provides a statistical test for whether the means of several groups are equal, providing a logical extension of the z-score (one dimension) to the t-test (two dimensions) to the comparing (testing) of three or more means for statistical significance.

    Table 1.2

    See text for details.

    As shown in Table 1.2, the sums of squares (around the means) between groups and within groups are calculated. Dividing these by the degrees of freedom gives us the mean squared variance (akin to mean squared error), and the ratio of mean squared error between and within groups gives us an F-score (named for Fisher, who was the first to systematize the ANOVA) to test if there are groups statistically significantly different from each other. High ratios of between-group to within-group variance are the basis of clustering, segmentation, and optimized partitioning. Thus, the F-score used for statistical analysis with the ANOVA is confluent with the aggregation approaches used for clustering.
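    The sketch below works through the Table 1.2 arithmetic for three small hypothetical groups (sums of squares between and within groups, degrees of freedom, mean squares, and the F-score) and checks the result against scipy.stats.f_oneway.

```python
# Sketch of the one-way ANOVA arithmetic behind Table 1.2: sums of squares between and
# within groups, their degrees of freedom, mean squares, and the resulting F-score.
# Group values are hypothetical.
import numpy as np
from scipy.stats import f_oneway

groups = [
    np.array([5.1, 4.9, 5.4, 5.0, 5.2]),
    np.array([5.8, 6.1, 5.9, 6.3, 6.0]),
    np.array([5.0, 5.3, 5.1, 4.8, 5.2]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

ms_between = ss_between / df_between
ms_within = ss_within / df_within
F = ms_between / ms_within

print(f"F (by hand)     = {F:.3f}")
print(f"F (scipy check) = {f_oneway(*groups).statistic:.3f}")   # should agree
```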

    Additional calculations may be required for follow-on tests that determine the statistically significant differences between the groups, such as the Tukey; Student-Newman-Keuls (SNK); Fisher’s least significant difference (LSD); and Dunnett, Holm, Bonferroni, or Duncan’s multiple range test (MRT) [Ott08]. A variety of follow-on tests allow the statistician to trade off between false positives and false negatives. For example, Duncan’s MRT rank orders the clusters and compares each cluster pair with a critical value determined from a studentized range distribution. This has greater statistical power than the SNK but results in, statistically, more false positives. Tukey’s test is based on the z-test and is functionally akin to pairwise z-tests. The SNK test modifies Tukey’s test to have a more relaxed difference for more closely ranked samples, providing a bias toward false positives for closely ranked samples and the same bias toward false negatives for less closely ranked samples.

    1.5.3 Regression and estimation

    Regression techniques [Hast09] are used to provide predictive output for input across a broad range of values. There are many flavors of regression, including the familiar linear, polynomial, and logistic regressions that match curve descriptors for the relationship between independent (covariate) and dependent variables. Ridge regression, which is also known as weight decay, adds a regularization term that effectively acts like a Lagrange multiplier to incorporate one or more constraints to a regression equation. The least absolute shrinkage and selection operator (lasso) regression and stepwise selection perform both feature selection (dimensionality reduction, in which only a subset of the provided covariates are used in the final model, rather than the complete set of them) and regularization (which allows the regression to avoid overfitting by introducing, for example, interpolated information). Advanced forms of lasso alter the coefficients of the regression rather than setting some to zero as in stepwise selection. Finally, the elastic net adds penalty terms to extend lasso and provides a combination of lasso and ridge functionality.
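    As a minimal illustration of the flavors named above, the sketch below fits ordinary least squares, ridge, lasso, and elastic net models from scikit-learn to simulated data; the alpha values are arbitrary and the data are not from the book.

```python
# Sketch of the regression flavors named above, using scikit-learn's implementations
# (Ridge, Lasso, ElasticNet) on simulated data; alpha values are arbitrary.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
true_coef = np.array([3.0, 0.0, -2.0, 0.0, 1.5])          # two irrelevant covariates
y = X @ true_coef + rng.normal(scale=0.5, size=200)

for model in (LinearRegression(),
              Ridge(alpha=1.0),                            # shrinks coefficients (weight decay)
              Lasso(alpha=0.1),                            # can set some coefficients exactly to zero
              ElasticNet(alpha=0.1, l1_ratio=0.5)):        # blends ridge and lasso
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))
```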

    In this section, important aspects of regression for prediction—in particular, the sensitivity of the estimation—will be discussed using linear and logistic regression as the exemplars. Figs. 1.1 and 1.2 provide a simple linear and a simple logistic curve, respectively, along with the sample points from which each curve was defined. For linear regression, the line of best fit is described by Eq. (1.6):

    y = b_0 + b_1 x    (1.6)

    Fig. 1.1 Example linear regression where the line of best fit for the filled circular points is indicated. The line is determined using least squared error as the cost function.

    Fig. 1.2 Example logistic regression where the logistic curve of best fit for the filled circular points is indicated. The curve is determined using least squared error as the cost function.

    For the logistic regression curve of Fig. 1.2, the relationship between the dependent and independent variables is given by Eq. (1.7):

    y = \frac{1}{1 + e^{-(b_0 + b_1 x)}}    (1.7)

    Once the regression curve (center curves in Figs. 1.3 and 1.4) is determined, the curve is subtracted from the observations, and the mean and standard deviation of the errors, | xi − μ |, are computed. The error bars shown in Figs. 1.3–1.6 are the 99% error bars, that is, 2.576 standard deviations above and below the regression curves.
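    The following sketch, on simulated points, mirrors the band construction described above: fit a line by least squares, compute the standard deviation of the residuals, and offset the fit by ±2.576 standard deviations.

```python
# Sketch of the 99% confidence band construction described above: fit a line by least
# squares, take the standard deviation of the residuals, and offset the fit by
# +/- 2.576 standard deviations. Data points are simulated.
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 40)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.size)     # hypothetical observations

b1, b0 = np.polyfit(x, y, deg=1)            # slope and intercept of the best-fit line
y_hat = b0 + b1 * x
residual_sigma = np.std(y - y_hat, ddof=2)  # 2 parameters estimated -> ddof = 2

upper = y_hat + 2.576 * residual_sigma      # 99% band, as in Figs. 1.3 and 1.4
lower = y_hat - 2.576 * residual_sigma

print(f"fit: y = {b0:.2f} + {b1:.2f} x, residual sigma = {residual_sigma:.2f}")
print(f"band half-width = {2.576 * residual_sigma:.2f}")
```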

    Fig. 1.3 Example linear regression of Fig. 1.1 with 99% confidence interval lines indicated. These are 2.576 standard deviations to either side of the regression line.

    Fig. 1.4 Example logistic regression of Fig. 1.2 with 99% confidence interval lines indicated. These are 2.576 standard deviations to either side of the regression curve.

    Fig. 1.5 Example linear regression of Fig. 1.1 with sensitivity lines indicated. See text for details.
