Exploring Data Analysis: The Computer Revolution in Statistics
About this ebook

This title is part of UC Press's Voices Revived program, which commemorates University of California Press’s mission to seek out and cultivate the brightest minds and give them voice, reach, and impact. Drawing on a backlist dating to 1893, Voices Revived makes high-quality, peer-reviewed scholarship accessible once again using print-on-demand technology. This title was originally published in 1974.
Language: English
Release date: Dec 22, 2023
ISBN: 9780520338210

    Book preview

    Exploring Data Analysis

    The Computer Revolution In Statistics

    Edited by

    W. J. DIXON Department of Biomathematics University of California, Los Angeles

    and

    W. L. NICHOLSON Battelle Pacific Northwest Laboratories and

    National Bureau of Standards

    UNIVERSITY OF CALIFORNIA PRESS

    Berkeley Los Angeles London

    University of California Press Berkeley and Los Angeles, California

    University of California Press, Ltd.

    London, England

    ISBN: 0-520-02470-2

    Library of Congress Catalog Card Number: 73-78549 Copyright © 1974 by The Regents of the University of California

    Printed in the United States of America

    Contents

    Preface

    CHAPTER 1

    CHAPTER 2

    CHAPTER 3

    CHAPTER 4

    CHAPTER 5

    CHAPTER 6

    CHAPTER 7

    CHAPTER 8

    Citation Index

    Preface

    The genesis of this book was a conference on statistical computing, organized as a workshop, to examine the frontiers of data analysis based on computer use. It was held in the Health Sciences Computing Facility (HSCF) at the University of California at Los Angeles in September 1971.

    The original impetus for such a workshop came from discussions with Wesley Nicholson during an international meeting in London some years earlier. We were dismayed at the current ivory tower trends in statistics. Mimicking the mathematicians, statisticians were increasingly avoiding the real world of application, and were purifying and separating the field from other sciences. The conference was planned as a counterrevolution to that trend.

    The Health Sciences Computing Facility provided an excellent place for the workshop. The facility is dedicated to serving biomedical research through research in mathematics, statistics and computer science. It has an IBM 360/91 and numerous typewriter, character scope, and graphics consoles served by a time-sharing operating system. The system specializes in interactive statistical techniques and the programs to serve them. Of special importance to conference participants was the use of graphical statistical techniques.

    Participants were limited to a select group of practicing data analysts. The papers presented real problems and included a discussion of the physical mechanisms involved in generating data for the analyses. With a real problem as a focal point, the analyses pursued the needs of the problem rather than stressing particular techniques of statistics. But any new techniques useful for the analyses were emphasized, and the degree to which the derivation and use of the techniques was dependent on the computer was stressed.

    Each paper was available to several critics in advance of the meeting. Their comments are included in this volume as well as additional comments by the authors and other critics that developed during the sessions.

    The conference revealed many characteristics of a data analyst at work.

    In contrast to the biologist who examines his data with the constructs of his own field in mind, the data analyst examines the data for its apparent similarity to a variety of statistical models he has in mind, letting the results of successive analytical attempts guide the direction he pursues (and refines) as he proceeds. The statistician's approach might be described as one in which he states: "If we assume normality, independence, and perhaps other fundamentals, then the results indicate the validity of certain stated hypotheses with associated probabilities." In contrast, the data analyst may use many of the same techniques, but he will explore (also with statistical techniques) the degree to which these assumptions might be affecting his conclusions, and the consequences to the applicational field of deviations from reality in the analytical assumptions.

    The data analyst seems to be more involved in exploration than in refinement. He is slow to make assumptions before he examines the data. He is quite satisfied if any advance is made in the problem area independent of the sophistication of the analysis, the goodness of agreement of his model, or the presentability of the statistical analysis itself.

    He is quite prepared to find that one might arrive at the same conclusion using quite different routes and quite different techniques. The data analyst is almost sure to have a deep involvement in computers since he requires computing power for his freedom to use a wide variety of techniques.

    Techniques and analysts are not independent. They interact.

    One obtains a maximum result from interactions rather than from main effects. A particular person who uses certain techniques more powerfully than someone else may obtain better answers using those techniques than others can. On the other hand, another person may use his own techniques and do equally well; that is, there is an interaction in the process.

    Even when techniques are mathematically equivalent, different analysts use them in different ways. One may think and do analysis of variance, and another may think regression. They may be doing the same thing, but their thought processes and the way they proceed through the analysis of the problem differ because of the way they conceptualize analysis of variance and regression. Although the language may differ and even communication may be difficult, the overall analyses may really be very similar.

    By the end of the conference it was clear that there is a heavy interaction between analysts and scientists in other fields. In most cases, the analyst has become very involved with the subject matter of the field’s basic theories and problems. The statistical research for his data analysis is truly collaborative; in many cases he enlists the cooperation of other statisticians as well. The statistical analysis is not separated and pursued for its mathematical elegance; rather, it is oriented toward the needs of the problem.

    Perhaps this teamwork and cooperation is the most important and far-reaching revelation of the conference.

    A short definition of data analysis was proposed at the conference: Data analysis is the application of one or more techniques to a set of data steered by the problem.

    Computer facilities at HSCF were available to participants before and during the conference, and a UCLA "buddy" was assigned to each participant to help in any way necessary. Data presented at the conference is available from HSCF in machine readable form. A data set description containing at least a partial listing of the data from each paper is given in this book.

    The computational support was made possible by grant RR-3 of the Biotechnology Resources Branch, Division of Research and Resources of the National Institutes of Health. The conference itself was supported by grant GJ-29844 from the National Science Foundation.

    Acknowledgements are due several members of my staff for their help with the conference and in preparing material for this book. Ed Chen, Dolores Adams and Ellen Sommers assisted in preparations for and during the conference. Ellen Sommers prepared and edited the associated data sets. Lyda Boyer edited, and Betsy Potter typed the manuscripts.

    Much of the work of organizing the conference itself and of working with the authors on the preparation of their manuscripts was done by Wesley Nicholson.

    W. J. Dixon

    CHAPTER 1

    ADVANCED BREAST CANCER DATA

    JAMES DICKEY

    Statistics Department, State University of New York at Buffalo

    and

    JUDY WALRATH

    Department of Epidemiology and Public Health, Yale University

    The majority of medical data-analysis problems arise from a physician’s hope that his records of past cases will yield useful information. The real problems are mathematically vague, but tangible: What lessons are to be learned from past experience for future clinical practice? What patient subpopulations have distinctive behavior patterns? What treatments should be used in what kinds of cases?

    In the language of John Tukey (1962, 1970), these are problems of exploratory data analysis — problems of how to Find Interesting Reportable Effects (FIRE).

    FIRE problems, however, are not the subject of the bulk of statistical theory, which is devised for After The Revelation Orderly Pickling of HYpotheses (ATROPHY), and to Guard Against Silly Selection Effects by Definition (GASSED).

    Research for this study was supported by NIGMS-NIH Grant GM 16557.

    Linear discrimination procedures have not been very productive in real medical problems (Radhakrishna, 1964). Even the FIRE-problem-motivated stepwise linear procedures (regression and discrimination) deliver linear functions that tend to be almost meaningless as final answers to physicians and statisticians alike, especially linear functions of three or more variables. They may, however, be useful in pointing out the few important variables.

    In this paper we strive to concentrate on FIRE problems of clinical-experience data, with the aim of contributing to a general systematic approach involving the use of computer programs as steps in an analytic sequence. We discuss exploratory data analysis for an important class of problems — the prediction of a dichotomized treatment-response variable.

    Prof. Wilfrid J. Dixon’s (1969, 1970) BMD biomedical computer programs are widely used for practical data analysis. Contributions to a systematized approach, inspired by the BMD programs, are put forth here, together with a few rough predecessor FORTRAN language programs, and programs not yet available.

    In the following section we introduce, as concrete motivation, the well-studied (Armitage et al, 1969) advanced breast cancer data analysis, and the clinical-decision problem of Bulbrook et al (1960), and Atkins et al (1968). Each of the remaining sections describes a type of computer program:

    • First Look At Graphs (FLAG);

    • Subsample Histograms Or Plots (SHOP);

    • Shop In Full Totality (SIFT); and

    • a discussion of discriminant analysis per se, with an emphasis on recent nonparametric procedures.

    The typical medical data set features a few (1 to 10) response variables and many (10 to 100) mixed-type (dichotomous to practically continuous) predictor variables, for a precious few (10 to 1000) observed cases. Missing values abound. The definitions of individual variables are ambiguous and ill-conceived. The data embody histories of undisciplined clerks’ misunderstandings. In short, the statistics teacher’s nightmare: imperfect data and vague problems.

    We consider here a decision problem in the management of advanced breast cancer, and a related data set from Guy’s Hospital, London (Atkins et al, 1968), unusual for the painstaking care with which it was collected. This concrete data-analysis problem is put forth as representative of many in being suited to a general systematic approach.

    Two hundred and ten advanced breast cancer patients were included in the study. Approximately two-thirds (139/210) of them had undergone attempted cure by radical (116/210) or simple (23/210) mastectomy, and then a year or so later had a recurrence of tumor growth locally or at a distant site. The other one-third (71/210) had been first diagnosed as already advanced. Three-fifths (132/210) began the palliative stage of their treatment with the administration of hormones, which were useful in some cases (17/132) for up to one year in controlling tumor growth.

    Then it was a question of whether or not surgery should be used to alter the hormonal environment of the tumors. If so, which of two operations should be performed: bilateral adrenalectomy with oophorectomy (removal of all adrenals and ovaries), or hypophysectomy (removal of pituitary). Each patient underwent an operation, about half each kind (115/210, 95/210).

    For one-quarter of the patients (54/210), the surgery was successful (complete remission of symptoms for over six months); for another one-quarter (53/210), intermediate results (partial remission); and for the other half (103/210), failure (no improvement).

    Both surgical procedures are radical attempts to prolong life. Hypophysectomy is a more involved and dangerous operation, but its whole-sample remission percentages (28/95 and 24/95) were essentially the same as those for adrenalectomy (26/115 and 29/115).
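
    The counts quoted in the preceding paragraphs can be checked against one another. The following is a minimal sketch in Python, not part of the original analysis; the dictionary keys and function names are illustrative assumptions, and only the numbers come from the text.

```python
from fractions import Fraction

TOTAL = 210  # patients in the study, as stated in the text

# Counts quoted in the text, grouped by the partition they describe.
history = {"radical mastectomy": 116, "simple mastectomy": 23, "first seen as advanced": 71}
operation = {"adrenalectomy": 115, "hypophysectomy": 95}
response = {"success": 54, "intermediate": 53, "failure": 103}

# (success, intermediate) counts by operation, as quoted in the text.
remission = {"hypophysectomy": (28, 24), "adrenalectomy": (26, 29)}


def sums_to_total(counts, total=TOTAL):
    """Return True when the category counts add up to the stated total."""
    return sum(counts.values()) == total


def remission_rates(op):
    """Return (success rate, intermediate rate) for one operation as exact fractions."""
    success, intermediate = remission[op]
    n = operation[op]
    return Fraction(success, n), Fraction(intermediate, n)


if __name__ == "__main__":
    for name, counts in [("history", history), ("operation", operation), ("response", response)]:
        print(name, "sums to", TOTAL, ":", sums_to_total(counts))
    for op in remission:
        s, i = remission_rates(op)
        print(f"{op}: success {float(s):.1%}, intermediate {float(i):.1%}")
```

    Exact fractions are used so that the by-operation counts can be compared with the whole-sample counts without rounding.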

    Natural suggestions for variables related to surgical success include:

    1. measures of tumor growth rate

    a) age of patient

    b) extent of disease at mastectomy

    c) time from mastectomy to recurrence;

    2. tumor histology;

    3. menopausal status;

    4. history of mastectomy; and

    5. systemic (hence urinary) hormone levels.

    In 1960, Dr. R. D. Bulbrook and his coinvestigators at Guy’s Hospital developed a linear discriminant function of two 24-hour urinary-steroid levels, aetiocholanolone (E) and 17-hydroxycorticosteroid (17 OHCS),

    80 − 80(17 OHCS) + E,    (1)

    positive values of which tend to predict favorable response to surgery. After further prospective studies, Atkins et al (1968) reported that the discriminant function by itself provides an efficient guide to response to hypophysectomy but does not do so for adrenalectomy in this series. They also found small effects for the factors 1.c), 3., and 4. above.
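
    As a small illustration of how function (1) would be applied, the sketch below reads the dash in (1) as a minus sign and evaluates 80 - 80(17 OHCS) + E for a pair of steroid levels. The function names are hypothetical, and the units of the two measurements are not stated in this passage, so any input values are placeholders rather than clinical data.

```python
def discriminant(ohcs_17, aetiocholanolone):
    """Evaluate function (1): 80 - 80 * (17 OHCS) + E.

    Units are those of the original study and are not stated in this
    passage; callers are assumed to supply values in those units.
    """
    return 80 - 80 * ohcs_17 + aetiocholanolone


def predicts_favorable(ohcs_17, aetiocholanolone):
    """Positive values of (1) tend to predict favorable response to surgery."""
    return discriminant(ohcs_17, aetiocholanolone) > 0


if __name__ == "__main__":
    # Placeholder inputs, chosen only to show both signs of the function.
    for ohcs, e in [(0.5, 10.0), (2.0, 50.0)]:
        print(ohcs, e, discriminant(ohcs, e), predicts_favorable(ohcs, e))
```

    The sign of the result is only a tendency, not a rule: as the text notes, Atkins et al (1968) found the function a useful guide for hypophysectomy but not for adrenalectomy.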

    Armitage et al (1969) carried out extensive FIRE-like analyses of these same data. First, each of three response variables was dichotomized and fit by Hills’ (1967) stepwise sample-splitting discrimination procedure for dichotomized predictor variables. Then they performed special analyses, one suited to each original response variable.

    The response, a clinical assessment of success (as success, intermediate, and failure, defined above), was dichotomized into nonfailure and failure, and then related to various sets of predictor variables. Our discussion is restricted to this choice of a dichotomous response variable and to dichotomized responses in general, thus neglecting other important developments of methodology, for example, survival-time data.
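
    The dichotomization described above amounts to merging two of the three assessment categories. A minimal sketch follows, assuming the three category labels are stored as plain strings (an illustrative choice, not the coding used in the data set); the counts are those quoted earlier in the chapter.

```python
from collections import Counter

# "success" and "intermediate" are merged; "failure" stands alone.
NONFAILURE = {"success", "intermediate"}


def dichotomize(assessment):
    """Map a three-level clinical assessment onto nonfailure / failure."""
    if assessment in NONFAILURE:
        return "nonfailure"
    if assessment == "failure":
        return "failure"
    raise ValueError(f"unknown assessment: {assessment!r}")


# Rebuild the 210 assessments from the quoted counts (54, 53, 103).
assessments = ["success"] * 54 + ["intermediate"] * 53 + ["failure"] * 103
dichotomized = Counter(dichotomize(a) for a in assessments)

if __name__ == "__main__":
    print(dict(dichotomized))
```

    Unknown labels raise an error rather than defaulting to one class, so a miscoded assessment cannot slip silently into either group.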

    At the suggestion, and through the kindness, of Prof. Marvin Zelen, a card copy of the Armitage et al (1969) data was obtained from John Copas, and a slightly updated version of the original patient records (including 16 new cases) from Dr. R. D. Bulbrook. The updated records of all 210 cases are on file at HSCF under the title Advanced Breast Cancer Data (J. Dickey). A complete listing of the cancer data in card image form is given in the Data Set Description at the end of this chapter. This includes a description of the 50 variables associated with each patient, and, parenthetically, single word acronyms which identify variables.

    FIRST LOOK AT GRAPHS (FLAG)

    Newly punched data will, with high probability, contain mistaken values appearing as

    1. overpunches and illegal characters;

    2. data-to-format mismatches;

    3. nonsense values of a variable

    a) off-range numeric values

    b) meaningless multiple-choice values;

    4. nonsense combinations of variable values, e.g., autopsy date preceding date of death;

    5. multivariate outliers; and

    6. undetectable-per-se mistaken values.

    Computer program-processing systems tend to abort program runs when data input contains mistakes of types 1 and 2. Many data-analysis programs abort or deliver unacceptable output from input mistakes of type 3, and less commonly, of type 4.
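
    Mistakes of types 3 and 4 in the list above can be caught by simple per-record rules. The sketch below is an illustration in Python and is not the FLAG program itself; the variable names, legal ranges, and multiple-choice codes are hypothetical and would have to come from the data set description.

```python
from datetime import date

# Hypothetical validity rules; names and limits are illustrative only.
RANGES = {"age": (0, 120)}                         # type 3a: legal numeric range
CHOICES = {"menopausal_status": {"pre", "post"}}   # type 3b: legal choices


def flag_record(record):
    """Return a list of flag messages for one record (empty if none found)."""
    flags = []
    for name, (lo, hi) in RANGES.items():
        value = record.get(name)
        if value is not None and not (lo <= value <= hi):
            flags.append(f"type 3a: {name}={value} outside [{lo}, {hi}]")
    for name, allowed in CHOICES.items():
        value = record.get(name)
        if value is not None and value not in allowed:
            flags.append(f"type 3b: {name}={value!r} is not a defined choice")
    # Type 4: a nonsense combination, using the example given in the text.
    death, autopsy = record.get("death_date"), record.get("autopsy_date")
    if death is not None and autopsy is not None and autopsy < death:
        flags.append("type 4: autopsy date precedes date of death")
    return flags


if __name__ == "__main__":
    bad = {"age": 430, "menopausal_status": "x",
           "death_date": date(1965, 3, 2), "autopsy_date": date(1965, 3, 1)}
    for message in flag_record(bad):
        print(message)
```

    Missing values (None here) are skipped rather than flagged, since the text treats them as a separate and common feature of medical data. Types 5 and 6 need more than per-record rules and are not attempted in this sketch.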

    One of the functions of our computer program, FLAG (Goldman et al, 1971) is to detect, and identify by flagged output, mistaken
