Exploring Data Analysis: The Computer Revolution in Statistics
By W. J. Dixon
About this ebook
This title is part of UC Press's Voices Revived program, which commemorates University of California Press’s mission to seek out and cultivate the brightest minds and give them voice, reach, and impact. Drawing on a backlist dating to 1893, Voices Revived
Exploring Data Analysis
The Computer Revolution In Statistics
Edited by
W. J. DIXON
Department of Biomathematics
University of California, Los Angeles
and
W. L. NICHOLSON
Battelle Pacific Northwest Laboratories
and
National Bureau of Standards
UNIVERSITY OF CALIFORNIA PRESS
Berkeley Los Angeles London
University of California Press Berkeley and Los Angeles, California
University of California Press, Ltd.
London, England
ISBN: 0-520-02470-2
Library of Congress Catalog Card Number: 73-78549
Copyright © 1974 by The Regents of the University of California
Printed in the United States of America
Contents
Preface
CHAPTER 1
CHAPTER 2
CHAPTER 3
CHAPTER 4
CHAPTER 5
CHAPTER 6
CHAPTER 7
CHAPTER 8
Citation Index
Preface
The genesis of this book was a conference on statistical computing, organized as a workshop, to examine the frontiers of data analysis based on computer use. It was held in the Health Sciences Computing Facility (HSCF) at the University of California at Los Angeles in September 1971.
The original impetus for such a workshop came from discussions with Wesley Nicholson during an international meeting in London some years earlier. We were dismayed at the current ivory tower trends in statistics. Mimicking the mathematicians, statisticians were increasingly avoiding the real world of application, and were purifying and separating the field from other sciences. The conference was planned as a counterrevolution to that trend.
The Health Sciences Computing Facility provided an excellent place for the workshop. The facility is dedicated to serving biomedical research through research in mathematics, statistics and computer science. It has an IBM 360/91 and numerous typewriter, character scope, and graphics consoles served by a time-sharing operating system. The system specializes in interactive statistical techniques and the programs to serve them. Of special importance to conference participants was the use of graphical statistical techniques.
Participants were limited to a select group of practicing data analysts. The papers presented real problems and included a discussion of the physical mechanisms involved in generating data for the analyses. With a real problem as a focal point, the analyses pursued the needs of the problem rather than stressing particular techniques of statistics. But any new techniques useful for the analyses were emphasized, and the degree to which the derivation and use of the techniques were dependent on the computer was stressed.
Each paper was available to several critics in advance of the meeting. Their comments are included in this volume as well as additional comments by the authors and other critics that developed during the sessions.
The conference revealed many characteristics of a data analyst at work.
In contrast to the biologist who examines his data with the constructs of his own field in mind, the data analyst examines the data for its apparent similarity to a variety of statistical models he has in mind, letting the results of successive analytical attempts guide the direction he pursues (and refines) as he proceeds. The statistician's approach might be described as one in which he states: "if we assume normality, independence, and perhaps other fundamentals, then the results indicate the validity of certain stated hypotheses with associated probabilities." In contrast, the data analyst may use many of the same techniques, but he will explore (also with statistical techniques) the degree to which these assumptions might be affecting his conclusions, and the consequences to the applicational field of deviations from reality in the analytical assumptions.
The data analyst seems to be more involved in exploration than in refinement. He is slow to make assumptions before he examines the data. He is quite satisfied if any advance is made in the problem area independent of the sophistication of the analysis, the goodness of agreement of his model, or the presentability of the statistical analysis itself.
He is quite prepared to find that one might arrive at the same conclusion using quite different routes and quite different techniques. The data analyst is almost sure to have a deep involvement in computers since he requires computing power for his freedom to use a wide variety of techniques.
Techniques and analysts are not independent. They interact.
One obtains a maximum result from interactions rather than from main effects. A particular person who uses certain techniques more powerfully than someone else may obtain better answers using those techniques than others can. On the other hand, another person may use his own techniques and do equally well; that is, there is an interaction in the process.
Even when techniques are mathematically equivalent, different analysts use them in different ways. One may think and do analysis of variance, and another may think regression. They may be doing the same thing, but their thought processes and the way they proceed through the analysis of the problem differ because of the way they conceptualize analysis of variance and regression. Although the language may differ and even communication may be difficult, the overall analyses may really be very similar.
By the end of the conference it was clear that there is a heavy interaction between analysts and scientists in other fields. In most cases, the analyst has become very involved with the subject matter of the field’s basic theories and problems. The statistical research for his data analysis is truly collaborative; in many cases he enlists the cooperation of other statisticians as well. The statistical analysis is not separated and pursued for its mathematical elegance; rather, it is oriented toward the needs of the problem.
Perhaps this teamwork and cooperation is the most important and far-reaching revelation of the conference.
A short definition of data analysis was proposed at the conference: Data analysis is the application of one or more techniques to a set of data steered by the problem.
Computer facilities at HSCF were available to participants before and during the conference, and a UCLA "buddy" was assigned to each participant to help in any way necessary. Data presented at the conference are available from HSCF in machine-readable form. A data set description containing at least a partial listing of the data from each paper is given in this book.
The computational support was made possible by grant RR-3 of the Biotechnology Resources Branch, Division of Research and Resources of the National Institutes of Health. The conference itself was supported by grant GJ-29844 from the National Science Foundation.
Acknowledgements are due several members of my staff for their help with the conference and in preparing material for this book. Ed Chen, Dolores Adams and Ellen Sommers assisted in preparations for and during the conference. Ellen Sommers prepared and edited the associated data sets. Lyda Boyer edited, and Betsy Potter typed the manuscripts.
Much of the work of organizing the conference itself and of working with the authors on the preparation of their manuscripts was done by Wesley Nicholson.
W. J. Dixon
CHAPTER 1
ADVANCED BREAST CANCER DATA
JAMES DICKEY
Statistics Department, State University of New York at Buffalo
and
JUDY WALRATH
Department of Epidemiology and Public Health, Yale University
The majority of medical data-analysis problems arise from a physician’s hope that his records of past cases will yield useful information. The real problems are mathematically vague, but tangible: What lessons are to be learned from past experience for future clinical practice? What patient subpopulations have distinctive behavior patterns? What treatments should be used in what kinds of cases?
In the language of John Tukey (1962, 1970), these are problems of exploratory data analysis: problems of how to Find Interesting Reportable Effects (FIRE).
FIRE problems, however, are not the subject of the bulk of statistical theory, which is devised for After The Revelation Orderly Pickling of HYpotheses (ATROPHY), and to Guard Against Silly Selection Effects by Definition (GASSED).
Research for this study was supported by NIGMS-NIH Grant GM 16557.
Linear discrimination procedures have not been very productive in real medical problems (Radhakrishna, 1964). Even the FIRE-problem-motivated stepwise linear procedures (regression and discrimination) deliver linear functions that tend to be almost meaningless as final answers to physicians and statisticians alike, especially linear functions of three or more variables. They may, however, be useful in pointing out the few important variables.
In this paper we strive to concentrate on FIRE problems of clinical-experience data, with the aim of contributing to a general systematic approach involving the use of computer programs as steps in an analytic sequence. We discuss exploratory data analysis for an important class of problems — the prediction of a dichotomized treatment-response variable.
Prof. Wilfrid J. Dixon’s (1969, 1970) BMD biomedical computer programs are widely used for practical data analysis. Contributions to a systematized approach, inspired by the BMD programs, are put forth here, together with a few rough predecessor FORTRAN language programs, and programs not yet available.
In the following section we introduce, as concrete motivation, the well-studied (Armitage et al, 1969) advanced breast cancer data analysis, and the clinical-decision problem of Bulbrook et al (1960), and Atkins et al (1968). Each of the remaining sections describes a type of computer program:
• First Look At Graphs (FLAG);
• Subsample Histograms Or Plots (SHOP);
• Shop In Full Totality (SIFT); and
• a discussion of discriminant analysis per se, with an emphasis on recent nonparametric procedures.
The typical medical data set features a few (1–10) response variables and many (10–100) mixed-type (dichotomous to practically continuous) predictor variables, for a precious few (10–1000) observed cases. Missing values abound. The definitions of individual variables are ambiguous and ill-conceived. The data embody histories of undisciplined clerks’ misunderstandings. In short, the statistics teacher’s nightmare: imperfect data and vague problems.
We consider here a decision problem in the management of advanced breast cancer, and a related data set from Guy’s Hospital, London (Atkins et al, 1968), unusual for the painstaking care with which it was collected. This concrete data-analysis problem is put forth as representative of many in being suited to a general systematic approach.
Two hundred and ten advanced breast cancer patients were included in the study. Approximately two-thirds (139/210) of them had undergone attempted cure by radical (116/210) or simple (23/210) mastectomy, and then a year or so later had a recurrence of tumor growth locally or at a distant site. The other one-third (71/210) had been first diagnosed as already advanced. Three-fifths (132/210) began the palliative stage of their treatment with the administration of hormones, which were useful in some cases (17/132) for up to one year in controlling tumor growth.
Then it was a question of whether or not surgery should be used to alter the hormonal environment of the tumors. If so, which of two operations should be performed: bilateral adrenalectomy with oophorectomy (removal of all adrenals and ovaries), or hypophysectomy (removal of pituitary)? Each patient underwent an operation, about half each kind (115/210, 95/210).
For one-quarter of the patients (54/210), the surgery was successful (complete remission of symptoms for over six months); for another one-quarter (53/210), intermediate results (partial remission); and for the other half (103/210), failure (no improvement).
Both surgical procedures are radical attempts to prolong life. Hypophysectomy is a more involved and dangerous operation, but its whole-sample remission percentages (28/95 and 24/95) were essentially the same as those for adrenalectomy (26/115 and 29/115).
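The comparison of the two operations can be tabulated directly from the counts quoted above. The following is a minimal sketch, not part of the original analysis: the failure counts are not stated per operation in the text and are derived here by subtraction, and the names `counts` and `response_table` are illustrative only.

```python
# Counts quoted in the text for each operation. Failures per operation
# are NOT given in the text; they are derived below by subtraction.
counts = {
    "hypophysectomy": {"success": 28, "intermediate": 24, "total": 95},
    "adrenalectomy": {"success": 26, "intermediate": 29, "total": 115},
}


def response_table(counts):
    """Return per-operation counts and proportions, deriving failures."""
    table = {}
    for operation, c in counts.items():
        failure = c["total"] - c["success"] - c["intermediate"]
        table[operation] = {
            "success": c["success"],
            "intermediate": c["intermediate"],
            "failure": failure,
            "total": c["total"],
            "success_rate": c["success"] / c["total"],
            "intermediate_rate": c["intermediate"] / c["total"],
            "nonfailure_rate": (c["success"] + c["intermediate"]) / c["total"],
        }
    return table


table = response_table(counts)
for operation, row in table.items():
    print(
        f"{operation:15s} n={row['total']:3d} "
        f"success={row['success_rate']:.3f} "
        f"intermediate={row['intermediate_rate']:.3f} "
        f"failure={row['failure']}"
    )
```

The derived failure counts sum to the 103 failures reported for the whole sample, which is a useful consistency check on the quoted fractions.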
Natural suggestions for variables related to surgical success include:
1. measures of tumor growth rate
a) age of patient
b) extent of disease at mastectomy
c) time from mastectomy to recurrence;
2. tumor histology;
3. menopausal status;
4. history of mastectomy; and
5. systemic (hence urinary) hormone levels.
In 1960, Dr. R. D. Bulbrook and his coinvestigators at Guy’s Hospital developed a linear discriminant function of two 24-hour urinary-steroid levels, aetiocholanolone (E) and 17-hydroxycorticosteroid (17 OHCS),
$$80 - 80\,(17\ \mathrm{OHCS}) + E, \tag{1}$$
positive values of which tend to predict favorable response to surgery. After further prospective studies, Atkins et al (1968) reported that the discriminant function by itself provides an efficient guide to response to hypophysectomy but does not do so for adrenalectomy in this series.
They also found small effects for the factors 1.c), 3., and 4. above.
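As a small illustration of how discriminant function (1) would be applied to a single patient, the sketch below evaluates the score and the sign rule described above. The function names are hypothetical, the measurement units are assumed to be whatever units the original study used (the text here does not state them), and the input values in the usage lines are made up for illustration, not taken from the data.

```python
def discriminant_score(ohcs_17, aetiocholanolone):
    """Evaluate function (1): 80 - 80 * (17 OHCS) + E.

    Both arguments are 24-hour urinary steroid levels, in the units
    used by the original study (an assumption; not stated in the text).
    """
    return 80.0 - 80.0 * ohcs_17 + aetiocholanolone


def predicts_favorable(ohcs_17, aetiocholanolone):
    """Positive scores tend to predict a favorable response to surgery."""
    return discriminant_score(ohcs_17, aetiocholanolone) > 0


# Hypothetical patients, for illustration only.
print(discriminant_score(0.5, 10.0), predicts_favorable(0.5, 10.0))
print(discriminant_score(2.0, 10.0), predicts_favorable(2.0, 10.0))
```

Note that the rule is only a tendency: according to Atkins et al (1968), as cited above, it was a useful guide for hypophysectomy but not for adrenalectomy in this series.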
Armitage et al (1969) carried out extensive FIRE-like analyses of these same data. First, each of three response variables was dichotomized and fit by Hills’ (1967) stepwise sample-splitting discrimination procedure for dichotomized predictor variables. Then they performed special analyses, each suited to one of the original response variables.
The response, a clinical assessment of success (as success, intermediate, and failure, defined above), was dichotomized into nonfailure and failure, and then related to various sets of predictor variables. Our discussion is restricted to this choice of a dichotomous response variable and to dichotomized responses in general, thus neglecting other important developments of methodology, for example, survival-time data.
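The dichotomization just described can be written out in a few lines. This is only a sketch under the definitions given above: the three-level labels and the whole-sample counts come from the text, while the names `RESPONSE_COUNTS` and `dichotomize` are illustrative.

```python
from collections import Counter

# Whole-sample counts of the three-level clinical assessment (from the text).
RESPONSE_COUNTS = {"success": 54, "intermediate": 53, "failure": 103}


def dichotomize(response):
    """Collapse the three-level response into nonfailure versus failure."""
    if response not in RESPONSE_COUNTS:
        raise ValueError(f"unknown response level: {response!r}")
    return "failure" if response == "failure" else "nonfailure"


# Apply the mapping to the tabulated counts.
dichotomized = Counter()
for level, n in RESPONSE_COUNTS.items():
    dichotomized[dichotomize(level)] += n

print(dict(dichotomized))
```

With these counts the two classes are nearly balanced, 107 nonfailures against 103 failures.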
At the suggestion, and through the kindness, of Prof. Marvin Zelen, a card copy of the Armitage et al (1969) data was obtained from John Copas, and a slightly updated version of the original patient records (including 16 new cases) from Dr. R. D. Bulbrook. The updated records of all 210 cases are on file at HSCF under the title Advanced Breast Cancer Data (J. Dickey).
A complete listing of the cancer data in card image form is given in the Data Set Description at the end of this chapter. This includes a description of the 50 variables associated with each patient, and, parenthetically, single-word acronyms which identify variables.
FIRST LOOK AT GRAPHS (FLAG)
Newly punched data will, with high probability, contain mistaken values appearing as
1. overpunches and illegal characters;
2. data-to-format mismatches;
3. nonsense values of a variable
a) off-range numeric values
b) meaningless multiple-choice values;
4. nonsense combinations of variable values, e.g., autopsy date preceding date of death;
5. multivariate outliers; and
6. undetectable-per-se mistaken values.
Computer program-processing systems tend to abort program runs when data input contains mistakes of types 1 and 2. Many data-analysis programs abort or deliver unacceptable output from input mistakes of type 3, and less commonly, of type 4.
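Mistakes of types 3 and 4 in the list above lend themselves to simple per-record checks. The sketch below is a modern illustration of that idea, not the FLAG program itself: the field names, the numeric range, and the code set are hypothetical, and missing values (represented as `None`) are skipped rather than flagged.

```python
from datetime import date

# Hypothetical validation rules, for illustration only.
NUMERIC_RANGES = {"age": (0, 120)}        # type 3a: off-range numeric values
ALLOWED_CODES = {"operation": {"A", "H"}}  # type 3b: meaningless codes


def flag_record(record):
    """Return a list of human-readable flags for one patient record."""
    flags = []
    for field, (low, high) in NUMERIC_RANGES.items():
        value = record.get(field)
        if value is not None and not (low <= value <= high):
            flags.append(f"{field}: off-range value {value}")
    for field, allowed in ALLOWED_CODES.items():
        value = record.get(field)
        if value is not None and value not in allowed:
            flags.append(f"{field}: meaningless code {value!r}")
    # Type 4: nonsense combination, autopsy date preceding date of death.
    death = record.get("death_date")
    autopsy = record.get("autopsy_date")
    if death is not None and autopsy is not None and autopsy < death:
        flags.append("autopsy_date precedes death_date")
    return flags


# A made-up record containing one mistake of each kind.
suspect = {
    "age": 250,
    "operation": "X",
    "death_date": date(1965, 3, 10),
    "autopsy_date": date(1965, 3, 1),
}
for message in flag_record(suspect):
    print(message)
```

Reporting flags instead of aborting lets the whole data set be screened in one pass, so that all suspect values can be reviewed together.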
One of the functions of our computer program, FLAG (Goldman et al, 1971) is to detect, and identify by flagged output, mistaken