Perspectives in Biometrics
Ebook, 348 pages

About this ebook

Perspectives in Biometrics is a collection of articles that deals with the state of active and important research areas in the field of biometrics, as well as the methodological aspects of particular biometrical data analyses. The book reviews the statistical analysis of a large data base using interactive computing and data analysis facilities, as illustrated by the Albany Heart Study. One paper presents a survey of adaptive sampling techniques used in clinical trials, while another discusses computer-aided prognosis that can be useful in predicting the survival rate after the diagnosis and treatment of a serious disease. Another paper explains the use and interpretation of multivariate methods used in classifying the different stages encountered in infectious diseases of the critically ill. For example, the data bank in the Clinical Research Center—Acute is analyzed for a set of measurements that are then entered into a computer data base for later retrieval. The book also discusses "nonparametric estimation," which concerns estimates of distribution densities and cumulatives, as well as the use of "percentile points" to obtain decision rules in parametrization problems. The text can prove valuable for statisticians, students, and professors of calculus and advanced mathematics.
Language: English
Release date: Oct 22, 2013
ISBN: 9781483272252

    Book preview

    Perspectives in Biometrics - Robert M. Elashoff


    Preface

    In recent years there has been a growing consensus that the science of biometrics properly includes biomathematics, biostatistics, and biocomputer science with appropriate regard for the bio in each of these subjects. Progress and understanding in biometrics thus make use of methodologies, observations, and theories originating in many diverse fields such as artificial languages, control theory, differential equations, genetics, mathematical statistics, numerical analysis, operations research, pharmacology, and probability theory. An often-articulated concern among those specializing in one of the biometrical disciplines is the difficulty in assessing and assimilating recent research developments in other aspects of biometrics.

    This new series is devoted exclusively to publishing essentially two types of articles to aid biometricians: (1) critical reviews summarizing the present state of active and important research areas, and (2) methodological aspects of particular biometrical data analyses. The union of these two types of articles was by design. Traditionally, a series of publications with the professed goals of this series would limit itself to the critical review article. The second type of article was included since the overall appreciation and understanding of the fabric of biometrical methodology is obtained only after experiencing the iterative process of data analysis and inference. The data analyst acts in the roles of assessor, assimilator, innovator, and reviewer as he sails a zigzag course to complete his mission. Study of case histories of data analyses will lead the biometrician to additional important perspectives about particular methodologies.

    Each of the contributors has chosen the content, organization, and manner of presentation of the material most appropriate for himself. Contributors were urged to be critical and provide in-depth coverage for a general audience.

    The editors welcome suggestions for topics and writers.

    In this first volume, Schatzoff, Bryant, and Dempster (Chapter 1) discuss some ways interactive computing with data analysis facilities can be used in the statistical analysis of a large data base—the Albany Heart Study. The emphasis throughout focuses on how computer-based interactive data analysis using APL fits into a statistical analysis.

    In Chapter 2, Hoel, Sobel, and Weiss review the subject of adaptive clinical trials, to which these authors have made important and innovative contributions. Essentially, an adaptive trial arises whenever the data on the first (n − 1) patients are used to choose the treatment for the nth patient. Adaptive clinical trials are viewed by some as solving ethical problems in the conduct of clinical trials.

    Chapter 3, by Brunk, Thomas, Elashoff, and Zippin, compares several ways to predict prognosis following diagnosis and treatment for a disease. The comparisons are carried out using data obtained from an extensive investigation into breast cancer.

    In Chapter 4, Friedman, Goldwyn, and Siegel present an innovative use of multivariate methods to develop and validate prognostic stages for patients critically ill with infectious diseases. The authors’ approach is gaining currency in certain intensive care centers.

    Arvesen and Salsburg (Chapter 5) review some of the important roles of the jackknife in data analysis. In particular, the authors show how the jackknife can produce approximate tests and confidence intervals which are useful for many problems and under broad conditions.

    Chapter 6, by Tarter, gives a systematic exposition of how to estimate the inverse cumulative distribution function and thus provides an alternative way to estimate the hazard function and the age-specific death rate, and to choose a transformation. Tarter’s methods give us additional ways to look at some old problems, especially in the demographic area.

    1

    Interactive Statistical Computation with Large Data Structures

    MARTIN SCHATZOFF and PETER BRYANT, IBM Cambridge Scientific Center, Cambridge, Massachusetts

    ARTHUR P. DEMPSTER, Department of Statistics, Harvard University, Cambridge, Massachusetts

    Publisher Summary

    This chapter discusses some of the ways in which interactive computing and data analysis facilities can be used in the statistical analysis of a large database. It presents some data from the Albany Heart Study, with emphasis on how and where the interactive analysis fits rather than on the results of the analysis itself. The chapter also presents examples of exploring and editing data, formulating hypotheses to be investigated, and building up the necessary subprograms to examine those hypotheses. These are the areas in which interactive, online data analysis has marked advantages over the more conventional batch mode. The results of some further analyses are also given in the chapter.

    I. Introduction

    II. The Data and the Computing Facilities

    III. The Example

    A. Editing

    B. Exploring

    C. Exploring Continued

    D. Taking Stock

    IV. Summary of Further Analyses

    V. Comments on Interactive Data Analysis Systems

    References

    I INTRODUCTION

    We discuss in this chapter some of the ways in which interactive computing and data analysis facilities can be used in the statistical analysis of a large data base. We illustrate these ways with an example—a partial analysis of some data from the Albany Heart Study—but our emphasis is on how and where the interactive analysis fits rather than on the results of the analysis itself. More specifically, we go through an example of exploring and editing data, formulating hypotheses to be investigated, and building up the necessary subprograms to examine those hypotheses, since these are the areas in which interactive, on-line data analysis has marked advantages over the more conventional batch mode. We summarize the results of some further analyses.

    The partial analysis we use as an example was carried out using the APL(CMS) system (IBM, 1972) running under CP-67 (Seawright and Kelch, 1969), a virtual machine time-sharing system, at the IBM Cambridge Scientific Center. It was part of a study aimed at determining the system features required for effective on-line data analysis. We return at the end of this chapter to some of our conclusions in this respect.

    II THE DATA AND THE COMPUTING FACILITIES

    The data for our example come from the Albany Heart Study (Hilleboe et al., 1954). Over 2000 New York State male civil servants were examined at roughly 1-year intervals. Initial ages were 39–59. Blood pressure, cholesterol level, triglyceride level, hemoglobin count, age, height, weight, smoking patterns, and occupational data were recorded, as well as any diagnoses of heart or other diseases and hospitalizations (or deaths) between examinations. Up to 18 examinations are available for each individual. Of the population, 1909 males are considered eligible (by political and other nonmedical criteria) for study. The question of interest is: How are these variables related to the incidence of coronary heart disease? We are particularly interested in statistical methods that explicitly use the longitudinal aspects of the data.

    The APL(CMS) system used for our analysis is essentially the APL\360 system (IBM, 1970) modified by including some functions to permit access to CMS data files and to run in an individual virtual machine under CP-67. For interactive analysis with this system, we must restrict ourselves to those programs and data that can be held and manipulated in one workspace, an area of (conceptually local) storage. While the workspaces under APL(CMS) are perhaps five or ten times as large as those typically available with APL systems (because of the virtual storage facility of CP-67), they are by no means large enough to hold all the data from the Albany study. We therefore selected various subsets and samples of the individuals in the study, and stored them in individual workspaces that can be worked with separately. Figure 1 lists the various workspaces and the classes of individuals they contained. The workspaces were stored on read-only disc files, which could be accessed simultaneously by all of the participants in the project. Transferring and sharing other data and programs between individual users on the APL(CMS) system is very simple, so that programs or intermediate data produced by one member of the team can easily be made available to the others.

    Fig. 1 Individual workspaces and the classes of individuals they contain.

    Within each workspace, matrices H and X were created. Matrix H contains one row for each individual. Its columns are various items pertaining to that individual (birth date, cause of death, etc.). Each row of X contains the data for one examination on one of the individuals whose header is a row of H. The two matrices are related by their first columns, which contain the identification (ID) numbers assigned to the individuals. Figure 2 lists the variables that make up the columns of X and H. Each workspace also contains various statistical functions and programs, some of which we will illustrate.

    Fig. 2 Matrices H and X.
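    As a concrete illustration of this layout (a sketch only; the identification number 1234 and the names HDR and EXAMS are ours, not taken from the study), the header row and all of the examination rows for a single individual could be pulled out as follows:

          ⍝ a hypothetical identification number
          ID ← 1234
          ⍝ the single row of H whose first column matches ID
          HDR ← (H[;1]=ID)⌿H
          ⍝ every row of X belonging to that same individual
          EXAMS ← (X[;1]=ID)⌿X

    The shared first column is what makes this kind of row selection, and the logical-vector manipulations shown in the example below, so compact.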

    Using these subsets of the available data, we hoped to be able to explore the data interactively, and determine variables to include in subsequent analyses. We were looking particularly for variables and measures that showed some power in discriminating among those individuals who did and did not contract coronary heart disease. In the following example we carry out one such exploration, using workspace ALB10, containing data for 60 individuals who died of coronary heart disease (CHD) suddenly (defined as dying of CHD with no previous diagnosis of CHD), and for 60 individuals who had no reported CHD on any examination.

    III THE EXAMPLE

    A Editing

    We cannot really convey here a feel for the power of the conversational computer system and its use in data analysis. For that, one must sit down at a terminal and work through some examples. Failing that, though, it is perhaps worthwhile to have such a session described in narrative form.

    In this example, we take a workspace of data for individuals who died suddenly of CHD and for individuals who had no reported CHD at all. We are looking for variables that seem likely to be good predictors of CHD. Comparing any proposed variable on these two groups of people seems to be a reasonable way of obtaining a rough idea of how good a discriminator it is. From various studies it seems likely that systolic blood pressure will be among the best variables for this purpose, so we start by comparing the two groups on this variable.

    In Fig. 3, we begin by asking the system to load the workspace containing our data. The system response SAVED … indicates when the workspace was last saved onto permanent storage. In this and most of the following figures, requests and commands typed by the user are indented six spaces. The system’s responses start at the left margin (with certain exceptions, notably the plots, which are indented).

    Fig. 3 An example of APL language used in the study. See text for discussion.
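    For readers who have not used an APL system, the exchange just described looks roughly like this (a sketch only; the date and time that follow SAVED in the real session are omitted here):

          )LOAD ALB10
    SAVED  (date and time of the last save)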

    Our first task is to separate the data into the two groups of interest. We define (←) a logical vector (0’s and 1’s), HEARTIDS, in the fourth line of Fig. 3. Such a vector contains a 1 in each position where some condition is satisfied and a 0 where it is not. In this case, the condition is that the seventh column of H, the header matrix, should be equal to 4109 (the code for death by CHD). Thus, HEARTIDS is a vector of length equal to the number of rows in H, with a 1 in position i if H(i, 7) = 4109—i.e., if the ith individual died of CHD—and a 0 otherwise.
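    In APL notation the statement just described would read something like this (a sketch of the idea, not a reproduction of the line in Fig. 3):

          ⍝ 1 wherever column 7 of H (cause of death) equals 4109, 0 elsewhere
          HEARTIDS ← H[;7] = 4109
          ⍝ summing the logical vector counts the individuals who died of CHD
          +/HEARTIDS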

    Next, we create vectors of 1’s and 0’s indicating which rows of X correspond to examinations for those who died suddenly of CHD (HEARTEXAMS), and for those who had no CHD reported (HEALTHYEXAMS). To understand such APL statements, it sometimes helps to read from right to left. In the fifth line of Fig. 3, the vector HEARTIDS, created one line earlier, is used to compress (/) the first column of H, which contains the ID numbers. Compressing a vector V by a logical vector L means selecting those elements of V for which there is a 1 in the corresponding position of L. In our case, we end up with a vector of the ID’s of all those individuals who had a cause of death field equal to 4109. We then create the new logical vector HEARTEXAMS, of length equal to the number of rows of X, which contains a 1 in every position corresponding to an ID which is an element of the vector just created. In the last line, then, HEALTHYEXAMS is defined to be the complement (∼) of HEARTEXAMS. (Note that while using the APL language to describe and specify the various subsets of data of interest is simple, describing the APL statements themselves in English is complicated. We urge the reader to make the effort to understand logical vectors and their uses in selecting subsets, though. They serve an important function in interactive data analysis, and we use them further in the next steps of our analysis.)
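    Written out, the two statements described above might look as follows (again a sketch; the intermediate name CHDIDS and the use of the membership function ∊ are our choices):

          ⍝ the ID numbers of all individuals whose cause of death was coded 4109
          CHDIDS ← HEARTIDS/H[;1]
          ⍝ 1 for every row of X whose ID appears among the CHD IDs
          HEARTEXAMS ← X[;1] ∊ CHDIDS
          ⍝ the remaining examination rows belong to individuals with no reported CHD
          HEALTHYEXAMS ← ∼HEARTEXAMS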

    We now have to face the problem of missing observations and bad values in the data. At the time we obtained these data, we had only the roughest ideas about how missing observations were recorded, and no formal documentation on it. In Fig. 4, the first command asks the system for a histogram of the tenth column of X (systolic blood pressure). The 10 10 field refers to the size of the plot and the number of classes in the histogram. The resulting histogram seems to indicate a few bad values very low down and at least one somewhere near 1000. None of these makes much sense as a value for systolic blood pressure. The next command asks for a count of how many values of X(10) are greater than 300. It does this by creating a logical vector for the condition X(10) > 300 and then applying the reduction operator (+/) to it. This sums the elements of the vector, thus producing a count of the number of 1’s in it. The next command asks for the number of values that are less than 75. From the answers, we know that six values account for the unreasonableness in the histogram. Next we define a logical vector, BPOK, containing 1’s for those examinations with apparently good values on blood pressure—which on the basis of the histogram we have just defined as being between 75 and 300. We do this by taking the logical and (∧) of two logical vectors in an obvious
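    The screening steps just described can be sketched as follows (our reconstruction; whether the limits of 75 and 300 are treated as strict or inclusive bounds is not spelled out in the text):

          ⍝ systolic blood pressure, the tenth column of X
          BP ← X[;10]
          ⍝ count the implausibly high and the implausibly low readings
          +/BP>300
          +/BP<75
          ⍝ 1 for examinations whose blood pressure lies in the plausible range
          BPOK ← (BP≥75) ∧ (BP≤300)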
