Statistics: Basic Principles and Applications

Ebook · 1,052 pages · 12 hours

About this ebook

" Our objective in this book is to present an exposition of basic principles of statistics along with some indication of applications which satisfies the following ten commandments:
The focus should be placed on a clear development of basic ideas and principles.
The exposition of these basic ideas and principles should be streamlined so as to avoid having the undergrowth get in the way of the statistical forest.
High priority should be given to the assumptions which underlie the application of statistical principles.
Understanding of abuses, misuses, and misunderstandings which have arisen from the application of statistics is essential for a correct understanding of statistics.
The coverage should provide students with sufficient preparation for continued study of intermediate and advanced level statistics or disciplines which use statistical methodology.
The exposition should be readable and understandable by students without sacrifice of mathematical accuracy.
The organization should clearly distinguish mainstream topics inherent in every basic level statistics course, irrespective of applied interests, from topics of special interest to particular audience segments.
The computation dimension should not be given equal billing with statistical principles and ideas. Statistics is the master and, important as it is, the computation tool is the servant.
Thought-provoking exercises - ones that exercise the little grey cells, as Hercule Poirot would put it - should be a prominent part of the exposition.
Exercise banks to help the student see statistics as a whole are important."
Language: English
Publisher: Xlibris US
Release date: Feb 3, 2009
ISBN: 9781469107318

Author

Ramune B. Adams

William J. Adams, Professor of Mathematics at Pace University, is a recipient of Pace’s Outstanding Teacher Award. He was Chairman of the Pace N.Y. Mathematics Department from 1976 through 1991. Professor Adams is author or co-author of over twenty books on mathematics, its applications, and history, including Elements of Linear Programming (1969), Calculus for Business and Social Science (1975), Fundamentals of Mathematics for Business, Social and Life Sciences (1979), Elements of Complex Analysis (1987), Get a Grip on Your Math (1996), Slippery Math in Public Affairs: Price Tag and Defense (2002), Think First, Apply MATH, Think Further: Food for Thought (2005), and The Life and Times of the Central Limit Theorem, Second Edition (2009). His concern with the slippery side of math, with what math can do for us, and with its limitations is a prominent feature of his writings on applications. Concerning higher education in general, he is the author of The Nifty-Gritty in the Life of a University (2007).


    Book preview

    Statistics - Ramune B. Adams

    Copyright © 2009 by Mitchell P. Preiss, Irwin Kabus, and William J. Adams.

    All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without permission in writing from the copyright owner.

    Any people depicted in stock imagery provided by Getty Images are models, and such images are being used for illustrative purposes only.

    Certain stock imagery © Getty Images.

    Statistics, Basic Principles and Applications, Revised Second Edition, is available on the web at webpage.pace.edu/wadams.

    Rev. date: 04/06/2018

    Xlibris

    1-888-795-4274

    www.Xlibris.com

    576178

    To

    Onutė, Andrius, Ramunė, and Rasa Adams

    Judy, Alan, Robert, and David Kabus

    Lucille Preiss and the Memory of Charles Preiss

    CONTENTS

    Preface to the Second Edition, Revised

    Perspective on the Chapters and Exercises

    To Our Students

    What Is Statistics?

    Statistics: The Computation Dimension

    PART ONE: THE NATURE, COLLECTION, ORGANIZATION, AND PROPERTIES OF DATA

    1. On Guard!

    1.1 How Good Are the Data?

    1.2 Are the Data Reliable?

    1.3 Slippery Statistics: An International Dimension

    1.4 Are the Data, Their Characteristics, and the Framework Generating Them Well-Chosen?

    1.5 There’s More to It Than What the Statistics May Suggest

    1.6 What Do the Statistics Say?

    1.7 The Thrust to Quantify

    1.8 Ask Yourself

    1.9 Suggestions for Further Reading

    2. The Basic Raw Materials: Data

    2.1 Data Scales

    2.2 Survey of Sampling Methods

    2.3 More on Random Sampling

    2.4 Problems of Sampling

    2.5 Polls, Surveys, and Questionnaires

    2.6 Qualitative vs. Statistical Studies

    2.7 How Trustworthy Are These Poll/Survey Results?

    3. Making Mountains of Data Manageable

    3.1 Frequency Distributions

    3.2 Seeing Is Believing

    3.3 Seeing Is Misleading

    4. Descriptive Measures for Ungrouped Data

    4.1 Notation

    4.2 Accuracy

    4.3 Measures of Location

    4.4 Measures of Variation

    5. Descriptive Measures for Grouped Data

    5.1 Preface

    5.2 Measures of Location

    5.3 Measures of Variation

    SELF-TESTS FOR PART ONE

    PART TWO: PROBABILITY BACKGROUND FOR STATISTICAL INFERENCE

    6. Uncertainty and Probability

    6.1 Preface to Probability

    6.2 Finite Probability Model

    6.3 Which is the Right Probability Model?

    6.4 Probability Models for Random Processes

    6.5 Return to Equally Likely Outcome Models

    6.6 Independent Events

    6.7 Bernoulli Trial Probability Models

    6.8 Interpretations of Probability

    6.9 Probabilities and Odds

    7. Random Variables

    7.1 Random Variables, Expected Values, and Probability Distributions

    7.2 Variance and Standard Deviation

    7.3 Bernoulli Trial Random Variables

    8. The Remarkable Normal Curves

    8.1 The Family of Normal Curves

    8.2 Normal Curve Estimates for Bernoulli Trial Probabilities

    8.3 Normally Distributed Random Variables

    8.4 Normally Distributed Populations

    9. Sampling Distributions

    9.1 Sampling Distributions of Sample Statistics

    9.2 The Sampling Distribution of the Sample Mean

    9.3 A Central Limit Theorem

    SELF-TESTS FOR PART TWO

    PART THREE: INTRODUCTION TO STATISTICAL INFERENCE

    10. Estimation

    10.1 Estimation Problems

    10.2 Estimation of Means: Large Sample Case

    10.3 The Small Sample Case and the t Distributions

    10.4 What Should Be Considered in Constructing Confidence Intervals?

    10.5 Estimation of Proportions

    10.6 The Central Limit Theorems

    10.7 The Chi-Square Distributions

    10.8 Estimation of Variance and Standard Deviation

    10.9 Estimators and Their Properties

    11. Hypothesis Testing

    11.1 Trial by Statistics

    11.2 A Case Study: Tests Concerning Means

    11.3 Balancing Type I and Type II Errors

    11.4 Power and Operating Characteristic Curves

    11.5 Tests Concerning Proportions

    11.6 Variance and Standard Deviation

    11.7 Difference Between Means

    11.8 Which Math Model is Better Suited to the Problem?

    11.9 Matched Pairs’ Mean Difference

    11.10 The Family of F Distributions

    11.11 Equality of Variances and Standard Deviations

    11.12 p-values

    11.13 Hypothesis Testing and Confidence Intervals

    11.14 Robustness

    11.15 Hypothesis Testing: A Decision Making Procedure?

    11.16 Strings Attached vs. Robustness

    SELF-TESTS FOR PART THREE

    PART FOUR: LINEAR REGRESSION AND CORRELATION

    12. Linear Regression

    12.1 Problems of Prediction

    12.2 Scatter Diagrams

    12.3 Fitting the Best Regression Line: The Method of Least Squares

    12.4 Cautions

    12.5 The Population Regression Line

    13. Linear Correlation

    13.1 The Coefficient of Determination

    13.2 The Sample Correlation Coefficient

    13.3 The Population Correlation Coefficient

    13.4 A Hypothesis Test for ρ

    SELF-TESTS FOR PARTS ONE THROUGH FOUR

    PART FIVE: SELECTED TOPICS

    14. Index Numbers and Index Number Modeling

    14.1 The Nature of Index Numbers

    14.2 Unweighted Index Numbers

    14.3 Weighted Index Numbers

    14.4 Index Number Models

    14.5 Problems Faced By Index Number Model Builders

    14.6 Some Important Index Number Models and Their Indexes

    14.7 Determining Real Dollar Amounts

    14.8 The Consumer Price Index Model: Realism vs. Politics

    14.9 Keep in Mind: Limitations of Index Number Models

    15. Time Series Analysis and Forecasting

    15.1 Introduction

    15.2 Components and Models of a Time Series

    15.3 Moving Averages

    15.4 Exponential Smoothing

    15.5 Trend Determination by the Method of Least Squares

    15.6 Measuring Seasonal Variation

    15.7 Deseasonalizing Data

    15.8 Determination of Cyclical Indexes

    15.9 Forecasting from Time Series Data

    16. Nonparametric Statistics

    16.1 Parametric Versus Nonparametric Methods

    16.2 A One-Sample Sign Test

    16.3 A Paired-Sample Sign Test

    16.4 Tests for Normality

    16.5 Rank Correlation

    16.6 Runs: A Test for Randomness

    16.7 The Mann-Whitney U Test

    16.8 Advantages and Disadvantages of Nonparametric Methods

    17. Additional Tests of Hypotheses

    17.1 Difference Between Proportions

    17.2 Contingency Tables

    17.3 Goodness of Fit

    Tables

    Answers to Selected Exercises

    Answers to Selected Self-Test Questions

    Self-Tests for Part 1

    Self-Tests for Part 2

    Self-Tests for Part 3

    Self-Tests for Parts 1-4

    Preface to the Second Edition, Revised

    Still another book on statistics, when the weight of available books on the subject must come close to a ton? "What's the point?" you would be justified in asking.

    The rationale for this addition to the ton is founded on our conviction that the introductory statistics literature we are familiar with does not adequately address the ten fundamental guidelines noted below. These ten commandments, as we term them, should, in our judgment, be given high priority in an exposition of basic statistics.

    • The focus should be placed on a clear development of basic ideas and principles.

    • The exposition of these basic ideas and principles should be streamlined so as to avoid having the undergrowth get in the way of the statistical forest.

    • High priority should be given to the assumptions that underlie the application of statistical principles.

    • Understanding of abuses, misuses, and misunderstandings which have arisen from the application of statistics is essential for a correct understanding of statistics.

    • The coverage should provide students with sufficient preparation for continued study of intermediate and advanced level statistics or disciplines which use statistical methodology.

    • The exposition should be readable and understandable by students without sacrifice of mathematical accuracy.

    • The organization should clearly distinguish mainstream topics inherent in every basic level statistics course, irrespective of applied interests, from topics of special interest to particular audience segments.

    • The computation dimension should not be given equal billing with statistical principles and ideas. Statistics is the master and, important as it is, the computation tool is the servant.

    • Thought-provoking exercises—ones that exercise the little grey cells, as Hercule Poirot would put it—should be a prominent part of the exposition.

    • Exercise banks to help the student see statistics as a whole are important.

    To realize the second of our commandments we ruthlessly cut back on peripheral topics.

    Much attention is given to the assumptions that underlie statistical practice to put into perspective the conditions required before statistical techniques can be applied and to combat the disturbing attitude that statistical computation is condition free.

    Beginning with the first chapter much attention is given to the misleading or incorrect use of statistics. Much attention is also given to the issue of how statistics may be interpreted.

    The exposition does not cater to any one group of applied interests. Rather it would be appropriate to say that it caters to them all, for as is well known, basic principles of statistics underlie the needs of all applied interests.

    Illustrative examples and applications have been chosen to avoid unnecessary technical features that detract from the main point, to be accessible to the intended student audience without special background, and to clearly illustrate the statistical point in question. We desire applications to be interesting but, needless to say, this quality is as much in the mind of the reader as in the eye of the writer.

    It is our conviction that all applications are important when it comes to gaining insight into the scope of statistical principles in operation. There is nothing more limiting to a sound understanding of statistics than the parochial view that applications of interest are defined by one’s discipline of interest.

    This book is a suitable text for a variety of audiences, including students majoring in a business discipline, economics, education, the life and health sciences, psychology, and the social sciences. It is also suitable for a non-calculus introduction to probability and statistics for students taking a major in computer science, mathematics, or one of the physical sciences.

    Our other concern is with the student who, apart from future course work and professional needs, wants to be a well-educated person. Statistics and its wide spectrum of applications is one of the great triumphs of the human intellect and, like fine art, fine music, and fine literature, brings us together as human beings interested in partaking of the best of the human spirit.

    The depth and rate at which topics are covered will, of course, depend on the audience at hand and its level of mathematics preparation. Strictly speaking, there are no mathematical prerequisites for this exposition apart from some knowledge of algebra, but the rate at which one may progress depends on the mathematical maturity and sophistication of the audience. This book is suitable for a one or two semester introductory course in statistics or probability and statistics, with the main focus on statistics, depending on course structure and time parameters.

    Fundamental statistical principles and the probability foundation that underlies them are developed in the first four parts of the book, which form the mainstream for any course in statistics. Part 5 presents selected topics from which one may pick and choose. The basic structure of this book is shown by the tree diagram that follows.

    In the revised second edition a number of refinements in the exposition have been made to add clarity, sharpen focus, and update background data.

    Acknowledgments

    We should like to express our appreciation to Ramunė Adams for preparing the illustrations, to Pace University’s Council of Deans for its support in the form of sabbaticals and a summer research grant, to our colleague Dr. Michael Kazlow for his insights on the computation dimension of statistics, to the Pace Word Processing Department for its assistance in preparing the manuscript, to our editor Bill Walsh for his support and encouragement, and to Joan van Glabek, whose queries to the authors saved us from a number of embarrassing missteps.

    Most of all, we wish to acknowledge our debt to our students, who played a major role in shaping and refining our ideas on the presentation of statistics.

    W. J. A.

    I. K.

    M. P. P.

    Availability

    To make this book widely available I have put it on the web at webpage.pace.edu/wadams.

    W. J. A.

    [Tree diagram showing the basic structure of the book]

    Perspective on the Chapters and Exercises

    Chapter 1: Since data are the basic raw materials of statistical analysis—the food for statistical analysis, if you will—it is essential that their quality be given suitable attention. At the very least this vital concern warrants a chapter in its own right.

    Chapter 2: The nature of the data that must be taken into account prior to employing statistical analysis (data scales) and the sampling methods for obtaining data are addressed.

    Example 1 introduces the Heavy Basic Statistics text situation, to which we return a number of times throughout the book, each time adding another dimension of interest.

    Chapters 3, 4, and 5: With attention having been given to the reliability, relevance, and tools for obtaining data in Chapters 1 and 2, the stage is set for developing tools for making data manageable and visible, and obtaining useful numerical descriptions for them. This is undertaken in Chapters 3, 4, and 5. The Heavy Basic Statistics situation is employed to illustrate the concepts of frequency distribution, stem-and-leaf plot, histogram, frequency polygon, and ogive in Chapter 3, and the concepts of mean, median, and standard deviation of a frequency distribution in Chapter 5.

    Needless to say, statistics has a significant computation dimension, and the issue of computation accuracy unavoidably arises. This issue is given attention in Section 4.2, which should be considered a most important background section.

    Chapters 6 and 7: Chapter 6 is a linchpin for the introductory study of probability and statistics. Properties of relative frequency are used as a springboard for developing the concept of finite probability model. Probability models are developed for random processes with much attention being paid to the role of assumptions in formulating such models. The subjective probability interpretation of probability is introduced and a careful distinction is made between probability and its relative frequency and subjective probability interpretations.

    Chapter 7 is a short but important chapter which introduces the pivotal concepts of random variable, expected value and probability distribution.

    Chapters 8 and 9: The normal curves, or distributions, play such an important role in probability and statistics that they deserve a major chapter in their own right. Chapter 8 on the normal curves sets the stage for consideration of their central role in statistical inference. It is illuminating to look at the historical setting which gave rise to the normal curves and this is treated in Chapter 8 as well. The Heavy Basic Statistics situation makes an appearance in connection with the question of whether a population’s distribution can be described by a normal curve.

    The short but pivotal Chapter 9 introduces the key concept of sampling distribution, which underlies statistical inference, and a central limit theorem which establishes a link between sampling distributions and the normal curves.

    With these foundation stones in place we are ready to undertake a study of statistical inference, which is initiated in the next two chapters.

    Technique vs. Theorems

    With calculators and computers so readily available, crunching the numbers has almost become a trivial exercise. A blessing? Certainly. The downside, however, is that many, dare we say most, students and users of statistics come to see statistics as a calculation free-for-all. "Throw the data into the computer and let it do its thing" is well on its way to becoming standard operating procedure for life with statistics.

    The hypotheses of theorems—strings attached, as we refer to them—guide number crunching and technique, and to the extent possible in what is generally viewed as a number-crunching course we pay a good deal of attention to strings attached. This point of view is reflected in a number of exercises. We strongly believe that to instruct a student in the use of statistics without emphasizing conditions that make possible its meaningful application would be as irresponsible as instructing a student in the use of firearms without emphasizing safety conditions that make possible their safe use. In both instances the potential exists for inflicting serious injury and damage.

    Chapters 10 and 11: Both chapters are concerned with defining the framework of statistical inference, confidence interval analysis in Chapter 10 and hypothesis testing in Chapter 11. While this is the case, it is with these chapters that one begins to see substantial payoff in terms of applications. Although some applied payoff is to be seen in Chapters 6-9, their major role is to set in place the foundation needed for statistical inference defined in Chapters 10 and 11 with their extensions in Chapters 12, 13, 16, and 17.

    The Heavy Basic Statistics text situation reappears in Exercise 9 of Section 10.3 (page 281) in connection with obtaining a confidence interval for the population mean. This setting illustrates the unity of ingredients obtained from Sections 2.3 (a sample drawn at random by using a table of random numbers), 5.2 (the mean obtained from a frequency distribution), and 5.3 (the standard deviation obtained from a frequency distribution). It also serves as a specific setting for addressing the question of whether a population is normally distributed, a condition required for constructing the confidence interval asked for in the aforenoted Exercise 9. This question is first considered in Example 1 of Section 8.4 and re-examined from the point of view of hypothesis testing in Example 1 of Section 16.4 by means of the Kolmogorov-Smirnov test and Example 3 of Section 17.3 by means of the chi-square goodness of fit test.

    Both are large chapters, especially Chapter 11, and a question might arise as to which sections should be given priority should time or other considerations not allow all to be covered. For Chapter 10 we recommend Sections 10.1, 10.2, 10.3, 10.4, 10.6 and 10.7; for Chapter 11 we recommend 11.1, 11.2, 11.7, 11.8, 11.10, 11.11, 11.13-11.16.

    Chapters 12 and 13: The Daniel company, a new entry in the fine baked goods market, wants to predict sales volume on the basis of advertising expenditure. This setting and problem are introduced in Example 1 of Chapter 12 and serve as a vehicle for defining and illustrating ideas needed for the development of linear regression and correlation analysis in Chapters 12 and 13.

    Chapters 14 and 15: A parochial view has it that index numbers and time series are the exclusive domain of business students. A more enlightened view has it that index numbers and time series deal with serious down-to-earth life issues that touch all of us, which makes them of general interest irrespective of major. We strongly subscribe to this more enlightened view.

    Chapter 16: What do you do when you desire to test a hypothesis but the conditions required of the test you have in mind are not satisfied, or the nature of your data renders the test inapplicable? Nonparametric methods, an interesting spectrum of which is taken up in Chapter 16, may be just the answer.

    Chapter 17: Chapter 17 provides an introduction to additional hypothesis testing techniques with a wide spectrum of interesting and important applications. In this chapter the chi-square distributions dominate the scene.

    Questions that Challenge "the Little Grey Cells": All questions challenge the little grey cells, as Hercule Poirot describes them, to some extent. Some questions are especially noteworthy in being thought provoking and helpful in leading to insights. In the spirit of the game of twenty questions, the following twenty are among our favorites.

    1. Sec. 1.5, page 24, No. 2

    2. Sec. 2.4, page 57, No. 4

    3. Sec. 2.5, page 57, No. 10

    4. Self-Tests for Part 1, Self-Test 2, page 143, No. 1

    5. Sec. 6.6, page 176, No. 17

    6. Sec. 8.4, page 236, No. 4

    7. Self-Tests for Part 2, Self-Test 3, page 253, No. 1

    8. Sec. 10.2, page 268, No. 10

    9. Sec. 10.2, page 271, No. 15

    10. Sec. 10.3, page 281, No. 9

    11. Sec. 10.4, page 286, No. 1

    12. Self-Tests for Parts 1-4, Self-Test 5, page 470, No. 3

    13. Sec. 10.8, page 305, No. 3

    14. Sec. 11.2, page 327, No. 20

    15. Sec. 11.8, page 360, No. 5

    16. Sec. 11.13, page 396, No. 3

    17. Self-Tests for Part 3, Self-Test 4, page 405, No. 1

    18. Self-Tests for Part 4, Self-Test 2, page 463, No. 2

    19. Sec. 16.3, page 579, No. 3

    20. Sec. 17.3, page 650, No. 6

    To Our Students

    Dear Student:

    As a student taking a course in statistics you have three major resources available to you: the course instructor, the text, and you yourself. Course instructors, as many of us have experienced, range in quality from a hindrance to truly helpful and even inspiring.

    Course texts also range in quality from a hindrance to most useful. Our primary objective in writing this book was to write an exposition that would be an ally to you in your encounter with statistics. It is intended to be useful and user-friendly.

    The third major resource that you bring to the course, you yourself, includes your commitment to the course and willingness to make best use of the other resources. It is the most important of all; if it is weak, then the others will be of little or no use; if it is strong, then weaknesses in the others can be overcome.

    This book presupposes your serious commitment to the study of statistics, which is not to say that we view the audience as composed of budding statisticians and mathematicians; quite the opposite. This text should not be viewed as just a collection of homework exercises. The level and style of the exposition reflect more than half a century of our collective teaching experience with students like you, and it was written under the assumption that it will be read. To do this effectively we urge you to read each section at least twice, the first time to get the general idea and the second time to nail down details. A third careful reading should also be undertaken. Basic definitions should be given careful attention and written down separately. Definitions provide the backbone of the exposition and must be mastered for the exposition itself to make sense. After completing a section we urge you to write a short summary, including a listing of technical terms, in your own words. We deliberately refrained from doing this to provide you with the opportunity to do so. This should be thought of as a standard exercise applicable to all sections.

    As to homework exercises in general, they are an integral part of the text and essential for its understanding. Some exercises are simple, straightforward, and not particularly interesting, but intended to help you get basic concepts and technique under control. The exercises progress to more challenging and interesting ones, with the intent of bringing together a number of ideas within a context for their application. The number of exercises that should be done to achieve an understanding of the material will vary from person to person, but there is no doubt that everyone must do a fair number of exercises to achieve that understanding. We believe that it is helpful to work together with fellow students and exchange ideas, as long as everyone in a group effort is contributing. At the same time, examinations are solo efforts, and beyond a certain point one should be prepared to fly solo. As an additional resource, answers to selected exercises are given at the end of the book.

    Self-tests have been included to help you bring blocks of material under control. We suggest that you do these self-tests under conditions similar to those under which actual tests are taken. Prepare yourself as you would for an actual test and then do a self-test at one sitting. You will be your own monitor and taskmaster.

    As to formulas, there are many formulas in this subject. The challenging part is to know which applies to which situation. It is not a matter of memorizing formulas, but of obtaining a sense of the context from which they come and to which they apply. This calls for an understanding of the subject rather than rote memorization. Some simple basic formulas should be known cold, so to speak, and more complex ones should be known well enough so that you know what to look for on a formula sheet and how to use it in a particular situation. As you work through the book we suggest that you prepare your own formula sheet for your use.

    Finally, there is the question of attitude toward statistics. Some would suggest that it is both hard and boring, with numerous calculations. One of our colleagues, in another department, jokingly describes statistics as "sadistics" to his students. The joke is clever, but it is neither funny nor accurate, and it contributes to a counterproductive poisoning of the well. What is hard and boring must, like beauty, be left to the eyes of the beholder. There are calculations, but the ones required in this text are not onerous and can be carried out with the assistance of a rather primitive hand-calculator. The more challenging, interesting, and essential components of statistics are to be found in its concepts and principles. It is these components which we emphasize, along with an indication of their wide spectrum of applications. Many people have been fascinated by these aspects of statistics and we hope that you will be too.

    With regards and best wishes for a successful undertaking,

    W.J.A.

    I.K.

    M.P.P.

    What Is Statistics?

    Until the middle of the nineteenth century the answer would have been that statistics is a collection of numerical data concerning economic, military, and political affairs of state. The term is still sometimes used or understood in this restricted sense to refer to data arising from some enterprise of interest. The enterprise might be governmental, business, economic, industrial, scientific, medical, or of a sporting nature, to take a few examples, so that we often hear mention of medical statistics, baseball statistics, and so on.

    The modern, more general use of the term refers to a discipline which may be viewed in terms of two components called Descriptive Statistics and Inferential Statistics. Descriptive Statistics is concerned with methods of organizing masses of numerical data, graphically displaying them, and obtaining numerical values which help us better understand and deal with the data that we have. To take an example, suppose President Marx of Huxley College requests data on the mathematics placement exam scores of all entering freshmen. The freshman class consists of 5000 students, so that the president would be confronted by a list of 5000 numerical values should his request be answered literally. The forest of data would be so overwhelming that he would be at a loss to make sense of the individual trees. An organization of the data which shows how they fall in various intervals of interest (70-79, 80-89, etc.), graphs that show these features visually, and numerical measures, such as the mean and median values of the data, would be more meaningful and useful to the president than a forest of 5000 values. The problems of organizing the data into suitable groups and obtaining numerical measures of the data are problems of Descriptive Statistics.
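
    To make the descriptive side concrete, here is a minimal sketch in Python (our own illustration, not part of the original exposition) that groups a handful of made-up placement scores into intervals and computes their mean and median; the real study would involve all 5000 values.

```python
import statistics
from collections import Counter

# Made-up placement-exam scores standing in for the full list of 5000.
scores = [72, 85, 91, 67, 88, 74, 95, 81, 78, 69, 84, 90, 76, 83, 88]

# Organize the data into the intervals 60-69, 70-79, 80-89, 90-99.
counts = Counter((s // 10) * 10 for s in scores)
for lower in sorted(counts):
    print(f"{lower}-{lower + 9}: {counts[lower]} scores")

# Numerical measures that summarize the whole collection.
print("mean:  ", round(statistics.mean(scores), 1))
print("median:", statistics.median(scores))
```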

    A statistical population is a collection of data viewed as a whole in its own right, whereas a sample is a part of this all-inclusive whole. Inferential Statistics is concerned with making and testing judgments about population values of interest on the basis of suitably selected samples and with obtaining measures of reliability for these judgments. Suppose, let us say, that the anthropologist Alice Williams claims that recently discovered humanoid fossil remains in Central America are one million years old. The problem of conducting a statistical test on a suitably chosen sample of these remains to test this claim is a problem of inferential statistics. We would also require a measure of how reliable the test result is, which is an accompanying problem of statistical inference. And then there are related issues of how a suitably chosen sample is defined and in fact chosen, and of deceptive statistics that arise through carelessness, ignorance, or deliberate intent.
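
    As a sketch of the inferential side (again our own illustration, not the book's), suppose a suitably selected sample of age estimates were available and we wished to test the claim that the population mean age is one million years. A one-sample t-test is one standard tool; the figures below are invented.

```python
from scipy import stats

# Invented age estimates (in millions of years) for a sample of fossil fragments.
sample_ages = [0.97, 1.04, 1.10, 0.96, 1.02, 0.98, 1.06, 0.99]

# Test the null hypothesis that the population mean age is 1.0 million years.
t_stat, p_value = stats.ttest_1samp(sample_ages, popmean=1.0)
print(f"t = {t_stat:.3f}, p-value = {p_value:.3f}")
```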

    Statistics may be likened to a river such as the powerful and majestic Mississippi. Like the Mississippi, it is fed by many tributary rivers and streams, two major ones being what we have identified as descriptive and inferential statistics. The other tributaries consist of mathematical methods, particularly from probability, and problems and issues which arose and continue to arise from such disciplines as agriculture, biology, business, economics, meteorology, the physical sciences, political science, psychology, and sociology which are statistical in nature. They have all fed the mighty river and benefitted from it.

    It is our objective in this text to examine some of the major tributaries which have fed the mighty river statistics, obtain a sense of the river as a whole, and issue warning signs where the river is in danger of being polluted through misuse and contamination.

    Statistics: The Computation Dimension

    Statistics has a significant computation dimension and the question naturally arises as to whether a computer is essential for life in the world of statistics. The answer to this question depends on which part of the world of statistics one inhabits. The highest priority of students taking a first course in statistics is to master basic principles and concepts, which the computation dimension should support and not interfere with. The computation requirements of this exposition can be satisfactorily carried out with a basic calculator (by today’s standards) with the four arithmetic operations and a key which allows one to take square roots. Such a machine leads to a reasonable amount of manual computation which gets one’s hands mildly dirty, in an arithmetic sense, but is healthy in a learning sense. In first encountering the concept of standard deviation, for example, working through the steps required of its definition helps one to obtain a real sense of this concept which cannot be achieved by only pressing a key on a sophisticated calculator or computer. Such is the case for all basic definitions, principles, and techniques.
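
    For example, here is what working through the steps required of the definition amounts to for the sample standard deviation, written out in Python with made-up data; each line mirrors a step one would carry out on a basic calculator.

```python
import math

data = [4, 8, 6, 5, 3, 7]                   # made-up sample values
n = len(data)

mean = sum(data) / n                        # step 1: the sample mean
deviations = [x - mean for x in data]       # step 2: deviations from the mean
squared = [d ** 2 for d in deviations]      # step 3: squared deviations
variance = sum(squared) / (n - 1)           # step 4: sample variance
std_dev = math.sqrt(variance)               # step 5: square root

print(f"mean = {mean:.2f}, standard deviation = {std_dev:.2f}")
```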

    Computation Allies: An Overview

    At some point, which cannot be pinpointed exactly, the pedagogical benefits of working out standard deviations and the like step by step no longer accrue, and it makes good sense to seek more powerful tools to get the computation job done as quickly and painlessly as possible. The following overview might be useful for those seeking an enhanced computation or computer dimension. In considering enhanced computation options one’s primary objectives and the trade-offs inherent in these options should be kept in mind.

    Calculators

    Calculators range from basic to sophisticated with a rich variety of push-button options, programs, and memory. Basic calculators are satisfactory for small scale needs, relatively inexpensive, and occupy little space. Such a calculator meets the needs of the exposition in this book.

    As to computer options, there are three main categories of statistical packages.

    Mainframe Packages

    The big four basic statistics packages are SPSS (Statistical Package for the Social Sciences), SAS (Statistical Analysis System), BMDP (Biomedical Computer Programs), and MINITAB, which began as a teaching tool and was extended to a research tool. Mainframe packages have the following advantages: One may have common data sets on the mainframe for student use. Security constraints permitting, these packages may be accessed from remote locations. One can have confidence in their numerical accuracy. Numerical results are generally not worked out by a computer by means of the tidy computational formulas cited in statistics books, because the round-off errors would be too large. Computer packages instead employ different algorithms, which must undergo extensive testing to ensure their accuracy. The aforementioned packages are based on well-known algorithms whose numerical accuracy is assured.

    Two disadvantages of mainframe packages are that a fair amount of start-up time must be invested to become comfortable with their use and that response times are sometimes slow.
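
    The point about numerical accuracy can be illustrated with a small experiment of our own (not tied to any particular package): the textbook computational formula for the variance, sum of squares minus n times the squared mean, can lose nearly all of its accuracy when the values are large relative to their spread, while a two-pass calculation based on deviations from the mean holds up.

```python
# Values with a large mean and a tiny spread: a stress test for round-off error.
data = [1e9 + d for d in (4.0, 7.0, 13.0, 16.0)]
n = len(data)
mean = sum(data) / n

# Textbook "computational" formula: (sum of squares - n * mean^2) / (n - 1).
naive_var = (sum(x * x for x in data) - n * mean * mean) / (n - 1)

# Two-pass formula: sum of squared deviations from the mean.
two_pass_var = sum((x - mean) ** 2 for x in data) / (n - 1)

print("computational formula:", naive_var)     # ruined by cancellation
print("two-pass formula:     ", two_pass_var)  # exact answer is 30.0
```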

    IBM Compatible or PC/MS DOS

    In part, this category contains mainframe packages that have been rewritten for the IBM PC environment. The core of such mainframe packages is carried over, but certain tests and functions may have to be obtained separately. Graphics on PC packages tend to be much stronger than on their mainframe counterparts.

    IBM-compatible offerings include SPSS, MINITAB (both of which have minimal graphics), SAS, STAT-GRAPHICS, SYSTAT, and EXECUSTAT. The last of these states the assumptions that must be satisfied for a technique to be executed. Many MS-DOS statistics packages have student versions with restricted features and data sets.

    Macintosh—Post Apple II

    There are versions of SPSS and MINITAB for the Macintosh environment. Their graphics capability is not high. There are also JMP, SYSTAT, and DATA DESK, which have high graphics capability. These packages employ an exploratory data analysis approach to statistics. Macintosh packages have student versions with restricted features and data sets.

    Unquestioned Acceptance of Computer Generated Statistics

    Unfortunately, the attitude that computer generated statistical results are infallible because they are computer generated has become widespread. Those who do not appreciate the GIGO (garbage in, garbage out) principle and those who desire quick results without investing the needed time and effort to understand the discipline are particularly vulnerable to this kind of computer virus. While the computer has made it possible for many to address problems that heretofore had been intractable, it has also made it possible for more and more people having less and less understanding of statistics to generate more nonsense more quickly than ever before.

    Henry Clay's sage observation that statistics are no substitute for judgment may be profitably extended: computer generated statistics are no substitute for an understanding of statistics. Achieving an understanding of the principles and ideas of statistics is our only defense against errors that we ourselves might fall into and those that others have fallen into.

    PART ONE

    THE NATURE, COLLECTION, ORGANIZATION, AND PROPERTIES OF DATA

    ■ On Guard!

    ■ The Basic Raw Materials: Data

    ■ Making Mountains of Data Manageable

    ■ Descriptive Measures for Ungrouped Data

    ■ Descriptive Measures for Grouped Data

    1

    On Guard!

    1.1 ■ How Good Are the Data?

    Data are the fundamental raw material of statistical analysis. For a meaningful and useful statistical analysis of a situation under study the data must be relevant to the objective of the study and reliable. As the examples in this chapter make clear, relevance and reliability are not characteristics that we may simply take for granted. We must always be on alert.


    Obtaining well-chosen data for a study is what might be called a pre-statistical concern, pre-statistical in the sense that the methods of descriptive and inferential statistics are not directed at this concern. They presuppose as their starting point that the data are reliable and relevant, and go on from there. Obtaining well-chosen data requires insight into the phenomenon under study which enables one to distinguish primary factors from secondary and irrelevant ones. If the mark is missed at this point, no amount of statistical machinery applied to organize and portray the data, compute their numerical characteristics, and infer conclusions will bail us out. The finest statistical tower of Pisa built on a shaky foundation is doomed to topple. It is unfortunate that the number of people having real insight into what they are doing is small, which is one reason why we have been favored with the proliferation of so much statistical junk.

    1.2 ■ Are the Data Reliable?

    Case 1. Are Statistically Dangerous Schools Necessarily Dangerous? Are Statistically Safe Schools Necessarily Safe?

    In July 1986 the New York City Board of Education issued a list of its most dangerous schools based on incident and crime reports that it had received. One respondent notes that he never felt unsafe or threatened in teaching at the fifth listed most dangerous junior high school. [Spector; 28] Another respondent comments: I believe that the scorecard . . . names not the most dangerous schools but the schools whose administrators have the courage to report what is really happening. [Richman; 25]

    In June 1994 New York City Schools Chancellor Ramon Cortines rejected data on violence in the city’s schools, saying that he suspected school administrators were underreporting acts of violence to make their schools appear less turbulent. [Dillon; 8].


    In September 1995 Edward Costikyan, the chairman of the Mayor’s Commission on School Safety, observed: ‘There seems to be a total absence of any reliable numbers on anything. How can you manage anything without knowing what you’re dealing with?’ [Toy; 30]. Are things better now? Well . . .

    In September 2007 New York City comptroller William C. Thompson Jr. stated that an audit showed that the city had not ensured that all principals accurately report violence in their schools, making it difficult for the public to assess their safety. [Gootman; 13]. Are things better now? Well . . .

    Case 2. Are American Students Really That Bad in Math and Science?

    Every few years another study appears which shows again that American students are worse in math and science than their counterparts in even the poorest countries. But are they being compared with their counterparts? Some say no. Without denying that there is much to be improved in American education, a growing number of critics have argued that the test results are flawed because American students in total are consistently being compared with the elite students of other countries. [Kolata; 9]

    Case 3. These Data May Give You Nightmares

    Halcion, manufactured by the Upjohn Company and introduced in the United States in 1983, is one of the world’s best known sleeping pills. Its main advantage over competing products, Upjohn has claimed, is in encouraging nighttime sleep without daytime drowsiness.

    How safe is Halcion? It received Food and Drug Administration approval, and its manufacturer claims that it is just as safe as other drugs of its kind. Dissenters argue that Halcion is more likely to cause symptoms such as amnesia, paranoia, and depression and that Upjohn engaged in data manipulation to conceal its side effects. This view emerged from a lawsuit filed by Ilo Grundberg, who killed her mother the day before her mother’s 83rd birthday and placed a birthday card in her hand. Mrs. Grundberg claimed that Halcion had made her psychotic, and charges against her were eventually dismissed. Upjohn settled the lawsuit with Mrs. Grundberg before it was to go to trial in August 1991, but in preparation for the suit it had to make available a good deal of data about Halcion to the plaintiff’s attorneys.

    Dr. Ian Oswald, who was head of the department of psychiatry at the University of Edinburgh and spent 30 years doing research on sleep, was retained as an expert witness. Dr. Oswald spent two years going over Upjohn’s data and concluded that Upjohn had known about the extent of the drug’s adverse effects for 20 years and had concealed these data. He concluded that the whole thing had been one long fraud. [Kolata; 18]. Dr. Graham Dukes, former medical director of the Dutch drug regulatory agency, who examined some of Upjohn’s data, believed that the data on Halcion had been organized in such a way as to minimize the drug’s adverse effects and that this could not have occurred accidentally.

    In reaction to the criticisms voiced, Britain, the Netherlands, and Belgium removed the sleeping pill from the market. A report issued in April 1994 by F.D.A. investigators stated that the Upjohn Company had engaged in ongoing misconduct with Halcion. The F.D.A. will investigate, it was announced. We have not heard anything further.

    Case 4. Is This Food Survey Healthy?

    A report issued in the fall of 1991 by a scientific panel and the General Accounting Office criticized the latest National Food Consumption Survey, carried out in 1987-88, as so flawed that its data are probably useless. The major problem is the survey’s low response rate of 34 percent, making it questionable whether the data are representative of the population. Follow-up studies are required of those who do not respond, but none were conducted.

    The flawed data are used for making major Government policy decisions involving school breakfast and lunch programs, food stamp allotments, setting pesticide levels in foods, calculating nutrient consumption levels, and determining the public’s exposure to pesticides and toxic metals. [Burros; 5]

    The significance of low response rates when a survey is taken is discussed in Section 2.5 Polls, Surveys, and Questionnaires.

    Case 5. How Solid Are Those Figures?

    A number of figures have been bandied about in the campaign to reduce youth smoking. Here are some of them.

    1. President Clinton warned that 1 million people would die prematurely if Congress did not pass tobacco legislation in 1998.

    2. Senator John McCain urged lawmakers to stop 3000 kids a day from starting this life-threatening addiction.

    3. After a $368.5 billion settlement proposal between state officials and tobacco producers was agreed to in 1997, the American Cancer Society stated that a 60 percent decrease in youth smoking could reduce premature deaths from diseases caused by tobacco by 1 million in coming years.

    4. Deputy Treasury Secretary Lawrence Summers cited studies saying that every 10 percent increase in the price of a pack of cigarettes would produce up to a 7 percent reduction in the number of children who smoke.


    5. Richard Kluger, author of Ashes to Ashes, a history of the battle between smoking and health in the United States, notes: ‘I think this whole business of trying to prevent kids from smoking being the impetus behind legislation is great politics. Nonsense in terms of anything you can put numbers next to.’ [Meier; 22]

    Case 6. Top of the Line Deception

    In 1992 the General Accounting Office audited seven Star Wars tests conducted between 1990 and 1992. It found that four of the test results described to Congress as successes were false whereas the three tests that were described as complete or partial failures were correct. [Weiner; 31]


    Case 7. Spin Versus Counterspin

    Speaking on television on Tuesday night of 3 August 1993, President Clinton described the budget legislation then before Congress as the largest deficit reduction in history. Almost immediately after the President spoke, Senator Robert Dole, Republican leader in the Senate, described the legislation as the largest tax increase in world history.

    Who is right? Neither; when the dollar amounts are adjusted for inflation so that dollar comparisons are meaningful, 1993’s budget bill is neither the biggest reduction measure nor the biggest tax increase in recent years. In 1993 dollars, the bill would lower the annual deficit by a projected total of $496 billion over five years; $241 billion of this would come from tax increases. The bill signed by George Bush in 1990 contained $532 billion in deficit reduction in terms of 1993 dollars. The bill signed by Ronald Reagan in 1982 raised taxes by $286 billion over five years in terms of 1993 dollars.

    For discussion of how to take inflation into account in comparing dollar amounts in different time periods, see Section 14.7 Determining Real Dollar Amounts.
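
    As a preview of that discussion, the adjustment amounts to scaling each nominal amount by the ratio of the base-year price index to the index for the year in question. The index values in the sketch below are rough, illustrative figures of our own choosing, not official data.

```python
# Illustrative consumer price index values (1993 chosen as the base year).
cpi = {1982: 96.5, 1990: 130.7, 1993: 144.5}

def to_1993_dollars(amount_billion, year):
    """Restate a nominal dollar amount from `year` in 1993 dollars."""
    return amount_billion * cpi[1993] / cpi[year]

# E.g., a nominal $400 billion figure from 1990 restated in 1993 dollars.
print(round(to_1993_dollars(400, 1990), 1))  # roughly 442.2
```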


    Case 8. Can We Trust TV Ratings?

    The life span of a television program is determined by the public’s reaction to it, which is measured by TV ratings. These ratings, produced by the Nielsen Company, estimate the audience in terms of the percentage of those sets in use which are tuned to each channel, called a share, or in terms of the percentage of the total possible audience, sets on or off, called a rating. Shares and ratings are further broken down according to the sex and age of viewers so that advertisers can better focus their advertising campaigns. These numbers determine the buying and selling of billions of dollars of television air time. They mean life or death to television programs. The half-hour comedy Good & Evil, which had promising ingredients in terms of writing, acting and production talent, had a short life after its premiere in the fall of 1991 because of low initial ratings. In March 1992 NBC announced that it was dropping two successful shows, Matlock and In the Heat of the Night, because the demographic numbers favored older viewers while the network wished to build around a more youthful audience.
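
    To pin down the two definitions, here is a quick sketch with invented counts: a share divides by the sets actually in use, a rating by all television households.

```python
tv_households = 93_000_000   # total possible audience, sets on or off
sets_in_use = 60_000_000     # households watching anything at the time
watching_show = 12_000_000   # households tuned to the program in question

share = 100 * watching_show / sets_in_use      # percentage of sets in use
rating = 100 * watching_show / tv_households   # percentage of all TV households
print(f"share = {share:.1f}%, rating = {rating:.1f}%")  # share = 20.0%, rating = 12.9%
```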

    Since 1986 the data which underlie the ratings have been collected by a device called a people-meter. The remote control part of a people-meter rests on top of the television set. When the set is turned on, the meter prompts viewers to enter their identification number. Information is provided on what channels are being beamed into the household and who is watching them. Nielsen puts its people-meter into 4000 households selected at random—that is, without bias—from the approximately 93 million homes in America with television.

    The people-meter data gathering system produced lower ratings for the networks than had been expected, and a serious question arose as to whether this was because of the increased or decreased accuracy of this system over the method it replaced. The networks commissioned a study of the Nielsen methodology, and two years later this Committee on Nationwide Television Audience Measurement (CONTAM) issued a nine-volume report that was highly critical of the Nielsen system. The report found evidence of button fatigue—that over time people did not push the buttons that would ensure data accuracy as they did in the beginning. CONTAM was highly critical of Nielsen’s sampling procedures for obtaining the 4,000 households that make up their sample; random sampling was envisioned in the methodology, but the actual sampling deviated significantly from this requirement. From this came ratings which were highly suspect. David Poltrack, senior vice president of research at CBS, observes: ‘The whole business is crazy. I don’t think there’s an advertising agency in the United States that could get up in front of its clients and justify the way business is done right now. It’s being bought on narrow based demographics, demographic targets which are not representative of product consumption in the United States.’ [24]

    Random sampling was called for in theory, but not delivered in practice. This yielded highly suspect ratings.


    The problem of achieving random sampling in practice is discussed in Sections 2.2 Survey of Sampling Methods and 2.3 More on Random Sampling.

    The problem now is to get honest viewer data from the sample of viewers chosen.

    Nielsen has overcome the statistical sampling problem, but it is still plagued by the problem of getting honest data from viewers in the sample selected. Its people-meter system for eliciting viewing data has been described as too mechanical and as not being user friendly. The problem of obtaining accurate viewer data remains. Matters came to a head in March 1997 with the results of what is termed the February sweeps, an intense ratings period that determines television’s winners and losers in terms of how $46 billion in advertising money will be allocated.

    According to the Nielsen ratings, the average number of American households watching prime-time television fell by over one million in February 1997 compared to February 1996. This was the fourth decline in the last five years. The networks do not find Nielsen’s numbers credible. As Don Ohlmeyer of NBC put it: ‘I don’t trust their numbers at all. They’re trying to measure 21st-century technology with an abacus.’ [Carter; 6]

    Nielsen’s response is that the networks are engaging in the time-dishonored practice of blame the messenger. It is safe to assume, however, that they are working to improve their data collection system.

    1.3 ■ Slippery Statistics: An International Dimension

    The Slippery Statistics Society (SSS) has an international clientele. Here are a few examples.

    Brazil

    In a frank conversation between television interviews that was inadvertently broadcast across his country, Brazil’s finance minister Rubens Ricupero expressed the sentiments of many kindred spirits when he confessed of economic indicators: ‘I have no scruples. What is good we take advantage of; what is bad, we hide.’ [Brooke; 3]. Minister Ricupero was immediately dismissed, but was this because of his performance or his indiscretion?

    Britain

    In the past The Economist has been critical of the U.K. Central Statistical Office as having ‘figures often tasting of fudge.’ [Duncan and Gross; 9, p. 66], [Economist; 10, p. 88], [Economist; 11, p. 65].

    China

    Chinese government statistics have run a gamut of slipperiness. After the Communist Party assumed control in 1949, government statistics were systematically distorted to serve the wishes of the new political establishment. During the period of the Cultural Revolution of the late 1960s and early ’70s, data-gathering was abandoned as unscientific.

    Since the passing of the Cultural Revolution, data-gathering and the publication of state statistics have resumed, and other pressures have developed. In May 1994 Zhang Sai, director of the State Statistical Bureau, ‘warned that distorted statistics are increasing tensions between Beijing and localities.’ [Tefft; 29]

    Foreign investors in China are wary of Chinese statistics and many have taken to generating their own.

    Japan

    By late July 1998 American financial experts reached the conclusion that the magnitude of Japan’s banking crisis was far worse than had been publicly acknowledged. The bad debts were estimated as being on the order of $1 trillion, nearly twice the official estimate. The true amount, financial experts emphasized, is hard to pin down because Japanese banks have been using accounting tricks to conceal debts that are not being paid. [Sanger; 26]


    Soviet Union and Russia

    From the beginnings of the Soviet State, Soviet statistics have acquired a reputation of being unreliable. (See [Clark; 7], [Shaffer; 27].) Writing in 1990, V.N. Kirichenko, Chairman of the USSR State Committee on Statistics, expressed a hope to ‘ensure the accuracy of the data . . . restore the trust in such data on the part of the Soviet and international public. The country can no longer afford to seek the right way with the help of trick mirrors.’ [Duncan and Gross; 9, p. 66] and [Kirichenko; 16, pp. 50-57].

    Since the breakup of the Soviet Union, Russia has continued to have problems with government statistics, but for a different reason. Rather than exaggerating output, the statistical pendulum has swung to the extreme of underestimating it. In June 1998 Russia’s top statisticians were arrested on charges of manipulating data to underestimate the production of Russian businesses to help them minimize their tax obligations. [Gordon; 14]

    United States

    In June 1998 the thrust of the Republican majority in the House of Representatives was to cut taxes beyond what was called for in the earlier balanced-budget agreement. But then there are the spending cuts needed to achieve balance. The Congressional Budget Office did not produce the numbers needed for this to work out, which prompted the Republican leadership to address a letter to the Appropriations subcommittee warning that if the C.B.O. did not begin to produce better numbers, ‘we must review [its] structure and funding.’ [The New York Times; 23, A22]


    1.4 ■ Are the Data, Their Characteristics, and the Framework Generating Them Well-Chosen?

    Reliable data are not always relevant, that is, well-chosen in connection with the intent of a study undertaken. If interest centers on the heights of those in a certain community and we are presented with their weights, very carefully obtained, then the data are reliable, but hardly well-chosen in connection with the focus of the study. Subsequent mathematical refinements or conclusions obtained from poorly chosen data might have the aura of precision, but they are no better than their basic starting point.

    Data may be described by various numerical characteristics (mean and median, for example) and the problem of determining which characteristic is most suitable for the situation at hand is both challenging and serious. This issue is explored in Case 10 and considered in Example 6 of Section 4.3 (page 112).

    Another dimension takes us a further step back. A decision-making framework is to be set up for some situation, let us say. The framework requires data. The data obtained may be consistent with the decision-making framework, but if that framework is poorly formulated, the data it leads to cannot be viewed as well-chosen. This dimension of suitability of data is explored in Case 13.

    Case 9. Which Data Best Reflect Airline Reliability?

    The longtime standard measure of an airline’s reliability is its percentage of on-time arrivals, where a flight is deemed on-time if it arrives within 15 minutes of its scheduled arrival time. Such data are widely trumpeted by airlines in their advertising campaigns.

    But is this statistic the best measure of an airline’s reliability? According to Julius Maldutis, an airline analyst with Salomon Brothers, the answer is no. Maldutis argues
