Logic and Critical Thinking in the Biomedical Sciences: Volume 2: Deductions Based Upon Quantitative Data
About this ebook
- Demonstrates that a great deal can be deduced from quantitative data, without applying any statistical or mathematical analyses
- Provides readers with simple techniques for quickly reviewing and finding important relationships hidden within large and complex sets of data
- Using examples drawn from the biomedical literature, discusses common pitfalls in data interpretation and how they can be avoided
Jules J. Berman
Jules Berman holds two Bachelor of Science degrees from MIT (in Mathematics and in Earth and Planetary Sciences), a PhD from Temple University, and an MD from the University of Miami. He was a graduate researcher at the Fels Cancer Research Institute (Temple University) and at the American Health Foundation in Valhalla, New York. He completed his postdoctoral studies at the US National Institutes of Health, and his residency at the George Washington University Medical Center in Washington, DC. Dr. Berman served as Chief of anatomic pathology, surgical pathology, and cytopathology at the Veterans Administration Medical Center in Baltimore, Maryland, where he held joint appointments at the University of Maryland Medical Center and at the Johns Hopkins Medical Institutions. In 1998, he transferred to the US National Institutes of Health as a Medical Officer and as the Program Director for Pathology Informatics in the Cancer Diagnosis Program at the National Cancer Institute. Dr. Berman is a past President of the Association for Pathology Informatics and is the 2011 recipient of the Association’s Lifetime Achievement Award. He is a listed author of more than 200 scientific publications and has written more than a dozen books in his three areas of expertise: informatics, computer programming, and pathology. Dr. Berman is currently a freelance writer.
Logic and Critical Thinking in the Biomedical Sciences
Volume 2: Deductions Based Upon Quantitative Data
First Edition
Jules J. Berman
Table of Contents
Cover image
Title page
Copyright
Other books by Jules J. Berman
Dedication
About the author
Preface
Abstract
1: Learning what counting tells us
Abstract
Section 1.1 Science is mostly about counting stuff
Section 1.2 Never count on an accurate count
Section 1.3 Large samples cannot compensate for nonrepresentative data
Section 1.4 The perils of combining data sets
Section 1.5 Compositionality: Why small outnumbers large
Section 1.6 Looking at data
Section 1.7 Counting mutations
Section 1.8 Chromosome length and the frequency of genetic diseases
Section 1.9 Counting instances of species
Section 1.10 Counting garbage
2: Drawing inferences from absences of data values
Abstract
Section 2.1 When the important data is what you do not see
Section 2.2 The power of negative thinking
Section 2.3 Absence of x-rays emitted by hot cups of coffee
Section 2.4 Absence of laboratory findings in SIDS (sudden infant death syndrome)
Section 2.5 Absence of lethal toxicity resulting from damage to the epigenome and systems that regulate gene expression
Section 2.6 Absence of deficiency diseases among highly conserved genes
Section 2.7 Absence of shared conserved noncoding elements
Section 2.8 Absence of animals with built-in wheels
Section 2.9 Absence of microcancers
Section 2.10 Absence of frogs on small islands
Section 2.11 Absence of great apes roaming outside Africa
Section 2.12 Absence of penguins in northern hemisphere
Section 2.13 Absence of samarium-146 isotope from earth
Section 2.14 Obligation to look for absences
3: Drawing inferences from data ranges
Abstract
Section 3.1 Why are data ranges important?
Section 3.2 The range of dust sizes that cause human disease
Section 3.3 When tumor cells have very small nuclei
Section 3.4 The range of heights that animals can jump
Section 3.5 Blood chemistry
Section 3.6 Narrow ranges of enzyme activity
Section 3.7 The number of different types of cancers
Section 3.8 Limits imposed by the dynamic range of measuring instruments
4: Drawing inferences from outliers and exceptions
Abstract
Section 4.1 One is the loneliest number
Section 4.2 Ozone, the outlier that couldn’t be believed
Section 4.3 Neoplasms having very short latency periods
Section 4.4 Outliers as sentinels for common diseases
Section 4.5 How exceptions elucidate pathogenesis
Section 4.6 Finding the outliers
5: What we learn when our data are abnormal
Abstract
Section 5.1 Creating normal distributions
Section 5.2 Pareto's principle and Zipf distribution in biological systems
Section 5.3 Pareto's bias: Favoring the common items
Section 5.4 Recognizing composite diseases
Section 5.5 Multimodality in population data
Section 5.6 Removing some of the mystery around ovarian cancers
Section 5.7 Living with Berkson's paradox
6: Using time to solve cause and effect dilemmas
Abstract
Section 6.1 Timing is everything
Section 6.2 Does anybody really know what time it is?
Section 6.3 Temporal paradoxes
Section 6.4 Timing the progression of cancer development
Section 6.5 When the temporal sequence is observed incorrectly
Section 6.6 Smoke and mirrors
Section 6.7 Refusing simple answers
Section 6.8 Dose-dependent effects and the fallacy of causation
Section 6.9 Time-window bias
Section 6.10 Replacing causation with pathogenesis
7: Heuristic methods that use random numbers
Abstract
Section 7.1 The value of randomness
Section 7.2 Repeated sampling
Section 7.3 Monte Carlo simulations for tumor growth and metastasis
Section 7.4 A seemingly unlikely string of occurrences
Section 7.5 Cancer is not caused by bad luck
Section 7.6 Several approaches to the birthday problem
Section 7.7 Modeling cancer incidence by age
Section 7.8 The Monty Hall puzzle
8: Estimations for biomedical data
Abstract
Section 8.1 The inestimable value of estimates
Section 8.2 The limit of hemoglobin concentration in red blood cells
Section 8.3 CODIS: How to do it all without having it all
Section 8.4 Some useful approximation methods
Section 8.5 Some useful numbers
Index
Copyright
Academic Press is an imprint of Elsevier
125 London Wall, London EC2Y 5AS, United Kingdom
525 B Street, Suite 1650, San Diego, CA 92101, United States
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom
© 2020 Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library
ISBN: 978-0-12-821369-8
For information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals
Publisher: Stacy Masucci
Acquisitions Editor: Rafael Teixeira
Editorial Project Manager: Pat Gonzalez
Production Project Manager: Punithavathy Govindaradjane
Cover Designer: Christian Bilbow
Typeset by SPi Global, India
Other books by Jules J. Berman
Dedication
For Luca
About the author
Jules J. Berman received two baccalaureate degrees from MIT: in Mathematics and in Earth and Planetary Sciences. He holds a PhD from Temple University and an MD from the University of Miami. He was a graduate student researcher in the Fels Cancer Research Institute at Temple University and at the American Health Foundation in Valhalla, New York. His postdoctoral studies were completed at the US National Institutes of Health, and his residency was completed at the George Washington University Medical Center in Washington, DC. Dr. Berman served as chief of anatomic pathology, surgical pathology, and cytopathology at the Veterans Administration Medical Center in Baltimore, Maryland, where he held joint appointments at the University of Maryland Medical Center and at the Johns Hopkins Medical Institutions. In 1998 he transferred to the US National Institutes of Health, as a medical officer and as the program director for Pathology Informatics in the Cancer Diagnosis Program at the National Cancer Institute. Dr. Berman is a past president of the Association for Pathology Informatics and the 2011 recipient of the Association's Lifetime Achievement Award. He has first authored more than 100 journal articles and has written 20 science books. His recent titles, published by Elsevier, include the following:
Taxonomic Guide to Infectious Diseases: Understanding the Biologic Classes of Pathogenic Organisms, 1st edition (2012)
Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information (2013)
Rare Diseases and Orphan Drugs: Keys to Understanding and Treating the Common Diseases (2014)
Repurposing Legacy Data: Innovative Case Studies (2015)
Data Simplification: Taming Information With Open Source Tools (2016)
Precision Medicine and the Reinvention of Human Disease (2018)
Principles and Practice of Big Data: Preparing, Sharing, and Analyzing Complex Information, Second Edition (2018)
Taxonomic Guide to Infectious Diseases: Understanding the Biologic Classes of Pathogenic Organisms, 2nd edition (2019)
Evolution's Clinical Guidebook: Translating Ancient Genes Into Precision Medicine (2019)
Preface
Abstract
In Volume I, we learned how to gain fresh insight into the fields of biology and medicine by applying a few deductive methods to simple observations. In Volume II, we tackle the challenges that arise when our available data are quantitative, not descriptive. All too often, individuals engaged in the biomedical sciences assume that numeric data must be left to the authorities (i.e., statisticians and data analysts) who are trained to apply sophisticated mathematical algorithms to their data. This is a terrible mistake insofar as the individuals who create data (e.g., biomedical scientists) are in the best position to understand what their data really mean. The purpose of Volume II is to provide readers with a set of practical skills that will permit them to understand and draw valid inferences from numeric data. Anyone who can count and multiply will have no problem understanding Volume II of Logic and Critical Thinking in the Biomedical Sciences.
Keywords
Quantitative data; Data analysis; Algorithms; Approximation; Estimation; Counting; Complex data
I often say that when you can measure what you are speaking about and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely in your thoughts advanced to the state of Science, whatever the matter may be.
William Thomson (Lord Kelvin). Popular lectures and addresses, Vol. 1, "Electrical Units of Measurement," 1883.
Logic and Critical Thinking in the Biomedical Sciences is divided into two volumes. Volume I, "Deductions Based Upon Simple Observations," demonstrated that much can be learned when we apply a little bit of logic to common observations, without relying on any mathematical or quantitative analyses. Volume II, "Deductions Based Upon Quantitative Data," follows the same theme, but instead of looking at images, genes, species, and anatomic structures, we'll be looking at numbers that are typically displayed in some sort of data set. In either case, the goal is the same: finding relationships among different things and using such knowledge to discover additional relationships among more things. Fundamentally, the pursuit of generalizable relationships is what science is all about.
Of course, to make sense of things and their relationships, we need to agree on what we mean by "things" and "relationships." Back in Volume I, when we were dealing with things that could be described in words, we were usually referring to living organisms, diseases, the chemical constituents of physical objects, or their classes (i.e., their assigned groups within a classification). Our relationships came in the form of assertions (e.g., "rats are mammals," "cancer is a disease," and "diseases result from a sequence of cellular events"). Now that we are about to move into the world of numbers and equations, how must we change our thinking? To find out, let's consider the words of Arthur Lyon Bowley, a statistician of prominence in the early 20th century¹:
Statistics are numerical statements of facts in any department of inquiry, placed in relation to each other; statistical methods are devices for abbreviating and classifying the statements and making clear the relations.
Bowley's definition of the purpose and methods of statistics fits our general definition of science: the pursuit of generalizable relationships. The logic of science is much the same, whether the subject matter is visual (e.g., anatomy, embryology, and histology), transactional (e.g., medicine), conceptual (e.g., mathematics), or numeric (e.g., statistics).
If I were developing a science curriculum for undergraduate students, I would stress four components:
1. Formulas: the abstract relationships that serve as the building blocks of science, along with how they were derived.
2. Algorithms: the methods by which formulas are applied to produce new knowledge.
3. Practice: time for problem solving.
4. History of science: providing context.
Such a curriculum would erase the artifactual and counterproductive boundaries separating the different scientific disciplines. Students would be taught to seek and find relationships wherever they may exist, without restraint. Would such a radical approach be successful? Who knows? Certainly the traditional approaches are not much to brag about. Here in the United States, despite all the education aimed at students, very few of us learn to love science or mathematics, two deeply human subjects that deserve all the love that we can muster.², ³
In lieu of developing a new, universal curriculum, I have written this book, Logic and Critical Thinking in the Biomedical Sciences, which is my way of approaching science in a nondisciplinary way. It is intended to appeal to students and professionals in the biomedical field, but the general analytic approaches described in this book have come from just about every field of science.
Volume II, "Deductions Based Upon Quantitative Data," is written specifically for readers who may have felt intimidated or bullied by their high school and college courses in statistics. The truth of the matter is that much of statistics, as it is traditionally taught, is inscrutable to the logical mind; some of it doesn't make much sense. Enlightened statisticians love to argue over the meaning and utility of the touchstones of hypothesis testing (e.g., p values) and data description (e.g., linear regression).⁴–⁶ Many of the most touted statistical formulas are based on assumptions that are untrue much of the time. When you have billions of data measurements and powerful computers to churn through all the numbers, there really is no need to assume much of anything about your data. The data will tell you what you need to know.
Readers who work in laboratories, who are involved in clinical trials, or who generate data of any type should avoid the habit of turning their hard-earned data over to statisticians. The widely held belief that only professionals dedicated to data analysis are capable of reaching a valid conclusion is unfounded. Two statisticians can look at the same set of data, each applying a different set of analytic methods, and arrive at irreconcilable conclusions.⁷ In point of fact, the scientists who design the experiments and produce the measurements are in the best position to do something creative with their output. Ideally, it is the experimentalist, not the statistician, who will pay attention to what the data are trying to say and who will discover the secrets hidden within the data set.
Fortunately, the mathematically challenged among us can seek guidance from a rich literature devoted to easy methods that quickly summarize data; these methods are sometimes referred to as simplified exploratory techniques.¹, ⁸ In this book, we will make use of a few summarizing tools (e.g., averages; standard deviations for our normal distributions; and nonparametric measures such as interquartile ranges, medians, maxima, and minima for other forms of distributions). We will also introduce a variety of estimation and approximation techniques as well as heuristic techniques that deploy random number generators.⁹ If you paid attention in your high school mathematics courses, you will have no trouble comprehending the contents of this book. Volume II, like Volume I, will demonstrate the kinds of inferences that can be drawn by applying logic to common observations.
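As a rough illustration of the summarizing tools named above, here is a minimal Python sketch. The list of values is invented for the example, and the choice of the standard library's statistics module is an assumption made here, not something taken from the book. The statistics.quantiles function requires Python 3.8 or later.

```python
import statistics

# Hypothetical measurements; the last value is a deliberate outlier.
values = [2, 4, 4, 5, 7, 9, 30]

mean = statistics.mean(values)        # average
stdev = statistics.stdev(values)      # sample standard deviation
median = statistics.median(values)    # middle value, robust to the outlier
q1, q2, q3 = statistics.quantiles(values, n=4)  # quartile cut points
iqr = q3 - q1                         # interquartile range

print("mean:", round(mean, 2), "stdev:", round(stdev, 2))
print("median:", median, "interquartile range:", iqr)
print("minimum:", min(values), "maximum:", max(values))
```

With a single large outlier in the list, the average is pulled well above the median, which is one reason the nonparametric measures are worth looking at whenever the data are not normally distributed.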
After reading this book, you might be interested in pursuing books that describe in detail specific data analytic methods, along with examples of how these methods should be used. I would recommend the following books, all of which stress the use of simple mathematical techniques, listed here in chronological order by date of publication:
Bowley AL. Elementary Manual of Statistics, Third Edition, 1920.¹
Tukey JW. Exploratory Data Analysis, 1977.⁸
Simon JL. Resampling: The New Statistics. Second Edition, 1997.¹⁰
Janert PK. Gnuplot in Action: Understanding Data with Graphs, 2009.¹¹
Lewis PD. R for Medicine and Biology, 2009.¹²
Janert PK. Data Analysis with Open Source Tools, 2010.⁴
Milo R, Phillips R. Cell Biology by the Numbers, 2015.¹³
Berman JJ. Data Simplification: Taming Information With Open Source Tools, 2016.¹⁴
How to read this book
Each chapter comes with its own reference section and its own glossary. Rather than filling the corpus of text with a lot of description and definitions, I packed the glossaries with terminology and explanations of specialized techniques. After reading the text for all the book chapters, readers may enjoy going back and reading the chapter glossaries, as stand-alone documents. Both volumes are heavily referenced, with approximately 1200 citations selected from almost every field of science. Readers are encouraged to read any or all of these primary resources.
This book is not written particularly for programmers, but on numerous occasions in Volume II, I included snippets of source code written in Python or Perl, just in case any of the readers wanted to write their own software programs to assist in the analysis of data. Nowadays, it never hurts to know how to program, and there are lots of versatile, freely available languages to choose from. Perl, Python, and Ruby are my personal favorites. I have written a number of programming books for biomedical professionals that readers of this book might find suitable to their needs.¹⁵–¹⁷ The R statistical programming environment has become very popular lately, and a rich literature is available on this subject, including books geared to biomedical scientists.¹²
This book contains many different logical inferences, way too many for any reader to remember. I thought it might be useful to provide a trick whereby readers can collect and peruse all the inferences, without the explanatory text. For lack of a better idea, every inference is consistently preceded with the pompous conjunctive: "Hence" (e.g., "I think. Hence, I am."). Doing so enables readers of the e-version of this book to quickly find every henced inference, just by repeating a "find" operation on the word. Hence, readers can inspect every logical conclusion herein, for purposes of amusement, erudition, or criticism.
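For readers who would rather script the search than repeat a find operation by hand, the following is a minimal Python sketch. The function name henced_inferences and the sample text are hypothetical, and the sentence splitter is deliberately naive (it breaks on any period, question mark, or exclamation point followed by whitespace, so abbreviations such as "e.g." would confuse it).

```python
import re

def henced_inferences(text):
    """Return every sentence in text that begins with the word 'Hence'."""
    # Naive sentence split: break after ., !, or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if s.startswith("Hence")]

# Hypothetical sample text, not a passage from the book.
sample = "I think. Hence, I am. Counting is hard. Hence, counts should be checked."

for inference in henced_inferences(sample):
    print(inference)
```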
References
[1] Bowley A.L. Elementary manual of statistics. 3rd ed. Westminster, England: P.S. King and Son; 1920.
[2] Rising above the gathering storm: energizing and employing America for a brighter economic future. Washington, DC: National Academy of Sciences, National Academy of Engineering, and Institute of Medicine, National Academies Press; 2007.
[3] Friedman T.L. Can’t keep a bad idea down. The New York Times. 2010 October 26.
[4] Janert P.K. Data analysis with open source tools. O'Reilly Media; 2010.
[5] Conlon I., Raff M. Size control in animal development. Cell. 1999;96:235–244.
[6] Nuzzo R. P values, the gold standard of statistical validity, are not as reliable as many scientists assume. Nature. 2014;506:150–152.
[7] Tatsioni A., Bonitsis N.G., Ioannidis J.P. Persistence of contradicted claims in the literature. JAMA. 2007;298:2517–2526.
[8] Tukey J.W. Exploratory data analysis. Boston, MA: Addison-Wesley; 1977.
[9] Diaconis P., Efron B. Computer-intensive methods in statistics. Scientific American. 1983 May;116–130.
[10] Simon J.L. Resampling: the new statistics. 2nd ed. 1997. Available from: http://www.resample.com/intro-text-online/ viewed on September 21, 2015.
[11] Janert P.K. Gnuplot in action: understanding data with graphs. Manning; 2009.
[12] Lewis P.D. R for medicine and biology. Sudbury: Jones and Bartlett Publishers; 2009.
[13] Milo R., Phillips R. Cell biology by the numbers. Oxford: Garland Science; 2015.
[14] Berman J.J. Data simplification: taming information with open source tools. Waltham, MA: Morgan Kaufmann; 2016.
[15] Berman J.J. Perl programming for medicine and biology. Sudbury, MA: Jones and Bartlett; 2007.
[16] Berman J.J. Ruby programming for medicine and biology. Sudbury, MA: Jones and Bartlett; 2008.
[17] Berman J.J. Methods in medical informatics: fundamentals of healthcare Programming in Perl, Python, and Ruby. Boca Raton: Chapman and Hall; 2010.
1: Learning what counting tells us
Abstract
The act of counting objects is a difficult and underappreciated task. Many of the entrenched misconceptions in science come from sloppy counting protocols. Knowing the number and diversity of data objects (i.e., how many classes of objects are present in the data set and how many members belong to each of those classes) tells us a great deal about the nature of biological data. This chapter teaches us that the process of counting biological objects (e.g., organisms, species, genes, proteins, and variants thereof) will often reveal or clarify profound biological mysteries. Examples will include why there are at least 50 million species of living organisms on earth; the significance of having just a handful of species belonging to the monotremes, while there are many thousands of species of beetles; why there is only a small number of general classes of body plans; what we learn by comparing the number of herbivorous mammals to the number of carnivorous mammals; why acquired diseases are more common than genetic diseases; and why rare diseases are biologically, not just numerically, different from common diseases.
Keywords
Counting; Speciation; Biological diversification; Mutation rate; Mutation burden; Disease incidence
Chapter outline
Section 1.1. Science is mostly about counting stuff
Section 1.2. Never count on an accurate count
Section 1.3. Large samples cannot compensate for nonrepresentative data
Section 1.4. The perils of combining data sets
Section 1.5. Compositionality: Why small outnumbers large
Section 1.6. Looking at data
Section 1.7. Counting mutations
Section 1.8. Chromosome length and the frequency of genetic diseases
Section 1.9. Counting instances of species
Section 1.10. Counting garbage
Glossary
References
Not everything that counts can be counted, and not everything that can be counted counts.
William Bruce Cameron.
Section 1.1 Science is mostly about counting stuff
Much of what we know about reality comes from counting. We count the number of occurrences of disease, the number of days that a disease persists, the number of working days lost to the disease, the number of emergency room visits prompted by the disease, and so on. Once we have all those counts, we need somewhere to put them, so we invent classifications into which we assign the various types of things that we’ve counted. Explained this way, our most common scientific pursuit seems trivial, but without those counts, we would never understand much of anything.
When we take a short break from counting things (e.g., when measuring sizes or looking for patterns), we typically use our preliminary observations as the basis for new counting projects. For example, we can set up weather stations with equipment that continuously monitors the temperature, humidity, barometric pressure, and wind velocity at multiple locations. Each of these measurements produces a waveform, demonstrating how a variable changes over time. Typically, analysts will take the waveform data and transform it into counted items, such as the number of days in the year wherein the temperature exceeded 37°C, or the number of instances wherein the humidity exceeded 80% while the wind velocity fell below 4 miles per hour. In the field of digital signal processing, signals are commonly transformed from the time domain (e.g., waveforms) to the frequency domain (counts of occurrences of a particular type) for the purpose of analysis and manipulation. [Glossary Digital signal processing, Signal, Time, Waveform]
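The step from continuous measurements to counted items can be sketched in a few lines of Python. The readings below are invented for illustration, and reducing each day to a single (temperature, humidity, wind speed) triple is a simplifying assumption; real station data would be sampled far more often.

```python
# Hypothetical daily readings: (temperature in °C, humidity in %, wind speed in mph)
readings = [
    (35.2, 70, 6.0),
    (38.1, 85, 3.0),
    (39.4, 60, 2.5),
    (36.9, 82, 3.9),
    (37.0, 90, 4.0),
]

# Count the days on which the temperature exceeded 37 °C.
hot_days = sum(1 for temp, _, _ in readings if temp > 37)

# Count the days on which humidity exceeded 80% while wind fell below 4 mph.
humid_calm_days = sum(1 for _, hum, wind in readings if hum > 80 and wind < 4)

print(hot_days, humid_calm_days)
```

Note that the thresholds are strict: a reading of exactly 37.0 °C or exactly 4.0 mph is not counted. Decisions of this kind (strict versus inclusive comparisons) are part of the counting protocol and change the result.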
In the realm of bioinformatics, we like to think that we have moved beyond merely counting things and into a new realm of analysis unlocked by the genetic code. The sequence of nucleotides determines the function of a gene, not the quantity of each component nucleotide. Nonetheless, counting retains a position of paramount importance in molecular biology. When we find a gene pattern or motif of significance, we count how often it appears in the genome. When we observe a gene variant, we look to see how often it occurs in the population and whether it correlates with any specific biological feature (e.g., a trait or a disease). We have found that gene expression is best determined by counting the number of expressed (i.e., mRNA) and translated (i.e., protein) sequences. It’s always the same story; just as soon as we discover anything of merit at the molecular level, we proceed to count what we’ve found.
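Counting how often a motif appears in a sequence can be sketched in a few lines of Python. The function name and the toy sequence below are hypothetical, and real motif searches usually involve far more than exact string matching; the sketch only shows that even this count depends on a rule, namely whether overlapping occurrences are allowed.

```python
def count_motif(sequence, motif, overlapping=True):
    """Count exact occurrences of motif in sequence.

    With overlapping=False, Python's str.count is used, which skips past
    each match before searching again.
    """
    if not motif:
        raise ValueError("motif must be a non-empty string")
    if not overlapping:
        return sequence.count(motif)
    count = 0
    start = sequence.find(motif)
    while start != -1:
        count += 1
        # Advance by one position so that overlapping matches are found.
        start = sequence.find(motif, start + 1)
    return count


# Toy sequence, invented for illustration.
print(count_motif("ATATATA", "ATA", overlapping=True))   # 3
print(count_motif("ATATATA", "ATA", overlapping=False))  # 2
```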
Section 1.2 Never count on an accurate count
Most people would agree that the simple act of counting data is something that can be done accurately and reproducibly from laboratory to laboratory. Actually, this is not the case. Counting is fraught with errors. Consider the problem of counting words in a paragraph. It seems straightforward, until you start asking yourself how you might deal with hyphenated words. “De-identified” is certainly one word. “Under-represented” is probably one word, but sometimes the hyphen is replaced by a space, and then it is certainly two words. How about the term “military-industrial,” which seems as though it should be two words? When a hyphen occurs at the end of a line, should we force a concatenation between the syllables at the end of one line and the start of the next?
Slashes are a tougher nut to crack than hyphens. How should we count terms that combine two related words by a slash, such as “medical/pharmaceutical,” one word or two words? If we believe that the slash is a word separator (i.e., slashes mark the end of one word and the beginning of another), then we would need to parse Web addresses into individual words. [Glossary Parsing]
For example:
www.science.com/stuff/neat_stuff/super_neat_stuff/balloons.htm
The Web address could be broken into a string of words if the “.” and “_” characters could be considered valid word separators. In that case the single Web address would consist of 11 words: www, science, com, stuff, neat, stuff, super, neat, stuff, balloons, and htm. If you were only counting words that match entries in a standard dictionary, then the split Web address would contain eight words: science, stuff, neat, stuff, super, neat, stuff, and balloons. If we defined a word as a string bounded by a space or a part-of-sentence separator (e.g., period, comma, colon, semicolon, question mark, exclamation mark, and end-of-line character), then the unsplit Web address would count as one word. If the word must match a dictionary term, then the unsplit Web address would count as zero words. So, which is it: 11 words, or 8 words, or 1 word or 0 words? [Glossary String]
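The four competing answers can be reproduced with a short Python sketch. The tiny dictionary below is a hypothetical stand-in for a standard dictionary, and the function names are invented for illustration. The sketch also assumes that punctuation marks end a word only when they sit at the end of a space-delimited string, which is what allows the unsplit Web address to count as a single word.

```python
import re

URL = "www.science.com/stuff/neat_stuff/super_neat_stuff/balloons.htm"

# Hypothetical stand-in for a standard dictionary.
DICTIONARY = {"science", "stuff", "neat", "super", "balloons"}


def split_on_separators(text):
    """Treat '.', '/', and '_' as word separators."""
    return [token for token in re.split(r"[./_]", text) if token]


def split_on_spaces(text):
    """Treat whitespace as the separator; strip trailing sentence punctuation."""
    stripped = (token.rstrip(".,:;?!") for token in text.split())
    return [token for token in stripped if token]


def dictionary_words(tokens):
    """Keep only tokens that match an entry in the dictionary."""
    return [token for token in tokens if token.lower() in DICTIONARY]


print(len(split_on_separators(URL)))                    # 11
print(len(dictionary_words(split_on_separators(URL))))  # 8
print(len(split_on_spaces(URL)))                        # 1
print(len(dictionary_words(split_on_spaces(URL))))      # 0
```

Each result is a correct count under its own rule, which is the point: a word count is meaningless unless the counting rule travels with it.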
This is just the start of the problem. How shall we deal with abbreviations¹,²? Should abbreviations be counted as one word, or as the sum of words represented by the abbreviation? Is “U.S.” one word or two words? Suppose, before counting words, the text is preprocessed to expand abbreviations (i.e., every instance of “U.S.” becomes an instance of United States, and UCLA would count as four words). This would yield an artificial increase in the number of words in the document. How would the word counter deal with abbreviations that look like words, such as “mumps,” which could be the name of a viral disease of childhood, or it could be an abbreviation for a computer language used by medical informaticians and expanded as “Massachusetts General Hospital Utility Multiprogramming System”?
How would we deal with numeric sequences appearing in the text? Should each numeric sequence be counted as a word? If not, how do we handle Roman numerals? Should “IV” be counted as a word, because it is composed of alphabetic characters, or should it be omitted as a word, because it is equivalent to the numeric value “4”? When we encounter “IV,” how can we be certain that we are parsing a Roman numeral? Could “IV,” within the context of our document, represent the abbreviation for “intravenous”?
It is obvious that the number of words in a document will depend on the particular method used to count the words. If we use a commercial word counting application, how can we know which word counting rules are applied? In the field of informatics, the total number of words is an important feature of a document. The total word count often appears in the denominator of common statistical measurements. Counting words is a highly specialized task, but estimating words is incredibly simple. My favorite estimator of the number of words in any text file is simply the size of the file (in bytes) divided by 6.5, the average number of characters in a word plus one separator character. When must we count, and when can we estimate? The US 2020 census will cost an estimated $15.6 billion and will probably be received with incredulity. Could we do a better job using estimation algorithms, at a tiny fraction of the cost?
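The file-size estimator described above takes only a few lines of Python. The function names are hypothetical, and the comparison with a whitespace-based count is an added illustration; how closely the two agree depends on the text and on its character encoding.

```python
def estimate_word_count(num_bytes, bytes_per_word=6.5):
    """Estimate the number of words from a size in bytes.

    6.5 is the divisor given in the text: the average number of characters
    in a word plus one separator character.
    """
    return round(num_bytes / bytes_per_word)


def count_whitespace_words(text):
    """One of many possible counting rules: split on runs of whitespace."""
    return len(text.split())


sample = "Not everything that counts can be counted"
size_in_bytes = len(sample.encode("utf-8"))
print(estimate_word_count(size_in_bytes), count_whitespace_words(sample))
```

For a file on disk, the size can be obtained without reading the contents (e.g., with `os.path.getsize`), which is why the estimate is so much cheaper than a count.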
The point here is that a simple counting task, such as word counting, can easily become complex. A complex counting task, involving subjective assessments of observations, seldom yields accurate results. When the criteria for counting change over time, then results that were merely inaccurate may devolve into irreproducibility. An example of a counting task that is both complex and subjective is the counting of hits and errors in baseball. The rules for counting errors are subjective and based on the scorer’s judgment of the intended purpose of the hit (e.g., sacrifice fly) and the expected number of bases reached in the absence of the error. The determination of an error sometimes depends on the outcome of the play after the presumptive error has occurred (i.e., on events that are not controlled or influenced by the error). Counting is also complex, with rules covering specific instances of play. For example, passed balls and wild pitches are not scored as errors; they are assigned to another category of play. Plays involving catchers are exempted from certain rules for errors that apply to fielders. It would be difficult to find an example of a counting task that is more complex than counting baseball errors.
Sometimes, counting criteria inadvertently exclude categories of items that should be counted. The diagnoses that appear on death certificates are chosen from a list of causes of death included in the International Classification of Diseases (ICD). Diagnoses collected from all of the death certificates issued in the United States are aggregated by the Centers for Disease Control and Prevention (CDC) and published in the National Vital Statistics Report.³ As it happens, “medical error” is not included as a cause of death in the ICD; hence, US casualties of medical errors are not counted as such in the official records. Official tally notwithstanding, it is estimated that about one of every six deaths in the United States results from medical error.³
Data analytics is particularly vulnerable to counting errors, as data are typically collected from multiple sources, each with its own method for annotating data. In addition, data resources may extend forwards and backwards in time, constantly adding new data and merging with legacy data sets. The criteria for counting data may change over time, producing misleading results. Here are a few examples of counts that changed radically when the rules for counting changed. [Glossary Data annotation, Metaanalysis, New data]
1. Suicides at Beachy Head
Beachy Head is a cliff in England with a straight vertical drop and a beautiful sea view. It is a favorite jumping-off point for suicides. The suicide rate at Beachy Head dropped as sharply as the cliff when the medical examiner made a small policy change. Henceforth, bodies found at the cliff bottom would be counted as suicides only if their postmortem toxicology screen was negative for alcohol. Intoxicated subjects were pronounced dead by virtue of accident (i.e., not suicide).⁴
2. The number of US-Korean War deaths
In the year 2000, nearly a half-century after the Korean War, the US Department of State downsized its long-standing count of US military war deaths to 36,616, down from an earlier figure of about 54,000. The drop of about 17,000 deaths resulted from the exclusion of US military deaths that occurred during the Korean War, in countries outside Korea.⁵ The old number reflected deaths during the Korean War; the newer number reflects deaths that occurred due to the Korean War. Aside from its historical interest, the alteration shows how collected counts may change retroactively.
3. The number of chromosomes in a human nucleus
Sometimes a count that is plainly wrong is repeated often enough to become credible. Once a number is written into the canon of accepted scientific facts, it becomes difficult to erase. Such a situation occurred when we began to count the number of human chromosomes, based on microscopic examination of chromosome spreads. In 1921 Theophilus Shickel Painter, a pioneer cytogeneticist, counted the number of chromosomes in a meiotic cell division of a spermatocyte. At the time, the technique for spreading and visualizing chromosomes was in its infancy. Painter’s best efforts led him to believe that there are 24 chromosomes in a haploid nucleus (or 48 chromosomes in a diploid nucleus). Other cytogeneticists tried their hand at counting chromosomes and confirmed Painter’s number. The official number of human chromosomes in a diploid cell remained at 48 for the ensuing 34 years. Then, in 1955, Joe Hin Tjio and Albert Levan decided to do a recount using more advanced chromosome-spreading techniques. They found a total of 46 chromosomes in diploid cells. Today, with high-resolution chromosome banding, there is simply no doubt that the number of chromosomes in a diploid human cell is 46, not 48. [Glossary Cytogeneticist]
Even when we have an accurate count, the way that we express numbers can be confusing. The normal chromosome complement in human male diploid somatic cells is 46 XY, indicating that males normally have two sets of 23 chromosomes, in which one is an X chromosome and one is a Y chromosome. An uninitiated observer, seeing the usual “46 XY” representation, might erroneously assume that males have 46 autosomal chromosomes plus two sex chromosomes (i.e., X + Y) producing a total complement of 48 chromosomes (46 + X + Y). Would it not make more sense to say that the normal male diploid karyotype is 44 XY, indicating that there are 44 autosomes plus one X chromosome plus one Y chromosome? (Fig. 1.1) [Glossary Karyotype, Translocation, X-chromosome, Y-chromosome]
Fig. 1.1 The normal karyotype of a male human, consisting of chromosomes captured during mitosis, spread out on a glass slide, stained, photographed, paired, and sorted by size. At the bottom right of the image are the X and Y chromosomes.
Theophilus Shickel Painter’s miscount of chromosomes occurred about a century ago. Surely, scientists have cleaned up their act since then! The Human Genome Project is a massive bioinformatics project in which multiple laboratories helped to sequence the three billion base pair haploid human genome. There are about two million species of proteins synthesized by human cells. If every protein had its own private gene containing its specific genetic code, then there would be about two million protein-coding genes contained in the human genome, and this number served as the earliest estimate for the number of protein-coding genes in the human genome. Over the years the estimated number of protein-coding genes fell, and estimates of 150,000 and 75,000 had their proponents. Based on an evaluation of segments of the genome that code for sequences that are translated into proteins, it turns out that we have somewhere in the vicinity of 20–25 thousand protein-coding genes (about 90-fold smaller than the first estimate of 2 million). Furthermore, when we study other living organisms, we find that humans are massive underachievers when it comes to our number of protein-coding genes. The humble rice grain has 46–56 thousand genes.⁶ [Glossary Human Genome Project]
Why is there such a large discrepancy between the early gene estimates and our present-day analyses? Counting is difficult when we do not fully understand the object that we are counting. We