Logic and Critical Thinking in the Biomedical Sciences: Volume 2: Deductions Based Upon Quantitative Data
Ebook · 617 pages · 7 hours


About this ebook

All too often, individuals engaged in the biomedical sciences assume that numeric data must be left to the proper authorities (e.g., statisticians and data analysts) who are trained to apply sophisticated mathematical algorithms to sets of data. This is a terrible mistake. Individuals with keen observational skills, regardless of their mathematical training, are in the best position to draw correct inferences from their own data and to guide the subsequent implementation of robust mathematical analyses. Volume 2 of Logic and Critical Thinking in the Biomedical Sciences provides readers with a repertoire of deductive non-mathematical methods that will help them draw useful inferences from their own data. Volumes 1 and 2 of Logic and Critical Thinking in the Biomedical Sciences are written for biomedical scientists and college-level students engaged in any of the life sciences, including bioinformatics and related data sciences.
  • Demonstrates that a great deal can be deduced from quantitative data, without applying any statistical or mathematical analyses
  • Provides readers with simple techniques for quickly reviewing and finding important relationships hidden within large and complex sets of data
  • Using examples drawn from the biomedical literature, discusses common pitfalls in data interpretation and how they can be avoided
Language: English
Release date: Jul 8, 2020
ISBN: 9780128213629
Author

Jules J. Berman

Jules Berman holds two Bachelor of Science degrees from MIT (in Mathematics and in Earth and Planetary Sciences), a PhD from Temple University, and an MD from the University of Miami. He was a graduate researcher at the Fels Cancer Research Institute (Temple University) and at the American Health Foundation in Valhalla, New York. He completed his postdoctoral studies at the US National Institutes of Health, and his residency at the George Washington University Medical Center in Washington, DC. Dr. Berman served as Chief of anatomic pathology, surgical pathology, and cytopathology at the Veterans Administration Medical Center in Baltimore, Maryland, where he held joint appointments at the University of Maryland Medical Center and at the Johns Hopkins Medical Institutions. In 1998, he transferred to the US National Institutes of Health as a Medical Officer and as the Program Director for Pathology Informatics in the Cancer Diagnosis Program at the National Cancer Institute. Dr. Berman is a past President of the Association for Pathology Informatics and is the 2011 recipient of the Association’s Lifetime Achievement Award. He is a listed author of more than 200 scientific publications and has written more than a dozen books in his three areas of expertise: informatics, computer programming, and pathology. Dr. Berman is currently a freelance writer.


    Book preview

    Logic and Critical Thinking in the Biomedical Sciences - Jules J. Berman

    Logic and Critical Thinking in the Biomedical Sciences

    Volume 2: Deductions Based Upon Quantitative Data

    First Edition

    Jules J. Berman

    Table of Contents

    Cover image

    Title page

    Copyright

    Other books by Jules J. Berman

    Dedication

    About the author

    Preface

    Abstract

    1: Learning what counting tells us

    Abstract

    Section 1.1 Science is mostly about counting stuff

    Section 1.2 Never count on an accurate count

    Section 1.3 Large samples cannot compensate for nonrepresentative data

    Section 1.4 The perils of combining data sets

    Section 1.5 Compositionality: Why small outnumbers large

    Section 1.6 Looking at data

    Section 1.7 Counting mutations

    Section 1.8 Chromosome length and the frequency of genetic diseases

    Section 1.9 Counting instances of species

    Section 1.10 Counting garbage

    2: Drawing inferences from absences of data values

    Abstract

    Section 2.1 When the important data is what you do not see

    Section 2.2 The power of negative thinking

    Section 2.3 Absence of x-rays emitted by hot cups of coffee

    Section 2.4 Absence of laboratory findings in SIDS (sudden infant death syndrome)

    Section 2.5 Absence of lethal toxicity resulting from damage to the epigenome and systems that regulate gene expression

    Section 2.6 Absence of deficiency diseases among highly conserved genes

    Section 2.7 Absence of shared conserved noncoding elements

    Section 2.8 Absence of animals with built-in wheels

    Section 2.9 Absence of microcancers

    Section 2.10 Absence of frogs on small islands

    Section 2.11 Absence of great apes roaming outside Africa

    Section 2.12 Absence of penguins in northern hemisphere

    Section 2.13 Absence of samarium-146 isotope from earth

    Section 2.14 Obligation to look for absences

    3: Drawing inferences from data ranges

    Abstract

    Section 3.1 Why are data ranges important?

    Section 3.2 The range of dust sizes that cause human disease

    Section 3.3 When tumor cells have very small nuclei

    Section 3.4 The range of heights that animals can jump

    Section 3.5 Blood chemistry

    Section 3.6 Narrow ranges of enzyme activity

    Section 3.7 The number of different types of cancers

    Section 3.8 Limits imposed by the dynamic range of measuring instruments

    4: Drawing inferences from outliers and exceptions

    Abstract

    Section 4.1 One is the loneliest number

    Section 4.2 Ozone, the outlier that couldn’t be believed

    Section 4.3 Neoplasms having very short latency periods

    Section 4.4 Outliers as sentinels for common diseases

    Section 4.5 How exceptions elucidate pathogenesis

    Section 4.6 Finding the outliers

    5: What we learn when our data are abnormal

    Abstract

    Section 5.1 Creating normal distributions

    Section 5.2 Pareto's principle and Zipf distribution in biological systems

    Section 5.3 Pareto's bias: Favoring the common items

    Section 5.4 Recognizing composite diseases

    Section 5.5 Multimodality in population data

    Section 5.6 Removing some of the mystery around ovarian cancers

    Section 5.7 Living with Berkson's paradox

    6: Using time to solve cause and effect dilemmas

    Abstract

    Section 6.1 Timing is everything

    Section 6.2 Does anybody really know what time it is?

    Section 6.3 Temporal paradoxes

    Section 6.4 Timing the progression of cancer development

    Section 6.5 When the temporal sequence is observed incorrectly

    Section 6.6 Smoke and mirrors

    Section 6.7 Refusing simple answers

    Section 6.8 Dose-dependent effects and the fallacy of causation

    Section 6.9 Time-window bias

    Section 6.10 Replacing causation with pathogenesis

    7: Heuristic methods that use random numbers

    Abstract

    Section 7.1 The value of randomness

    Section 7.2 Repeated sampling

    Section 7.3 Monte Carlo simulations for tumor growth and metastasis

    Section 7.4 A seemingly unlikely string of occurrences

    Section 7.5 Cancer is not caused by bad luck

    Section 7.6 Several approaches to the birthday problem

    Section 7.7 Modeling cancer incidence by age

    Section 7.8 The Monty Hall puzzle

    8: Estimations for biomedical data

    Abstract

    Section 8.1 The inestimable value of estimates

    Section 8.2 The limit of hemoglobin concentration in red blood cells

    Section 8.3 CODIS: How to do it all without having it all

    Section 8.4 Some useful approximation methods

    Section 8.5 Some useful numbers

    Index

    Copyright

    Academic Press is an imprint of Elsevier

    125 London Wall, London EC2Y 5AS, United Kingdom

    525 B Street, Suite 1650, San Diego, CA 92101, United States

    50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States

    The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom

    © 2020 Elsevier Inc. All rights reserved.

    No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

    This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

    Notices

    Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

    Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

    To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

    Library of Congress Cataloging-in-Publication Data

    A catalog record for this book is available from the Library of Congress

    British Library Cataloguing-in-Publication Data

    A catalogue record for this book is available from the British Library

    ISBN: 978-0-12-821369-8

    For information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals

    Publisher: Stacy Masucci

    Acquisitions Editor: Rafael Teixeira

    Editorial Project Manager: Pat Gonzalez

    Production Project Manager: Punithavathy Govindaradjane

    Cover Designer: Christian Bilbow

    Typeset by SPi Global, India

    Other books by Jules J. Berman

    Dedication

    For Luca

    About the author

    Jules J. Berman received two baccalaureate degrees from MIT: in Mathematics and in Earth and Planetary Sciences. He holds a PhD from Temple University and an MD from the University of Miami. He was a graduate student researcher in the Fels Cancer Research Institute at Temple University and at the American Health Foundation in Valhalla, New York. His postdoctoral studies were completed at the US National Institutes of Health, and his residency was completed at the George Washington University Medical Center in Washington, DC. Dr. Berman served as chief of anatomic pathology, surgical pathology, and cytopathology at the Veterans Administration Medical Center in Baltimore, Maryland, where he held joint appointments at the University of Maryland Medical Center and at the Johns Hopkins Medical Institutions. In 1998, he transferred to the US National Institutes of Health as a medical officer and as the program director for Pathology Informatics in the Cancer Diagnosis Program at the National Cancer Institute. Dr. Berman is a past president of the Association for Pathology Informatics and the 2011 recipient of the Association's Lifetime Achievement Award. He has first-authored more than 100 journal articles and has written 20 science books. His recent titles, published by Elsevier, include the following:

    Taxonomic Guide to Infectious Diseases: Understanding the Biologic Classes of Pathogenic Organisms, 1st edition (2012)

    Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information (2013)

    Rare Diseases and Orphan Drugs: Keys to Understanding and Treating the Common Diseases (2014)

    Repurposing Legacy Data: Innovative Case Studies (2015)

    Data Simplification: Taming Information With Open Source Tools (2016)

    Precision Medicine and the Reinvention of Human Disease (2018)

    Principles and Practice of Big Data: Preparing, Sharing, and Analyzing Complex Information, Second Edition (2018)

    Taxonomic Guide to Infectious Diseases: Understanding the Biologic Classes of Pathogenic Organisms, 2nd edition (2019)

    Evolution's Clinical Guidebook: Translating Ancient Genes Into Precision Medicine (2019)

    Preface

    Abstract

    In Volume I, we learned how to gain fresh insight into the fields of biology and medicine by applying a few deductive methods to simple observations. In Volume II, we tackle the challenges that arise when our available data are quantitative, not descriptive. All too often, individuals engaged in the biomedical sciences assume that numeric data must be left to the authorities (i.e., statisticians and data analysts) who are trained to apply sophisticated mathematical algorithms to their data. This is a terrible mistake insofar as the individuals who create data (e.g., biomedical scientists) are in the best position to understand what their data really mean. The purpose of Volume II is to provide readers with a set of practical skills that will permit them to understand and draw valid inferences from numeric data. Anyone who can count and multiply will have no problem understanding Volume II of Logic and Critical Thinking in the Biomedical Sciences.

    Keywords

    Quantitative data; Data analysis; Algorithms; Approximation; Estimation; Counting; Complex data

    I often say that when you can measure what you are speaking about and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely in your thoughts advanced to the state of Science, whatever the matter may be.

    William Thomson (Lord Kelvin). Popular lectures and addresses, Vol. 1, Electrical Units of Measurement, 1883.

    Logic and Critical Thinking in the Biomedical Sciences is divided into two volumes. Volume I, Deductions Based Upon Simple Observations, demonstrated that much can be learned when we apply a little bit of logic to common observations, without relying on any mathematical or quantitative analyses. Volume II, Deductions Based Upon Quantitative Data, follows the same theme, but instead of looking at images, genes, species, and anatomic structures, we'll be looking at numbers that are typically displayed in some sort of data set. In either case, the goal is the same: finding relationships among different things and using such knowledge to discover additional relationships among more things. Fundamentally the pursuit of generalizable relationships is what science is all about.

    Of course, to make sense of things and their relationships, we need to agree on what we mean by things and relationships. Back in Volume I, when we were dealing with things that could be described in words, we were usually referring to living organisms, diseases, the chemical constituents of physical objects, or their classes (i.e., their assigned groups within a classification). Our relationships came in the form of assertions (e.g., rats are mammals, cancer is a disease, and diseases result from a sequence of cellular events). Now that we are about to move into the world of numbers and equations, how must we change our thinking? To find out, let's consider the words of Arthur Lyon Bowley, a statistician of prominence in the early 20th century¹:

    Statistics are numerical statements of facts in any department of inquiry, placed in relation to each other; statistical methods are devices for abbreviating and classifying the statements and making clear the relations.

    Bowley's definition of the purpose and methods of statistics fits our general definition of science: the pursuit of generalizable relationships. The logic of science is much the same, whether the subject matter is visual (e.g., anatomy, embryology, and histology), transactional (e.g., medicine), conceptual (e.g., mathematics), or numeric (e.g., statistics).

    If I were developing a science curriculum for undergraduate students, I would stress four components:

    1. Formulas—the abstract relationships that serve as the building blocks of science, along with how they were derived.

    2. Algorithms—the methods by which formulas are applied to produce new knowledge.

    3. Practice—time for problem solving.

    4. History of science—providing context.

    Such a curriculum would erase the artifactual and counterproductive boundaries separating the different scientific disciplines. Students would be taught to seek and find relationships wherever they may exist, without restraint. Would such a radical approach be successful? Who knows? Certainly the traditional approaches are not much to brag about. Here in the United States, despite all the education aimed at students, very few of us learn to love science or mathematics, two deeply human subjects that deserve all the love that we can muster.², ³

    In lieu of developing a new, universal curriculum, I have written this book, Logic and Critical Thinking in the Biomedical Sciences, which is my way of approaching science in a nondisciplinary way. It is intended to appeal to students and professionals in the biomedical field, but the general analytic approaches described in this book have come from just about every field of science.

    Volume II, Deductions Based Upon Quantitative Data, is written specifically for readers who may have felt intimidated or bullied by their high school and college courses in statistics. The truth of the matter is that much of statistics, as it is traditionally taught, is inscrutable to the logical mind; some of it doesn't make much sense. Enlightened statisticians love to argue over the meaning and utility of the touchstones of hypothesis testing (e.g., p values) and data description (e.g., linear regression).⁴–⁶ Many of the most touted statistical formulas are based on assumptions that are untrue much of the time. When you have billions of data measurements and powerful computers to churn through all the numbers, there really is no need to assume much of anything about your data. The data will tell you what you need to know.

    Those readers who work in laboratories, who are involved in clinical trials, or who generate data of any type should avoid the habit of turning their hard-earned data over to statisticians. The widely held belief that only professionals dedicated to data analysis are capable of reaching a valid conclusion is unfounded. Two statisticians can look at the same set of data, each applying a different set of analytic methods, and arrive at any of several irreconcilable conclusions.⁷ In point of fact, the scientists who design the experiments and produce the measurements are in the best position to do something creative with their output. Ideally, it is the experimentalist, not the statistician, who will pay attention to what the data are trying to say and who will discover the secrets hidden within the data set.

    Fortunately, the mathematically challenged among us can seek guidance from a rich literature devoted to easy methods that quickly summarize data, sometimes referred to as simplified exploratory techniques.¹, ⁸ In this book, we will make use of a few summarizing tools (e.g., averages; standard deviations for our normal distributions; and nonparametric measures such as interquartiles, medians, maxima, and minima for other forms of distributions). We will also introduce a variety of estimation and approximation techniques, as well as heuristic techniques that deploy random number generators.⁹ If you paid attention in your high school mathematics courses, you will have no trouble comprehending the contents of this book. Volume II, like Volume I, will demonstrate the kinds of inferences that can be drawn by applying logic to common observations.
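
    To see how little machinery these summarizing tools require, here is a minimal Python sketch, using only the standard library and an invented list of sample values:

      import statistics

      # Invented sample values, for illustration only.
      data = [4.2, 4.8, 5.0, 5.1, 5.3, 5.9, 6.4, 7.0, 9.8]

      print("mean:               ", statistics.mean(data))
      print("standard deviation: ", statistics.stdev(data))
      print("median:             ", statistics.median(data))
      print("quartile cut points:", statistics.quantiles(data, n=4))
      print("minimum, maximum:   ", min(data), max(data))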

    After reading this book, you might be interested in pursuing books that describe specific data analytic methods in detail, along with examples of how these methods should be used. I would recommend the following books, all of which stress the use of simple mathematical techniques, listed here in chronologic order by date of publication:

    Bowley A.L. Elementary Manual of Statistics, Third edition, 1920.¹

    Tukey, JW. Exploratory Data Analysis, 1977.

    Simon JL. Resampling: The New Statistics. Second Edition, 1997.¹⁰

    Janert PK. Gnuplot in Action: Understanding Data with Graphs, 2009.¹¹

    Lewis PD. R for Medicine and Biology, 2009.¹²

    Janert PK. Data Analysis with Open Source Tools, 2010.

    Milo R, Phillips R. Cell Biology by the Numbers, 2015.¹³

    Berman JJ. Data Simplification: Taming Information With Open Source Tools, 2016.¹⁴

    How to read this book

    Each chapter comes with its own reference section and its own glossary. Rather than filling the corpus of text with a lot of description and definitions, I packed the glossaries with terminology and explanations of specialized techniques. After reading the text for all the book chapters, readers may enjoy going back and reading the chapter glossaries, as stand-alone documents. Both volumes are heavily referenced, with approximately 1200 citations selected from almost every field of science. Readers are encouraged to read any or all of these primary resources.

    This book is not written particularly for programmers, but on numerous occasions in Volume II, I included snippets of source code written in Python or Perl, just in case any of the readers wanted to write their own software programs to assist in the analysis of data. Nowadays, it never hurts to know how to program, and there are lots of versatile, freely available languages to choose from. Perl, Python, and Ruby are my personal favorites. I have written a number of programming books for biomedical professionals that readers of this book might find suitable to their needs.¹⁵–¹⁷ The R statistical programming environment has become very popular lately, and a rich literature is available on this subject, including books geared to biomedical scientists.¹²

    This book contains many different logical inferences, way too many for any reader to remember. I thought it might be useful to provide a trick whereby readers can collect and peruse all the inferences, without the explanatory text. For lack of a better idea, every inference is consistently preceded with the pompous conjunctive: Hence (e.g., I think. Hence, I am.). Doing so enables readers of the e-version of this book to quickly find every henced inference, just by repeating a find operation on the word. Hence, readers can inspect every logical conclusion herein, for purposes of amusement, erudition, or criticism.
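
    For readers who prefer to script that find operation, here is a minimal Python sketch; it assumes the e-book's text has been saved to a plain-text file (the filename is hypothetical):

      import re

      with open("logic_volume2.txt", encoding="utf-8") as f:
          text = f.read()

      # Collect every stretch of text running from the word "Hence" to the
      # next period. The word boundary (\b) keeps words such as
      # "Henceforth" from matching.
      for inference in re.findall(r"Hence\b[^.]*\.", text):
          print(inference)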

    References

    [1] Bowley A.L. Elementary manual of statistics. 3rd ed. Westminster, England: P.S. King and Son; 1920.

    [2] Rising above the gathering storm: energizing and employing America for a brighter economic future. Washington, DC: National Academy of Sciences, National Academy of Engineering, and Institute of Medicine, National Academies Press; 2007.

    [3] Friedman T.L. Can’t keep a bad idea down. The New York Times. 2010 October 26.

    [4] Janert P.K. Data analysis with open source tools. O'Reilly Media; 2010.

    [5] Conlon I., Raff M. Size control in animal development. Cell. 1999;96:235–244.

    [6] Nuzzo R. P values, the gold standard of statistical validity, are not as reliable as many scientists assume. Nature. 2014;506:150–152.

    [7] Tatsioni A., Bonitsis N.G., Ioannidis J.P. Persistence of contradicted claims in the literature. JAMA. 2007;298:2517–2526.

    [8] Tukey J.W. Exploratory data analysis. Boston, MA: Addison-Wesley; 1977.

    [9] Diaconis P., Efron B. Computer-intensive methods in statistics. Scientific American. 1983 May:116–130.

    [10] Simon J.L. Resampling: the new statistics. 2nd ed. 1997. Available from: http://www.resample.com/intro-text-online/ viewed on September 21, 2015.

    [11] Janert P.K. Gnuplot in action: understanding data with graphs. Manning; 2009.

    [12] Lewis P.D. R for medicine and biology. Sudbury: Jones and Bartlett Publishers; 2009.

    [13] Milo R., Phillips R. Cell biology by the numbers. Oxford: Garland Science; 2015.

    [14] Berman J.J. Data simplification: taming information with open source tools. Waltham, MA: Morgan Kaufmann; 2016.

    [15] Berman J.J. Perl programming for medicine and biology. Sudbury, MA: Jones and Bartlett; 2007.

    [16] Berman J.J. Ruby programming for medicine and biology. Sudbury, MA: Jones and Bartlett; 2008.

    [17] Berman J.J. Methods in medical informatics: fundamentals of healthcare Programming in Perl, Python, and Ruby. Boca Raton: Chapman and Hall; 2010.

    1

    Learning what counting tells us

    Abstract

    The act of counting objects is a difficult and underappreciated task. Many of the entrenched misconceptions in science come from sloppy counting protocols. Knowing the number and diversity of data objects (i.e., how many classes of objects are present in the data set and how many members belong to each of those classes) tells us a great deal about the nature of biological data. This chapter teaches us that the process of counting biological objects (e.g., organisms, species, genes, proteins, and variants thereof) will often reveal or clarify profound biological mysteries. Examples will include why there are at least 50 million species of living organisms on earth; the significance of having just a handful of species belonging to the monotremes, while there are many thousands of species of beetles; why there is only a small number of general classes of body plans; what we learn by comparing the number of herbivorous mammals to the number of carnivorous mammals; why acquired diseases are more common than genetic diseases; and why rare diseases are biologically, not just numerically, different from common diseases.

    Keywords

    Counting; Speciation; Biological diversification; Mutation rate; Mutation burden; Disease incidence

    Chapter outline

    Section 1.1. Science is mostly about counting stuff

    Section 1.2. Never count on an accurate count

    Section 1.3. Large samples cannot compensate for nonrepresentative data

    Section 1.4. The perils of combining data sets

    Section 1.5. Compositionality: Why small outnumbers large

    Section 1.6. Looking at data

    Section 1.7. Counting mutations

    Section 1.8. Chromosome length and the frequency of genetic diseases

    Section 1.9. Counting instances of species

    Section 1.10. Counting garbage

    Glossary

    References

    Not everything that counts can be counted, and not everything that can be counted counts.

    William Bruce Cameron.

    Section 1.1 Science is mostly about counting stuff

    Much of what we know about reality comes from counting. We count the number of occurrences of disease, the number of days that a disease persists, the number of working days lost to the disease, the number of emergency room visits prompted by the disease, and so on. Once we have all those counts, we need somewhere to put them, so we invent classifications into which we assign the various types of things that we’ve counted. Explained this way, our most common scientific pursuit seems trivial, but without those counts, we would never understand much of anything.

    When we take a short break from counting things (e.g., when measuring sizes or looking for patterns), we typically use our preliminary observations as the basis for new counting projects. For example, we can set up weather stations with equipment that continuously monitors the temperature, humidity, barometric pressure, and wind velocity at multiple locations. Each of these measurements produces a waveform, demonstrating how a variable changes over time. Typically, analysts will take the waveform data and transform it into counted items, such as the number of days in the year wherein the temperature exceeded 37°C, or the number of instances wherein the humidity exceeded 80% while the wind velocity fell below 4 miles per hour. In the field of digital signal processing, signals are commonly transformed from the time domain (e.g., waveforms) to the frequency domain (counts of occurrences of a particular type) for the purpose of analysis and manipulation. [Glossary Digital signal processing, Signal, Time, Waveform]
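
    Here is a minimal Python sketch of that waveform-to-counts transformation, using an invented week of weather-station readings:

      # Invented daily readings from a hypothetical weather station.
      temp_c   = [31.0, 36.5, 37.8, 38.2, 35.9, 37.1, 33.4]
      humidity = [0.55, 0.82, 0.88, 0.79, 0.91, 0.60, 0.85]
      wind_mph = [6.0, 3.5, 2.0, 5.5, 3.0, 7.0, 2.5]

      # Transform the measured waveforms into counted items.
      hot_days = sum(1 for t in temp_c if t > 37.0)
      humid_calm_days = sum(1 for h, w in zip(humidity, wind_mph)
                            if h > 0.80 and w < 4.0)

      print(hot_days)         # 3 days exceeded 37 degrees C
      print(humid_calm_days)  # 4 days were humid and nearly windless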

    In the realm of bioinformatics, we like to think that we have moved beyond merely counting things and into a new realm of analysis unlocked by the genetic code. The sequence of nucleotides determines the function of a gene, not the quantity of each component nucleotide. Nonetheless, counting retains a position of paramount importance in molecular biology. When we find a gene pattern or motif of significance, we count how often it appears in the genome. When we observe a gene variant, we look to see how often it occurs in the population and whether it correlates with any specific biological feature (e.g., trait and disease). We have found that gene expression is best determined by counting the number of expressed (i.e., mRNA) and translated (i.e., protein) sequences. It's always the same story; just as soon as we discover anything of merit at the molecular level, we proceed to count what we've found.
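
    As a minimal sketch of such motif counting, the following Python snippet tallies the occurrences of a short motif within a DNA sequence, allowing overlapping matches; the sequence and motif are invented for illustration:

      def count_overlapping(sequence, motif):
          """Count occurrences of motif in sequence, allowing overlaps."""
          count, start = 0, 0
          while True:
              start = sequence.find(motif, start)
              if start == -1:
                  return count
              count += 1
              start += 1  # advance one base so overlapping hits are counted

      print(count_overlapping("ATATATGCATAT", "ATAT"))  # 3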

    Section 1.2 Never count on an accurate count

    Most people would agree that the simple act of counting data is something that can be done accurately and reproducibly from laboratory to laboratory. Actually, this is not the case. Counting is fraught with errors. Consider the problem of counting words in a paragraph. It seems straightforward until you start asking yourself how you might deal with hyphenated words. De-identified is certainly one word. Under-represented is probably one word, but sometimes the hyphen is replaced by a space, and then it is certainly two words. How about the term military-industrial, which seems as though it should be two words? When a hyphen occurs at the end of a line, should we force a concatenation between the syllables at the end of one line and the start of the next?

    Slashes are a tougher nut to crack than hyphens. How should we count terms that combine two related words by a slash, such as medical/pharmaceutical: one word or two? If we believe that the slash is a word separator (i.e., slashes mark the end of one word and the beginning of another), then we would need to parse Web addresses into individual words. [Glossary Parsing]

    For example:

    www.science.com/stuff/neat_stuff/super_neat_stuff/balloons.htm

    The Web address could be broken into a string of words if the . and _ characters could be considered valid word separators. In that case the single Web address would consist of 11 words: www, science, com, stuff, neat, stuff, super, neat, stuff, balloons, and htm. If you were only counting words that match entries in a standard dictionary, then the split Web address would contain eight words: science, stuff, neat, stuff, super, neat, stuff, and balloons. If we defined a word as a string bounded by a space or a part-of-sentence separator (e.g., period, comma, colon, semicolon, question mark, exclamation mark, and end-of-line character), then the unsplit Web address would count as one word. If the word must match a dictionary term, then the unsplit Web address would count as zero words. So, which is it: 11 words, 8 words, 1 word, or 0 words? [Glossary String]
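
    The four competing counts can be reproduced in a few lines of Python. Here is a minimal sketch of the rules just described, using a small toy dictionary of my own invention:

      import re

      address = "www.science.com/stuff/neat_stuff/super_neat_stuff/balloons.htm"
      dictionary = {"science", "stuff", "neat", "super", "balloons"}  # toy dictionary

      # Rule 1: '/', '.', and '_' are all word separators.
      tokens = [t for t in re.split(r"[./_]", address) if t]
      print(len(tokens))                                           # 11 words

      # Rule 2: as above, but a word must appear in the dictionary.
      print(len([t for t in tokens if t in dictionary]))           # 8 words

      # Rule 3: a word is any string bounded by whitespace.
      print(len(address.split()))                                  # 1 word

      # Rule 4: whitespace-bounded strings that match the dictionary.
      print(len([t for t in address.split() if t in dictionary]))  # 0 words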

    This is just the start of the problem. How shall we deal with abbreviations¹, ²? Should abbreviations be counted as one word or as the sum of words represented by the abbreviation? Is U.S. one word or two words? Suppose, before counting words, the text is preprocessed to expand abbreviations (i.e., every instance of U.S. becomes an instance of United States, and UCLA would count as four words). This would yield an artificial increase in the number of words in the document. How would the word counter deal with abbreviations that look like words, such as mumps, which could be the name of a viral disease of childhood or an abbreviation for a computer language used by medical informaticians, expanded as Massachusetts General Hospital Utility Multiprogramming System?

    How would we deal with numeric sequences appearing in the text? Should each numeric sequence be counted as a word? If not, how do we handle Roman numerals? Should IV be counted as a word, because it is composed of alphabetic characters, or should it be omitted, because it is equivalent to the numeric value 4? When we encounter IV, how can we be certain that we are parsing a Roman numeral? Could IV, within the context of our document, represent the abbreviation for intravenous?

    It is obvious that the number of words in a document will depend on the particular method used to count the words. If we use a commercial word counting application, how can we know which word counting rules are applied? In the field of informatics, the total number of words is an important feature of a document. The total word count often appears in the denominator of common statistical measurements. Counting words is a highly specialized task, but estimating words is incredibly simple. My favorite estimator of the number of words in any text file is simply the size of the file (in bytes) divided by 6.5, the average number of characters in a word plus one separator character. When must we count, and when can we estimate? The US 2020 census will cost an estimated $15.6 billion and will probably be received with incredulity. Could we do a better job using estimation algorithms, at a tiny fraction of the cost?
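
    Here is the byte-count estimator as a minimal Python sketch, alongside one possible actual count (whitespace-delimited strings); the filename is hypothetical:

      import os

      def estimate_word_count(filename):
          """Estimate the word count as file size in bytes divided by 6.5."""
          return int(os.path.getsize(filename) / 6.5)

      def actual_word_count(filename):
          """Count whitespace-delimited strings, one of many possible rules."""
          with open(filename, encoding="utf-8") as f:
              return len(f.read().split())

      # Compare the cheap estimate with an actual count:
      # print(estimate_word_count("manuscript.txt"), actual_word_count("manuscript.txt"))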

    The point here is that a simple counting task, such as word counting, can easily become complex. A complex counting task, involving subjective assessments of observations, seldom yields accurate results. When the criteria for counting change over time, then results that were merely inaccurate may devolve into irreproducibility. An example of a counting task that is complex and subjective is the counting of hits and errors in baseball. The rules for counting errors are subjective and based on the scorer’s judgment of the intended purpose of the hit (e.g., sacrifice fly) and the expected number of bases reached in the absence of the error. The determination of an error sometimes depends on the outcome of the play after the presumptive error has occurred (i.e., on events that are not controlled or influenced by the error). Counting is also complex, with rules covering specific instances of play. For example, passed balls and wild pitches are not scored as errors; they are assigned to another category of play. Plays involving catchers are exempted from certain rules for errors that apply to fielders. It would be difficult to find an example of a counting task that is more complex than counting baseball errors.

    Sometimes, counting criteria inadvertently exclude categories of items that should be counted. The diagnoses that appear on death certificates are chosen from a list of causes of death included in the International Classification of Diseases (ICD). Diagnoses collected from all of the death certificates issued in the United States are aggregated by the Centers for Disease Control and Prevention (CDC) and published in the National Vital Statistics Report.³ As it happens, medical error is not included as a cause of death in the ICD; hence, US casualties of medical errors are not counted as such in the official records. Official tally notwithstanding, it is estimated that about one of every six deaths in the United States results from medical error.³

    Data analytics is particularly vulnerable to counting errors, as data are typically collected from multiple sources, each with its own method for annotating data. In addition, data resources may extend forwards and backwards in time, constantly adding new data and merging with legacy data sets. The criteria for counting data may change over time, producing misleading results. Here are a few examples of counts that changed radically when the rules for counting changed. [Glossary Data annotation, Metaanalysis, New data]

    1. Suicides at Beachy Head

    Beachy Head is a cliff in England with a straight vertical drop and a beautiful sea view. It is a favorite jumping-off point for suicides. The suicide rate at Beachy Head dropped as sharply as the cliff when the medical examiner made a small policy change. Henceforth, bodies found at the cliff bottom would be counted as suicides only if their postmortem toxicology screen was negative for alcohol. Intoxicated subjects were pronounced dead by virtue of accident (i.e., not suicide).

    2. The number of US-Korean War deaths

    In the year 2000, nearly a half-century after the Korean War, the US Department of State downsized its long-standing count of US military war deaths to 36,616, down from an earlier figure of about 54,000. The drop of roughly 17,000 deaths resulted from the exclusion of US military deaths that occurred during the Korean War in countries outside Korea.⁵ The old number reflected deaths during the Korean War; the newer number reflects deaths that occurred due to the Korean War. Aside from its historical interest, the alteration indicates how collected counts may change retroactively.

    3. The number of chromosomes in a human nucleus

    Sometimes a count that is plainly wrong is repeated often enough to become credible. Once a number is written into the canon of accepted scientific facts, it becomes difficult to erase. Such a situation occurred when we began to count the number of human chromosomes, based on microscopic examination of chromosome spreads. In 1921 Theophilus Shickel Painter, a pioneer cytogeneticist, counted the number of chromosomes in a meiotic cell division of a spermatocyte. At the time, the technique for spreading and visualizing chromosomes was in its infancy. Painter’s best efforts led him to believe that there are 24 chromosomes in a haploid nucleus (or 48 chromosomes in a diploid nucleus). Other cytogeneticists tried their hand at counting chromosomes and confirmed Painter’s number. The official number of human chromosomes in a diploid cell remained at 48 for the ensuing 34 years. Then, in 1955, Joe Hin Tjio and Albert Levan decided to do a recount using more advanced chromosome-spreading techniques. They found a total of 46 chromosomes in diploid cells. Today, with high-resolution chromosome banding, there is simply no doubt that the number of chromosomes in a diploid human cell is 46, not 48. [Glossary Cytogeneticist]

    Even when we have an accurate count, the way that we express numbers can be confusing. The normal chromosome complement in human male diploid somatic cells is 46 XY, indicating that males normally have two sets of 23 chromosomes, in which one is an X chromosome and one is a Y chromosome. An uninitiated observer, seeing the usual 46 XY representation, might erroneously assume that males have 46 autosomal chromosomes plus two sex chromosomes (i.e., X + Y), producing a total complement of 48 chromosomes (46 + X + Y). Would it not make more sense to say that the normal male diploid karyotype is 44 XY, indicating that there are 44 autosomes plus one X chromosome plus one Y chromosome? (Fig. 1.1) [Glossary Karyotype, Translocation, X-chromosome, Y-chromosome]

    Fig. 1.1 The normal karyotype of a male human, consisting of chromosomes captured during mitosis, spread out on a glass slide, stained, photographed, paired, and sorted by size. At the bottom right of the image are the X and Y chromosomes.

    Theophilus Shickel Painter’s miscount of chromosomes occurred about a century ago. Surely, scientists have cleaned up their act since then! The Human Genome Project is a massive bioinformatics project in which multiple laboratories helped to sequence the three billion base pair haploid human genome. There are about two million species of proteins synthesized by human cells. If every protein had its own private gene containing its specific genetic code, then there would be about two million protein-coding genes contained in the human genome, and this number served as the earliest estimate for the number of protein-coding genes in the human genome. Over the years the estimated number of protein-coding genes fell, and estimates of 150,000 and 75,000 had their proponents. Based on an evaluation of segments of the genome that code for sequences that are translated into proteins, it turns out that we have somewhere in the vicinity of 20–25 thousand protein-coding genes (about 90-fold fewer than the first estimate of 2 million). Furthermore, when we study other living organisms, we find that humans are massive underachievers when it comes to our number of protein-coding genes. The humble rice grain has 46–56 thousand genes.⁶ [Glossary Human Genome Project]

    Why is there such a large discrepancy between the early gene estimates and our present-day analyses? Counting is difficult when we do not fully understand the object that we are counting. We
