Data Preparation and Exploration: Applied to Healthcare Data
By Robert Hoyt and Robert Muenchen
About this ebook
Prominent issues such as how to handle missing data and imbalanced datasets are covered along with sections on descriptive statistics, visualization, correlations, handling duplicates and outliers, scaling, standardization, and much more.
Chapters are as follows:
* The importance of Data Preparation and Exploration
* Data preparation
* Data exploration
* Automated data preparation and exploration
* Healthcare data resources
COPYRIGHT
Data Preparation and Exploration
Applied to Healthcare Data
By Robert Hoyt and Robert Muenchen
Copyright © November 2020 by Informatics Education
All rights reserved. No part of this book may be reproduced or transmitted in any form, by any means, electronic or mechanical, including photocopying, recording or by any information storage and retrieval system without written permission from the publisher, except for the inclusion of brief excerpts in connection with reviews or scholarly analysis.
Disclaimer
Every effort has been made to make this book as accurate as possible, but no warranty is implied. The information provided is on an "as is" basis. The authors and the publisher shall have neither liability nor responsibility to any person or entity with respect to any loss or damages arising from the information contained in this book. The views expressed in the book are those of the authors and do not necessarily reflect the official policy or position of any university or government.
eBook (EPUB): ISBN: 978-0-9887529-6-2
Print copy: ISBN: 978-0-9887529-7-9
eBook (pdf): ISBN: 978-0-9887529-0-0
PREFACE
Most data scientists spend the majority of their time locating appropriate clinical data, then preparing and exploring it for meaningful results. While some have referred to data science as "the sexiest job of the twenty-first century," the reality is that it involves much more than just creating a model with cutting-edge algorithms and programming languages.
Data preparation and exploration is like prep work before painting. There is sanding, disassembling, color selection, and priming before the final coat of paint is applied. Without proper data preparation and exploration, a user will likely encounter a "garbage in, garbage out" scenario.
We wrote this textbook because we felt there was not enough emphasis on this topic and only a few resources to select from. Most resources tend to focus on only one approach, such as applying a programming language to every problem. In this book, we use statistical software, spreadsheets, and programming languages to tackle data preparation and exploration problems. Also, we use healthcare datasets to make the scenarios more real-world and we added student exercises at the end of each chapter. We also added video tutorials in as many places as possible to provide additional resources in another format.
The field is moving toward automated machine learning that will expedite the process of data preparation and exploration. Despite this welcome advance, budding data scientists will still need to understand why and how these steps are taken.
There is a separate chapter on healthcare data resources to make the journey easier. The datasets are all publicly available and may derive from governmental and private organizations. Instructors and students are strongly urged to "get their feet wet" with as many data exercises as possible. It would be wise to develop a checklist of the normal steps of data preparation and exploration for every dataset you analyze.
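As one possible illustration of such a checklist, the following minimal Python sketch runs three routine checks (missing values, duplicate records, and descriptive statistics) on a small table. It is not the authors' method. The function name `run_checklist`, the field names, and the sample records are all hypothetical and were invented for this example. A real checklist would likely include more steps, such as outlier detection and scaling.

```python
from collections import Counter
from statistics import mean, median


def run_checklist(rows, numeric_field):
    """Run a few routine preparation checks on a list of dict records."""
    # Step 1. Missing data: count None or empty-string values per column.
    missing = Counter()
    for row in rows:
        for key, value in row.items():
            if value is None or value == "":
                missing[key] += 1

    # Step 2. Duplicates: count records identical to an earlier record.
    seen = set()
    duplicates = 0
    for row in rows:
        fingerprint = tuple(sorted(row.items(), key=lambda item: item[0]))
        if fingerprint in seen:
            duplicates += 1
        else:
            seen.add(fingerprint)

    # Step 3. Descriptive statistics for one numeric column,
    # skipping missing values.
    values = [
        row[numeric_field]
        for row in rows
        if row.get(numeric_field) not in (None, "")
    ]
    summary = None
    if values:
        summary = {
            "n": len(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
            "median": median(values),
        }

    return {"missing": dict(missing), "duplicates": duplicates, "summary": summary}


# Hypothetical sample records (invented for illustration only).
sample_rows = [
    {"patient_id": 1, "age": 54, "sex": "F"},
    {"patient_id": 2, "age": None, "sex": "M"},
    {"patient_id": 3, "age": 61, "sex": ""},
    {"patient_id": 3, "age": 61, "sex": ""},  # exact duplicate of the row above
]

report = run_checklist(sample_rows, "age")
print(report)
```

Keeping each check as a separate numbered step makes it easy to add, remove, or reorder steps as the checklist grows with each new dataset.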
More textbook details are available on https://informaticseducation.org
Robert Hoyt MD FACP FAMIA
Robert Muenchen MS PSA
ABOUT THE AUTHORS
Robert E. Hoyt, MD, FACP, FAMIA, is an internal medicine physician who was in private practice for 15 years and served as a physician in the military for 20 years. During this time, he taught health informatics for 13 years at the University of West Florida. He has been involved in health informatics for the past two decades, but in the last five years he has focused primarily on biomedical data science, with emphasis on machine learning and artificial intelligence. He is a co-author and co-editor of Health Informatics: Practical Guide that is in its seventh edition. Additionally, he is the co-editor and co-author of the Introduction to Biomedical Data Science with Robert Muenchen that launched in 2019.
Robert A. Muenchen, MS, PSA is the author of the BlueSky Statistics 7.1 User Guide, R for SAS and SPSS Users, and coauthor of R for Stata Users and Introduction to Biomedical Data Science. An ASA Accredited Professional Statistician, Bob wrote or co-authored over 70 articles published in scientific journals and conference proceedings. At The University of Tennessee, he guided more than 1,000 graduate theses and dissertations and he continues to teach R workshops there.
ACKNOWLEDGMENTS
We would like to thank Ann Yoshihashi MD FACE for textbook formatting and proofreading.
We would also like to thank Karen Monsen PhD RN FAMIA FAAN and David Hurwitz MD FACP for their help reviewing the textbook.
1
THE IMPORTANCE OF DATA PREPARATION AND EXPLORATION
Robert Hoyt MD, Robert Muenchen
"Data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others." – Mike Loukides, editor, O’Reilly Media.
LEARNING OBJECTIVES
After reading the chapter the reader should be able to:
Introduction
Because data scientists and others spend so much time with data preparation and exploration, we believe a separate textbook is warranted and we now offer it in addition to our other textbook Introduction to Biomedical Data Science. (1) Data preparation and exploration occur early in the data science process, as data scientists prepare their data before modeling.
The data science process (as adapted from Blitzstein and Pfister) includes multiple steps, as displayed in figure 1.1 below. (2) This chapter will focus on the first four steps, specifically asking the right question, getting the data, cleaning, visualizing, and exploring the data.
Figure 1.1 The data science process (adapted from Blitzstein and Pfister)
The majority of a data scientist's time is spent on the first four steps of the data science process. Take note of the number of bi-directional arrows between the boxes and the single arrow on the left that returns from deploying the model back to the beginning. Starting over happens every time a variable, metric, or feature is added to the dataset. The return arrow also highlights the possibility that the model created was a poor performer and needs to be adjusted, in which case the process starts over. The entire process is iterative, not linear. Domain (clinical) expertise is critical to help sort out what is important