Learn R for Applied Statistics: With Data Visualizations, Regressions, and Statistics
About this ebook
Learn R for Applied Statistics is a timely skills-migration book that equips you with R programming fundamentals and introduces you to applied statistics for data exploration.
What You Will Learn
- Discover R, statistics, data science, data mining, and big data
- Master the fundamentals of R programming, including variables and arithmetic, vectors, lists, data frames, conditional statements, loops, and functions
- Work with descriptive statistics
- Create data visualizations, including bar charts, line charts, scatter plots, boxplots, and histograms
- Use inferential statistics including t-tests, chi-square tests, ANOVA, non-parametric tests, linear regressions, and multiple linear regressions
Who This Book Is For
Those who are interested in data science, in particular data exploration using applied statistics, and the use of R programming for data visualizations.
Book preview
Learn R for Applied Statistics - Eric Goh Ming Hui
© Eric Goh Ming Hui 2019
Eric Goh Ming Hui, Learn R for Applied Statistics, https://doi.org/10.1007/978-1-4842-4200-1_1
1. Introduction
Eric Goh Ming Hui, Singapore, Singapore
In this book, you will use R for applied statistics, which can be used in the data understanding and modeling stages of the CRISP-DM (data mining) model. Data mining is the process of extracting insights and knowledge from data. R was created for statistics and is widely used in academic and research fields. It has evolved over time, and many packages have been created for data mining, text mining, and data visualization tasks. R is very mature in the statistics field, so it is ideal for the data exploration, data understanding, and modeling stages of the CRISP-DM model.
What Is R?
According to Wikipedia, R is a programming language for statistical computing and is supported by the R Foundation for Statistical Computing. R is used by academics and researchers for data analysis and statistical analysis, and its popularity has risen over time. As of June 2018, R is ranked 10th in the TIOBE index. The TIOBE Company created and maintains the TIOBE programming community index, which is a measure of the popularity of programming languages. TIOBE is an acronym for The Importance of Being Earnest.
R is a GNU package and is available freely under the GNU General Public License. This means that R is available with source code, and you are free to use R, but you must adhere to the license. R is available in the command line, but there are many integrated development environments (IDEs) available for R. An IDE is software that has comprehensive facilities like a code editor, compiler, and debugger tools to help developers write R scripts. One famous IDE is RStudio, which assists developers in writing R scripts by providing all the required tools in one software package.
R is an implementation of the S programming language. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland. R and its libraries provide statistical and graphical techniques, including descriptive statistics, inferential statistics, and regression analysis. Another strength of R is that it can produce publication-quality graphs and charts, and can use packages like ggplot2 for advanced graphs.
According to the CRISP-DM model, to do a data mining project, you must understand the business, and then understand and prepare the data. Then comes modeling and evaluation, and then deployment. R is strong in statistics and data visualization, so it is ideal to use R for data understanding and modeling.
Along with Python, R is used widely in the field of data science, which consists of statistics, machine learning, and domain expertise or knowledge.
High-Level and Low-Level Languages
A high-level programming language (HLL) is designed to be used by a human and is closer to the human language. Its programming style is easier to comprehend and implement than a lower-level programming language (LLL). A high-level programming language needs to be converted to machine language before being executed, so a high-level programming language can be slower.
A low-level programming language, on the other hand, is much closer to machine language. Machine code runs directly on a computer, and assembly language needs only a simple translation step before execution. Thus, a low-level programming language can be faster than a high-level programming language. Low-level programming languages like assembly language are oriented towards machine language, which deals with the bits 0 and 1.
R is an HLL because its code resembles human language. For example, consider the following R code:
> var1 <- 1
> var2 <- 2
> result <- var1 + var2
> print(result)
[1] 3
This R code reads much like human language. A low-level programming language like assembly language is much closer to machine language, which consists of bit patterns like 0011 0110:
0x52ac87: movl 7303445(%ebx), %eax
0x52ac78: calll 0x6bfb03
What Is Statistics?
Statistics is a branch of mathematics that deals with the organization, analysis, and interpretation of data. Three main statistical methods are used in data analysis: descriptive statistics, inferential statistics, and regression analysis.
Descriptive statistics summarizes the data and usually focuses on the distribution, the central tendency, and the dispersion of the data. The distribution can be, for example, a normal distribution or a binomial distribution. The central tendency describes the data with respect to its center and can be measured by the mean, median, or mode. The dispersion describes the spread of the data and can be measured by the variance, standard deviation, or interquartile range.
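The measures named above can be sketched in a few lines of base R. This is a minimal illustration rather than the book's own example: the `scores` vector is invented data, and `stat_mode` is a hypothetical helper, since base R has no built-in function for the statistical mode.

```r
# Hypothetical sample: exam scores invented for illustration.
scores <- c(62, 70, 70, 75, 81, 88, 94)

# Central tendency
mean_score   <- mean(scores)
median_score <- median(scores)

# Base R has no built-in statistical mode, so tabulate the values
# and take the most frequent one (stat_mode is a hypothetical helper).
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}
mode_score <- stat_mode(scores)

# Dispersion
var_score <- var(scores)  # sample variance (n - 1 denominator)
sd_score  <- sd(scores)   # square root of the sample variance
iqr_score <- IQR(scores)  # third quartile minus first quartile

print(c(mean = mean_score, median = median_score, mode = mode_score))
print(c(variance = var_score, sd = sd_score, IQR = iqr_score))
```

Note that `var()` and `sd()` compute the sample versions of these statistics, and that `stat_mode` returns only the first value when several values tie for most frequent.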
Inferential statistics tests the relationship between two data sets or two samples, and a hypothesis is usually set for the statistical relationship between them. The hypothesis can be a null hypothesis or an alternative hypothesis, and the decision to reject the null hypothesis is made using tests like the t-test, the chi-square test, and ANOVA. The chi-square test is mainly for categorical variables, and the t-test is mainly for continuous variables. ANOVA is for more complex applications, such as comparing more than two groups.
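As a rough sketch of how these tests look in base R, the following runs a t-test on a continuous variable and a chi-square test on categorical counts. All data here is invented for illustration, and the 0.05 significance level is an assumed convention, not something fixed by the text.

```r
# Hypothetical data: a continuous measurement for two independent groups.
group_a <- c(72, 75, 78, 80, 83, 85, 88)
group_b <- c(65, 68, 70, 73, 74, 77, 79)

# Two-sample t-test. The null hypothesis is that both groups share the
# same mean. By default t.test() uses Welch's version, which does not
# assume equal variances.
t_result <- t.test(group_a, group_b)

# Hypothetical 2x2 table of counts for two categorical variables.
purchases <- matrix(
  c(30, 20,
    15, 35),
  nrow = 2, byrow = TRUE,
  dimnames = list(group = c("A", "B"), bought = c("yes", "no"))
)

# Chi-square test of independence. The null hypothesis is that the
# row and column variables are independent.
chi_result <- chisq.test(purchases)

# Assumed significance level; reject the null hypothesis when p < alpha.
alpha <- 0.05
print(t_result$p.value < alpha)
print(chi_result$p.value < alpha)
```

Both functions return a list-like object, so the test statistic, degrees of freedom, and p-value can be read from `$statistic`, `$parameter`, and `$p.value`.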
Regression analysis is used to identify the relationships between variables. Regressions can be linear or non-linear. A linear regression can be a simple linear regression, with one input variable, or a multiple linear regression, which identifies relationships among more variables.
Data visualization is the technique used to communicate or present data using graphs, charts, and dashboards. Data visualizations can help us understand the data more easily.
What Is Data Science?
Data science is a multidisciplinary field that combines statistics, computer science, machine learning, and domain expertise to get knowledge and insights from data. Data science usually ends up developing a data product. A data product turns a company's data into a product that solves a problem.
For example, a data product can be the product recommendation system used in Amazon and Lazada. These companies have a lot of data based on shoppers’ purchases. Using this data, Amazon and Lazada can identify the shopping patterns of shoppers and create a recommendation system or data product to recommend other products whenever a shopper buys a product.
The term "data science" has become a buzzword and is now used to represent many areas like data analytics, data mining, text mining, data visualizations, prediction modeling, and so on.
The history of data science started in November 1997, when C. F. Jeff Wu characterized statistical work as data collection, analysis, and decision making, and presented his lecture called "Statistics = Data Science?"
In 2001, William S. Cleveland introduced data science as a field that comprised statistics and some computing in his article called "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics."
DJ Patil, who claims to have coined the term "data science" with Jeff Hammerbacher and who wrote the "Data Scientist: The Sexiest Job of the 21st Century" article published in the Harvard Business Review, says that there is a data scientist shortage in many industries, and data science is important in many companies because data analysis can help companies make many decisions. Every company needs to make decisions in strategic directions.
Statistics is important in data science because it can help analysts or data scientists analyze and understand data. Descriptive statistics assists in summarizing the data, inferential statistics tests the relationship between two data sets or samples, and regression analysis explores the relationships between multiple variables. Data visualizations can explore the data with charts, graphs, and dashboards. Regressions and machine learning algorithms can be used in predictive analytics to train a model and predict a variable.
Simple linear regression has the formula y = mx + c, where y is the output variable, x is the input variable, m is the slope, and c is the intercept. You use historical data to fit the formula, that is, to estimate m and c. Machine learning algorithms and regression (statistical learning) algorithms predict a variable in a similar way.
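Fitting y = mx + c can be sketched with base R's `lm()` function. The data below is invented for illustration (it lies close to y = 2x + 1), and the variable names `m_hat` and `c_hat` are assumptions chosen to match the formula's symbols.

```r
# Hypothetical historical data, roughly following y = 2x + 1.
x <- c(1, 2, 3, 4, 5)
y <- c(3.1, 4.9, 7.2, 8.8, 11.0)

# Fit y = m*x + c by ordinary least squares.
model <- lm(y ~ x)

# Extract the estimated intercept (c) and slope (m).
c_hat <- coef(model)[["(Intercept)"]]
m_hat <- coef(model)[["x"]]
print(c(m = m_hat, c = c_hat))  # m is about 1.97, c is about 1.09

# Use the trained model to predict y for a new input value.
new_data <- data.frame(x = 6)
prediction <- predict(model, newdata = new_data)
print(prediction)  # about 12.91, which equals m_hat * 6 + c_hat
```

The column name in `new_data` must match the input variable name used in the model formula, otherwise `predict()` cannot find the input.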
Domain expertise is the knowledge of the data set. If the data set is business data, then the domain expertise should be business; if it is university data, education is the domain expertise; if the data set is healthcare data, healthcare is the domain knowledge. I believe that business is the most important knowledge because almost all companies use data analysis to make important strategic business decisions.
Adding in product design and engineering knowledge takes us into the fields of Internet of Things (IoT) and smart cities because data science and predictive analytics can be used on sensor data. Because data science is a multidisciplinary field, if you can master statistics, machine learning, and business knowledge, you will be extremely hard to replace. You can also work with statisticians, machine learning engineers, or business experts to complete a data science project.
Figure 1-1 shows a data science diagram.
Figure 1-1. Data science is an intersection
What Is Data Mining?
Data mining is closely related to data science. Data mining is the process of identifying the patterns from data using statistics, machine learning, and data warehouses or databases.
Extraction of patterns from data is not new, and early methods include the use of Bayes' theorem and regressions. The growth of technology has increased the ability to collect data. It also allows the use of statistical learning and machine learning algorithms like neural networks, fuzzy logic, decision trees, genetic algorithms, and support vector machines to uncover the hidden patterns in data. Data mining combines statistics and machine learning, and usually results in the creation of models for making predictions based on historical data.
The cross-industry standard process for data mining, also known as CRISP-DM, is a process used by data mining experts, and it is one of the most popular data mining models. See Figure 1-2.
Figure 1-2. Cross-industry standard process for data mining
The CRISP-DM model was created in 1996