Learn R for Applied Statistics: With Data Visualizations, Regressions, and Statistics
Ebook, 300 pages

About this ebook

Gain the R programming language fundamentals for doing the applied statistics useful for data exploration and analysis in data science and data mining. This book covers topics ranging from R syntax basics, descriptive statistics, and data visualizations to inferential statistics and regressions. After learning R’s syntax, you will work through data visualizations such as histograms and boxplots, descriptive statistics, and inferential statistics such as the t-test, the chi-square test, ANOVA, non-parametric tests, and linear regressions.
Learn R for Applied Statistics is a timely skills-migration book that equips you with the R programming fundamentals and introduces you to applied statistics for data explorations. 
What You Will Learn
  • Discover R, statistics, data science, data mining, and big data
  • Master the fundamentals of R programming, including variables and arithmetic, vectors, lists, data frames, conditional statements, loops, and functions
  • Work with descriptive statistics 
  • Create data visualizations, including bar charts, line charts, scatter plots, boxplots, and histograms
  • Use inferential statistics including t-tests, chi-square tests, ANOVA, non-parametric tests, linear regressions, and multiple linear regressions

Who This Book Is For
Those who are interested in data science, in particular data exploration using applied statistics, and the use of R programming for data visualizations.  
Language: English
Publisher: Apress
Release date: Nov 30, 2018
ISBN: 9781484242001


    Book preview

    Learn R for Applied Statistics - Eric Goh Ming Hui

    ©  Eric Goh Ming Hui 2019

    Eric Goh Ming Hui, Learn R for Applied Statistics, https://doi.org/10.1007/978-1-4842-4200-1_1

    1. Introduction

    Eric Goh Ming Hui (Singapore, Singapore)

    In this book, you will use R for applied statistics, which is useful in the data understanding and modeling stages of the CRISP-DM (cross-industry standard process for data mining) model. Data mining is the process of extracting insights and knowledge from data. R was created for statistics and is widely used in academic and research fields. R has evolved over time, and many packages have been created for data mining, text mining, and data visualization tasks. Because R is very mature in the statistics field, it is ideal for the data exploration, data understanding, and modeling stages of the CRISP-DM model.

    What Is R?

    According to Wikipedia, R is a programming language for statistical computing, supported by the R Foundation for Statistical Computing. R is used by academics and researchers for data analysis and statistical analysis, and its popularity has risen over time. As of June 2018, R is ranked 10th in the TIOBE index, a measure of the popularity of programming languages created and maintained by the TIOBE Company. TIOBE is an acronym for The Importance of Being Earnest.

    R is a GNU package and is available freely under the GNU General Public License. This means that R comes with its source code, and you are free to use R as long as you adhere to the license. R can be used from the command line, but many integrated development environments (IDEs) are available for it. An IDE is software that provides comprehensive facilities such as a code editor, compiler, and debugging tools to help developers write scripts. One popular IDE is RStudio, which assists developers in writing R scripts by providing all the required tools in one package.

    R is an implementation of the S programming language and was created by Ross Ihaka and Robert Gentleman at the University of Auckland. R and its libraries implement a wide range of statistical and graphical techniques, including descriptive statistics, inferential statistics, and regression analysis. Another strength of R is its ability to produce publication-quality graphs and charts, with packages like ggplot2 for advanced graphics.

    According to the CRISP-DM model, a data mining project begins with understanding the business and then understanding and preparing the data. Then come modeling and evaluation, and finally deployment. R is strong in statistics and data visualization, so it is ideal for the data understanding and modeling stages.

    Along with Python, R is used widely in the field of data science, which consists of statistics, machine learning, and domain expertise or knowledge.

    High-Level and Low-Level Languages

    A high-level programming language (HLL) is designed to be used by humans and is closer to human language. Its programming style is easier to comprehend and implement than that of a low-level programming language (LLL). A high-level language must be translated into machine language before being executed, so it can be slower.

    A low-level programming language, on the other hand, is much closer to the machine. It can be executed directly on the computer without translation between languages, so it can be faster than a high-level language. Low-level languages such as assembly language are close to machine language, which deals in bits, 0s and 1s.

    R is an HLL because it shares many similarities with human languages. For example, consider this R code:

    > var1 <- 1;

    > var2 <- 2;

    >

    > result <- var1 + var2;

    > print(result)

     [1] 3

    >

    The R code reads much like human language. A low-level language such as assembly is much closer to machine language, like 0011 0110:

    0x52ac87:      movl    7303445(%ebx), %eax

    0x52ac78:      calll         0x6bfb03

    What Is Statistics?

    Statistics is a branch of mathematics dealing with the organization, analysis, and interpretation of data. Three main statistical methods are used in data analysis: descriptive statistics, inferential statistics, and regression analysis.

    Descriptive statistics summarizes the data, usually focusing on the distribution, the central tendency, and the dispersion of the data. The distribution can be, for example, a normal or binomial distribution. The central tendency describes the data with respect to its center and can be the mean, median, or mode. The dispersion describes the spread of the data and can be the variance, standard deviation, or interquartile range.
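    As a quick sketch, base R computes each of these measures directly. The sample vector here is made up for illustration:

```r
# A small, hypothetical sample
x <- c(2, 4, 4, 5, 7, 9, 9, 9, 12)

mean(x)    # central tendency: arithmetic mean
median(x)  # central tendency: middle value
var(x)     # dispersion: variance
sd(x)      # dispersion: standard deviation
IQR(x)     # dispersion: interquartile range
table(x)   # frequency counts; the mode is the most frequent value (9 here)
```

    Base R has no built-in mode function, which is why the last line uses a frequency table instead.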

    Inferential statistics tests the relationship between two data sets or two samples, and a hypothesis is usually set up for the statistical relationship between them. The hypothesis can be a null hypothesis or an alternative hypothesis, and rejecting the null hypothesis is done using tests such as the t-test, the chi-square test, and ANOVA. The chi-square test is used for categorical variables, the t-test for continuous variables, and ANOVA for comparing means across more than two groups.
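    A minimal sketch with made-up samples shows how these tests are called in base R (t.test, chisq.test, and aov are all standard functions):

```r
# Two hypothetical samples of a continuous measurement
a <- c(5.1, 4.9, 5.6, 5.3, 5.0)
b <- c(6.2, 5.8, 6.0, 6.4, 5.9)

# t-test: do the two samples have different means?
t.test(a, b)

# chi-square test on a 2x2 contingency table of counts
counts <- matrix(c(20, 30, 25, 25), nrow = 2)
chisq.test(counts)

# one-way ANOVA: compare means across three groups
values <- c(a, b, 7.0, 7.2, 6.8, 7.1, 6.9)
groups <- factor(rep(c("A", "B", "C"), each = 5))
summary(aov(values ~ groups))
```

    Each test prints a p-value; a small p-value (commonly below 0.05) is grounds for rejecting the null hypothesis.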

    Regression analysis is used to identify the relationship between variables. Regressions can be linear or non-linear, and a linear regression can be a simple linear regression with one input variable or a multiple linear regression involving several input variables.
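    Both forms use R's lm() function. Here is a small sketch on made-up data:

```r
# Hypothetical data where y roughly follows 2*x + 1
x <- 1:10
y <- c(3.1, 4.9, 7.2, 9.1, 10.8, 13.2, 15.1, 16.8, 19.2, 21.1)

fit <- lm(y ~ x)   # simple linear regression
summary(fit)       # coefficients, R-squared, p-values

# Multiple linear regression: add a second (made-up) predictor
z <- rep(c(1, 0), 5)
fit2 <- lm(y ~ x + z)
coef(fit2)
```

    The formula notation y ~ x + z is how R expresses "model y as a function of x and z."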

    Data visualization is the technique used to communicate or present data using graphs, charts, and dashboards. Data visualizations can help us understand the data more easily.
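    Base R covers the common chart types out of the box. A quick sketch using a random, made-up sample:

```r
set.seed(42)                         # make the random sample reproducible
d <- rnorm(100, mean = 50, sd = 10)  # hypothetical sample

hist(d, main = "Histogram")          # distribution shape
boxplot(d, main = "Boxplot")         # median, quartiles, outliers
plot(1:10, (1:10)^2, type = "l",
     main = "Line chart")            # trend over an ordered variable
barplot(c(A = 3, B = 7, C = 5),
        main = "Bar chart")          # values per category
```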

    What Is Data Science?

    Data science is a multidisciplinary field that combines statistics, computer science, machine learning, and domain expertise to extract knowledge and insights from data. Data science usually culminates in a data product: a company's data turned into a product that solves a problem.

    For example, a data product can be the product recommendation system used in Amazon and Lazada. These companies have a lot of data based on shoppers’ purchases. Using this data, Amazon and Lazada can identify the shopping patterns of shoppers and create a recommendation system or data product to recommend other products whenever a shopper buys a product.

    The term data science has become a buzzword and is now used to represent many areas like data analytics, data mining, text mining, data visualizations, prediction modeling, and so on.

    The history of data science started in November 1997, when C. F. Jeff Wu characterized statistical work as data collection, analysis, and decision making, and presented his lecture called Statistics = Data Science? In 2001, William S. Cleveland introduced data science as a field that comprised statistics and some computing in his article called Data Science: An Action Plan for Expanding the Technical Area of the Field of Statistics.

    DJ Patil, who claims to have coined the term data science with Jeff Hammerbacher and who wrote the Data Scientist: The Sexiest Job of the 21st Century article published in the Harvard Business Review, says that there is a data scientist shortage in many industries, and data science is important in many companies because data analysis can help companies make many decisions. Every company needs to make decisions in strategic directions.

    Statistics is important in data science because it can help analysts or data scientists analyze and understand data. Descriptive statistics assists in summarizing the data, inferential statistics tests the relationship between two data sets or samples, and regression analysis explores the relationships between multiple variables. Data visualizations can explore the data with charts, graphs, and dashboards. Regressions and machine learning algorithms can be used in predictive analytics to train a model and predict a variable.

    Linear regression has the formula y = mx + c, where y is the output variable and x is the input variable. You use historical data to train the formula, that is, to estimate m and c. Machine learning algorithms and regression (statistical learning) algorithms predict a variable using this kind of approach.
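    In R, fitting this formula to made-up historical data recovers m and c as the coefficients of lm():

```r
# Hypothetical historical data, constructed to follow y = 2x + 1 plus noise
x <- c(1, 2, 3, 4, 5)
y <- c(3.1, 5.0, 6.9, 9.2, 11.0)

fit <- lm(y ~ x)
coef(fit)  # (Intercept) is c (about 1.04), x is m (about 2.00)

# Predict y for a new x using the trained formula
predict(fit, data.frame(x = 6))
```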

    Domain expertise is the knowledge of the data set. If the data set is business data, then the domain expertise should be business; if it is university data, education is the domain expertise; if the data set is healthcare data, healthcare is the domain knowledge. I believe that business is the most important knowledge because almost all companies use data analysis to make important strategic business decisions.

    Adding in product design and engineering knowledge takes us into the fields of the Internet of Things (IoT) and smart cities, because data science and predictive analytics can be applied to sensor data. Because data science is a multidisciplinary field, if you can master statistics, machine learning, and business knowledge, you are extremely hard to replace. You can also work with statisticians, machine learning engineers, or business experts to complete a data science project.

    Figure 1-1 shows a data science diagram.


    Figure 1-1

    Data science is an intersection

    What Is Data Mining?

    Data mining is closely related to data science. It is the process of identifying patterns in data using statistics, machine learning, and data warehouses or databases.

    Extracting patterns from data is not new; early methods include Bayes' theorem and regression. Advances in technology have increased our capacity for data collection and enable statistical learning and machine learning algorithms such as neural networks, fuzzy logic, decision trees, genetic algorithms, and support vector machines to uncover hidden patterns in data. Data mining combines statistics and machine learning, and usually results in models that make predictions based on historical data.

    The cross-industry standard process for data mining, also known as CRISP-DM, is a process used by data mining experts and is one of the most popular data mining models. See Figure 1-2.


    Figure 1-2

    Cross-industry standard process for data mining

    The CRISP-DM model was created in 1996.
