A General Introduction to Data Analytics

About this ebook

A guide to the principles and methods of data analysis that does not require knowledge of statistics or programming

A General Introduction to Data Analytics is an essential guide to understanding and using data analytics. The book is written in easy-to-understand terms and does not require familiarity with statistics or programming. The authors, noted experts in the field, focus on explaining the intuition behind the basic data analytics techniques. The text also contains exercises and illustrative examples.

Designed to be accessible to non-experts, the book motivates the need to analyze data. It explains how to visualize and summarize data, and how to find natural groups and frequent patterns in a dataset. The book also explores predictive tasks, whether classification or regression. Finally, the book discusses popular data analytics applications, such as mining the web, information retrieval, social network analysis, working with text, and recommender systems. The learning resources offer:

  • A guide to the reasoning behind data mining techniques
  • A unique illustrative example that extends throughout all the chapters
  • Exercises at the end of each chapter and larger projects at the end of each of the text’s two main parts

Together with these learning resources, the book can serve as the guide for a 13-week course, with one chapter per course topic.

The book is written in a format that allows non-mathematicians, non-statisticians and non-computer scientists interested in an introduction to data science to understand the main data analytics concepts. A General Introduction to Data Analytics is a basic guide to data analytics written in highly accessible terms.

Language: English
Release date: June 25, 2018
ISBN: 9781119296263


    Book preview

    A General Introduction to Data Analytics - João Moreira

    Preface

    We are living in a period of history that will certainly be remembered as one where information began to be instantaneously obtainable, services were tailored to individual criteria, and people did what made them feel good (if it did not put their lives at risk). Every year, machines are able to do more and more things that improve our quality of life. More data is available than ever before, and will become even more so. This is a time when we can extract more information from data than ever before, and benefit more from it.

    In different areas of business and in different institutions, new ways to collect data are continuously being created. Old documents are being digitized, new sensors count the number of cars passing along motorways and extract useful information from them, our smartphones are informing us where we are at each moment and what new opportunities are available, and our favorite social networks register to whom we are related or what things we like.

    Whatever area we work in, new data are available: data on how students evaluate professors; data on the evolution of diseases and the best treatment options per patient; data on soil, humidity levels and the weather, enabling us to produce more food of better quality; data on the macro economy, our investments and stock market indicators over time, enabling a fairer distribution of wealth; and data on the things we purchase, allowing us to purchase more effectively and at lower cost.

    Students in many different domains feel the need to take advantage of the data they have. New courses on data analytics have been proposed in many different programs, from biology to information science, from engineering to economics, from social sciences to agronomy, all over the world.

    The first books on data analytics that appeared some years ago were written by data scientists for other data scientists or for data science students. At the time, the majority of the people interested in these subjects were computing and statistics students, and the books were written mainly for them. Nowadays, more and more people are interested in learning data analytics. Students of economics, management, biology, medicine, sociology, engineering, and several other subjects are willing to learn about data analytics. This book intends not only to provide a new, more friendly textbook for computing and statistics students, but also to open data analytics to those students who may know nothing about computing or statistics, but want to learn these subjects in a simple way. Those who have already studied subjects such as statistics will recognize some of the content described in this book, such as descriptive statistics. Students from computing will be familiar with pseudocode.

    After reading this book, it is not expected that you will feel like a data scientist with the ability to create new methods, but it is expected that you will feel like a data analytics practitioner, able to drive a data analytics project, using the right methods to solve real problems.

    João Mendes Moreira

    University of Porto, Porto, Portugal

    André C. P. L. F. de Carvalho

    University of São Paulo, São Carlos, Brazil

    Tomáš Horváth

    Eötvös Loránd University in Budapest

    Pavol Jozef Šafárik University in Košice

    October, 2017

    Acknowledgments

    The authors would like to thank Bruno Almeida Pimentel, Edésio Alcobaça Neto, Everlândio Fernandes, Victor Alexandre Padilha and Victor Hugo Barella for their useful comments.

    Over the last several months, we have been in contact with several people from Wiley: Jon Gurstelle, Executive Editor on Statistics; Kathleen Pagliaro, Assistant Editor; Samantha Katherine Clarke and Kshitija Iyer, Project Editors; and Katrina Maceda, Production Editor. To all these wonderful people, we owe a deep sense of gratitude, especially now that this project has been completed.

    Lastly, we would like to thank our families for their constant love, support, patience, and encouragement.

    J. A. T.

    Presentational Conventions

    Definition The definitions are presented in the format shown here.

    Special sections and formats Whenever a method is described, three different sections are presented:

    Assessing and evaluating results: how can we assess the results of a method? How should they be interpreted? This section is all about answering these questions.

    Setting the hyper‐parameters: each method has its own hyper‐parameters that must be set. This section explains how to set them.

    Advantages and disadvantages: a table summarizes the positive and negative characteristics of a given method.

    About the Companion Website

    This book is accompanied by a companion website:

    www.wiley.com/go/moreira/dataanalytics

    The website includes:

    Presentation slides for instructors

    Part I

    Introductory Background

    1

    What Can We Do With Data?

    Until recently, researchers working with data analysis were struggling to obtain data for their experiments. Recent advances in the technology of data processing, data storage and data transmission, associated with advanced and intelligent computer software, reducing costs and increasing capacity, have changed this scenario. This is the time of the Internet of Things, where the aim is to have everything, or almost everything, connected. Data previously produced on paper are now on‐line. Each day, a larger quantity of data is generated and consumed. Whenever you post a comment on your social network, upload a photograph, a piece of music or a video, browse the Internet, or add a comment to an e‐commerce web site, you are contributing to this increase in data. Additionally, machines, financial transactions and sensors such as security cameras are increasingly gathering data from very diverse and widespread sources.

    In 2012, it was estimated that the amount of data available in the world doubles each year [1]. Another estimate, from 2014, predicted that by 2020 all information would be digitized, eliminated or reinvented in 80% of the processes and products of the previous decade [2]. A third report, from 2015, predicted that mobile data traffic would be almost 10 times larger by 2020 [3]. The result of all these rapid increases in data is what some call the data explosion.

    Despite the impression that this can give – that we are drowning in data – there are several benefits from having access to all these data. These data provide a rich source of information that can be transformed into new, useful, valid and human‐understandable knowledge. Thus, there is a growing interest in exploring these data to extract this knowledge, using it to support decision making in a wide variety of fields: agriculture, commerce, education, environment, finance, government, industry, medicine, transport and social care. Several companies around the world are realizing the gold mine they have and the potential of these data to support their work, reduce waste and dangerous and tedious work activities, and increase the value of their products and their profits.

    The analysis of these data to extract such knowledge is the subject of a vibrant area known as data analytics, or simply analytics. You can find several definitions of analytics in the literature. The definition adopted here is:

    Analytics

    The science that analyzes raw data to extract useful knowledge (patterns) from them.

    This process can also include data collection, organization, pre‐processing, transformation, modeling and interpretation.

    Analytics as a knowledge area involves input from many different areas. The idea of generalizing knowledge from a data sample comes from a branch of statistics known as inductive learning, an area of research with a long history. With the advance of personal computers, the use of computational resources to solve inductive learning problems became more and more popular. Computational capacity has been used to develop new methods. At the same time, new problems have appeared that require a good knowledge of computer science. For instance, the ability to perform a given task with more computational efficiency has become a subject of study for people working in computational statistics.

    In parallel, several researchers have dreamed of being able to reproduce human behavior using computers. These were people from the area of artificial intelligence. They also used statistics in their research, but the idea of reproducing human and biological behavior in computers was an important source of motivation. For instance, reproducing how the human brain works with artificial neural networks has been studied since the 1940s; reproducing how ants work with ant colony optimization algorithms, since the 1990s. The term machine learning (ML) appeared in this context as "the field of study that gives computers the ability to learn without being explicitly programmed", according to Arthur Samuel in 1959 [4].

    In the 1990s, a new term appeared with a slightly different meaning: data mining (DM). The 1990s was the decade when business intelligence tools appeared, as a consequence of data storage facilities with larger and cheaper capacity. Companies started to collect more and more data, aiming to either solve or improve business operations, for example by detecting credit card fraud, by advising the public of road network constraints in cities, or by improving relations with clients using more efficient techniques of relational marketing. The question was how to mine the data in order to extract the knowledge necessary for a given task. This is the goal of data mining.

    1.1 Big Data and Data Science

    In the first years of the 21st century, the term big data appeared. Big data, a technology for data processing, was initially defined by the three Vs, although some more Vs have been proposed since. The first three Vs allow us to define a taxonomy of big data. They are: volume, variety and velocity. Volume is concerned with how to store big data: data repositories for large amounts of data. Variety is concerned with how to put together data from different sources. Velocity concerns the ability to deal with data arriving very fast, in streams known as data streams. Analytics is also about discovering knowledge from data streams, going beyond the velocity component of big data.

    Another term that has appeared and is sometimes used as a synonym for big data is data science. According to Provost and Fawcett [5], big data are data sets that are too large to be managed by conventional data‐processing technologies, requiring the development of new techniques and tools for data storage, processing and transmission. These tools include, for example, MapReduce, Hadoop, Spark and Storm. But data volume is not the only characterization of big data. The word "big" can refer to the number of data sources, to the importance of the data, to the need for new processing techniques, to how fast data arrive, to the combination of different sets of data so they can be analyzed in real time, or to its ubiquity, since any company, nonprofit organization or individual now has access to data.

    Thus big data is more concerned with technology. It provides a computing environment, not only for analytics, but also for other data processing tasks. These tasks include financial transaction processing, web data processing and georeferenced data processing.

    Data science is concerned with the creation of models able to extract patterns from complex data and the use of these models in real‐life problems. Data science extracts meaningful and useful knowledge from data, with the support of suitable technologies. It has a close relationship to analytics and data mining. Data science goes beyond data mining by providing a knowledge extraction framework, including statistics and visualization.

    Therefore, while big data gives support to data collection and management, data science applies techniques to these data to discover new and useful knowledge: big data collects and data science discovers. Other terms such as knowledge discovery or extraction, pattern recognition, data analysis, data engineering, and several others are also used. The definition of data analytics that we use covers all these areas that are used to extract knowledge from data.

    1.2 Big Data Architectures

    As data increase in size, velocity and variety, new computer technologies become necessary. These new technologies, which include hardware and software, must be easily expanded as more data are processed. This property is known as scalability. One way to obtain scalability is by distributing the data processing tasks into several computers, which can be combined into clusters of computers. The reader should not confuse clusters of computers with clusters produced by clustering techniques, which are techniques from analytics in which a data set is partitioned to find groups within it.

    Even if processing power is expanded by combining several computers in a cluster, creating a distributed system, conventional software for distributed systems usually cannot cope with big data. One of the limitations is the efficient distribution of data among the different processing and storage units. To deal with these requirements, new software tools and techniques have been developed.

    One of the first techniques developed for big data processing using clusters was MapReduce. MapReduce is a programming model that has two steps: map and reduce. The most famous implementation of MapReduce is called Hadoop.

    MapReduce divides the data set into parts – chunks – and stores in the memory of each cluster computer the chunk of the data set needed by this computer to accomplish its processing task. As an example, suppose that you need to calculate the average salary of 1 billion people and you have a cluster with 1000 computers, each with a processing unit and a storage memory. The people can be divided into 1000 chunks – subsets – with data from 1 million people each. Each chunk can be processed independently by one of the computers. The results produced by each of these computers (the average salary of 1 million people) can then be averaged, returning the final average salary. A minimal sketch of this idea is shown below.
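
    The following is a minimal, single-process Python sketch of the chunk-and-combine computation just described; the salary values and helper names are hypothetical, and a real framework such as Hadoop or Spark would run the map step in parallel across the cluster rather than in a loop on one machine.

    ```python
    import random

    # Hypothetical salary data standing in for the 1 billion records of the example.
    salaries = [random.uniform(1000, 10000) for _ in range(1_000_000)]

    def split_into_chunks(data, n_chunks):
        """Divide the data set into n_chunks parts, one per cluster computer."""
        size = len(data) // n_chunks
        return [data[i * size:(i + 1) * size] for i in range(n_chunks)]

    def map_chunk(chunk):
        """Map step: each computer returns the sum and the count of its chunk."""
        return sum(chunk), len(chunk)

    def reduce_results(partials):
        """Reduce step: combine the partial sums and counts into a global average."""
        total = sum(s for s, _ in partials)
        count = sum(c for _, c in partials)
        return total / count

    chunks = split_into_chunks(salaries, n_chunks=1000)
    partials = [map_chunk(c) for c in chunks]  # done in parallel on a real cluster
    print("Average salary:", reduce_results(partials))
    ```

    Returning sums and counts instead of chunk averages keeps the final result exact even when the chunks do not all have the same size; averaging the chunk averages, as in the example above, is correct only when every chunk contains the same number of people.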

    To efficiently solve a big data problem, a distributed system must meet the following requirements:

    Make sure that no chunk of data is lost and the whole task is concluded. If one or more computers fail, their tasks, and the corresponding data chunks, must be taken over by another computer in the cluster.

    Repeat the same task, and the corresponding data chunk, on more than one cluster computer; this is called redundancy. Thus, if one or more computers fail, a redundant computer carries on with the task.

    Computers that have had faults can return to the cluster again when they are fixed.

    Computers can be easily removed from the cluster or extra ones included in it as the processing demand changes.

    A solution incorporating these requirements must hide from the data analyst the details of how the software works, such as how the data chunks and tasks are divided among the cluster computers.

    1.3 Small Data

    In the opposite direction from big data technologies and methods, there is a movement towards more personal, subjective analysis of chunks of data, termed small data. Small data is a data set whose volume and format allows its processing and analysis by a person or a small organization. Thus, instead of collecting data from several sources, with different formats, and generated at increasing velocities, creating large data repositories and processing facilities, small data favors the partition of a problem into small packages, which can be analyzed by different people or small groups in a distributed and integrated way.

    People are continuously producing small data as they perform their daily activities, be it navigating the web, buying a product in a shop, undergoing medical examinations or using apps on their mobile phones. When these data are collected to be stored and processed in large data servers they become big data. To be characterized as small data, a data set must have a size that allows its full understanding by a user.

    The type of knowledge sought in big and small data is also different, with the first looking for correlations and the second for causal relations. While big data provides tools that allow companies to understand their customers, small data tools try to help customers to understand themselves. Thus, big data is concerned with customers, products and services, and small data is concerned with the individuals that produced the data.

    1.4 What is Data?

    But what exactly are data? Data, in the information age, are a large set of bits encoding numbers, texts, images, sounds, videos, and so on. Unless we add information to data, they are meaningless. When we add information, giving them a meaning, these data become knowledge. But before data become knowledge, they typically pass through several steps where they are still referred to as data, despite being a bit more organized; that is, they have some information associated with them.

    Let us see the example of data collected from a private list of acquaintances or contacts.

    Information as presented in Table 1.1, usually referred to as tabular data, is characterized by the way data are organized. In tabular data, data are organized in rows and columns, where each column represents a characteristic of the data and each row represents an occurrence of the data. A column is referred to as an attribute or, with the same meaning, a feature, while a row is referred to as an instance, or with the same meaning, an object.

    Table 1.1 Data set of our private contact list.

    Instance or Object

    Examples of the concept we want to characterize.

    Example 1.1

    In the example in Table 1.1, we intend to characterize people in our private contact list. Each member is, in this case, an instance or object. It corresponds to a row of the table.

    Attribute or Feature

    Attributes, also called features, are characteristics of the instances.

    Example 1.2

    In Table 1.1, contact, age, education level and company are four different attributes.

    The majority of the chapters in this book expect the data to be in tabular format; that is, already organized by rows and columns, each row representing an instance and each column representing an attribute. However, a table can be organized differently, having the instances per column and the attributes per row.
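
    As an illustration, a small table in the spirit of Table 1.1 (whose content is not reproduced in this preview) can be built with the pandas library in Python; the contact names and attribute values below are hypothetical, and pandas itself is only one possible tool, not one prescribed by the book.

    ```python
    import pandas as pd

    # A hypothetical contact list in tabular format:
    # each row is an instance (a contact), each column is an attribute.
    contacts = pd.DataFrame({
        "contact":         ["Andrew", "Bernhard", "Carolina", "Dennis"],
        "age":             [55, 43, 37, 82],
        "education level": ["1st cycle", "2nd cycle", "PhD", "MSc"],
        "company":         ["good", "good", "bad", "good"],
    })

    print(contacts)                 # one instance per row, one attribute per column
    print(contacts["age"].mean())   # a simple descriptive statistic: the average age
    ```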

    There are, however, data that cannot be represented in a single table.

    Example 1.3

    As an example, if some of the contacts are relatives of other contacts, a second table, as shown in Table 1.2, representing the family relationships, would be necessary. You should note that each person referred to in Table 1.2 also exists in Table 1.1, i.e., there are relations between attributes of different tables.

    Table 1.2 Family relations between contacts.

    Data sets represented by several tables, making clear the relations between these tables, are called relational data sets. This information is easily handled using relational databases. In this book, only simple forms of relational data will be used. This is discussed in each chapter whenever necessary.

    Example 1.4

    In our example, data is split into two tables, one with the individual data of each contact (Table 1.1) and the other with the data about the family relations between them (Table 1.2).
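
    A minimal sketch of how two such related tables could be represented and combined is shown here, again with hypothetical names and relations; the point is only that the contacts referred to in the second table also exist in the first, so the two tables can be joined on that shared attribute.

    ```python
    import pandas as pd

    # Hypothetical excerpt of Table 1.1: individual data about each contact.
    contacts = pd.DataFrame({
        "contact": ["Andrew", "Carolina", "Dennis"],
        "age":     [55, 37, 82],
    })

    # Hypothetical Table 1.2: family relations between contacts.
    relations = pd.DataFrame({
        "contact":  ["Andrew",    "Carolina"],
        "relative": ["Carolina",  "Andrew"],
        "relation": ["father of", "daughter of"],
    })

    # Every contact in the relations table exists in the contacts table,
    # so the tables can be joined on the shared "contact" attribute.
    print(relations.merge(contacts, on="contact"))
    ```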

    1.5 A Short Taxonomy of Data Analytics

    Now that we know what data are, we will look at what we can do with them. A natural taxonomy that exists in data analytics is:

    Descriptive analytics: summarize or condense data to extract patterns.

    Predictive analytics: extract models from data to be used for future predictions.

    In descriptive analytics tasks, the result of a given method or technique¹ is obtained directly by applying an algorithm to the data. The result can be a statistic, such as an average, a plot, or a set of groups with similar instances, among other things, as we will see in this book. Let us see the definition of method and algorithm.

    Method or technique

    A method or technique is a systematic procedure that allows us to achieve an intended goal.

    A method shows how to perform a given task. But in order to use a language closer to the language computers can understand, it is necessary to describe the method/technique through an algorithm.

    Algorithm

    An algorithm is a self‐contained, step‐by‐step set of instructions easily understandable by humans, allowing the implementation of a given method. They are self‐contained in order to be easily translated to an arbitrary programming language.

    Example 1.5

    The method to obtain the average age of my contacts uses the ages of each (we could use other methods, such as using the number of contacts for each different age). A possible algorithm for this very simple example is shown next.
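
    The pseudocode box referred to here is not reproduced in this preview. As a stand-in, the following is a minimal sketch of such an algorithm, written in Python rather than pseudocode, assuming the ages of the contacts are available as a simple list.

    ```python
    def average_age(ages):
        """Compute the average age of a contact list given as a list of ages."""
        total = 0
        count = 0
        for age in ages:         # visit the age of each contact in turn
            total += age
            count += 1
        return total / count     # sum of the ages divided by the number of contacts

    print(average_age([55, 43, 37, 82]))  # hypothetical ages; prints 54.25
    ```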

    In the limit, a method can be straightforward. It is possible, in many cases, to express it as a formula instead of as an algorithm.

    Example 1.6

    For instance, the average could be expressed as the sum of the ages divided by the number of contacts: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where $x_i$ is the age of the $i$th contact and $n$ is the number of contacts.

    We have seen an algorithm that describes a descriptive method. An algorithm can also describe predictive methods. In this last case it describes how to generate a model. Let us see what a model is.

    Model

    A model in data analytics is a generalization obtained from data that can be used afterwards to generate predictions for new given instances. It can be seen as a prototype that can be used to make predictions. Thus, model induction is a predictive task.

    Example 1.7

    If we apply an algorithm for the induction of decision trees to explain who, among our contacts, is good company, we obtain a model, called a decision tree, like the one presented in Figure 1.1. It can be seen that people older than 38 years are typically better company than those whose age is equal to or less than 38: more than 80% of people aged 38 or less are bad company, while more than 80% of people older than 38 are good company. This model could be used to predict whether or not a new contact is good company. It would be enough to know the age of that new contact.


    Figure 1.1 A prediction model to classify someone as either good or bad company.
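
    The decision tree in Figure 1.1 amounts to a single test on the age attribute. As a sketch (the 38-year threshold and the class labels come from the figure; the function name is ours), applying the induced model to a new contact could look like this:

    ```python
    def predict_company(age):
        """Apply the decision tree of Figure 1.1: a single split on the age attribute."""
        if age > 38:
            return "good"   # more than 80% of people older than 38 are good company
        return "bad"        # more than 80% of people aged 38 or less are bad company

    # Predicting for a new contact only requires knowing their age.
    print(predict_company(45))  # -> good
    print(predict_company(30))  # -> bad
    ```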

    Now that we have a rough idea of what analytics is, let us see real examples of problems in data analytics.

    1.6 Examples of Data Use

    We will describe two real‐world problems from different areas as an introduction to the different subjects that are covered in this book. Many more could be presented. One of the problems is from medicine and the other is from economics. The problems were chosen partly for the availability of relevant data, because they will be solved in the project chapters of the book (Chapters 7 and 12).

    1.6.1 Breast Cancer in Wisconsin

    Breast cancer is a well‐known problem that affects mainly women. The detection of breast tumors can be performed through a biopsy technique known as fine‐needle aspiration. This uses a fine needle to sample cells from the mass under study. Samples of breast mass obtained using fine‐needle aspiration were recorded in a set of images [6]. Then, a dataset was collected by extracting features from these images. The objective of the first problem is to detect different patterns of breast tumors in this dataset, to enable it to be used for diagnostic purposes.

    1.6.2 Polish Company Insolvency Data

    The second problem concerns the prediction of the economic wealth of Polish companies. Can we predict which companies will become insolvent in the next five years? The answer to this question is obviously relevant to institutions and shareholders.

    1.7 A Project on Data Analytics

    Every project needs a plan. Or, to be precise, a methodology to prepare the plan. A project on data analytics involves more than the use of one or more specific methods. It involves:

    understanding the problem to be solved

    defining the objectives of the project

    looking for the necessary data

    preparing these data so that they can be used

    identifying suitable methods and choosing between them

    tuning the hyper‐parameters of each method (see below)

    analyzing and evaluating the results

    redoing the pre‐processing tasks and repeating the experiments

    and so on.

    In this book, we assume that in the induction of a model there are both hyper‐parameters and parameters whose values are set. The values of the hyper‐parameters are set by the user, or by some external optimization method. The parameter values, on the other hand, are model parameters whose values are set by a modeling or learning algorithm in its internal procedure. When the distinction is not clear, we use the term parameter. Thus, hyper‐parameters might be, for example, the number of layers and the activation function in a multi‐layer perceptron neural network, or the number of clusters for the k‐means algorithm. Examples of parameters are the weights found by the backpropagation algorithm when training a multi‐layer perceptron neural network, and the assignment of objects to clusters produced by k‐means. Multi‐layer perceptron neural networks and k‐means will be explained later in this book.
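
    As a concrete sketch of this distinction, assuming the scikit-learn library (which the book does not prescribe), the number of clusters and the network architecture below are hyper-parameters chosen by the user, while the cluster centres and the connection weights are parameters found by the learning algorithms themselves.

    ```python
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.neural_network import MLPClassifier

    X = np.random.rand(100, 2)            # hypothetical data set with two attributes
    y = (X[:, 0] > 0.5).astype(int)       # hypothetical binary target

    # Hyper-parameter chosen by the user: the number of clusters k.
    kmeans = KMeans(n_clusters=3, n_init=10).fit(X)
    print(kmeans.cluster_centers_)        # parameters found by the k-means algorithm

    # Hyper-parameters chosen by the user: hidden layer sizes and activation function.
    mlp = MLPClassifier(hidden_layer_sizes=(10,), activation="relu",
                        max_iter=500).fit(X, y)
    print([w.shape for w in mlp.coefs_])  # parameters (weights) found during training
    ```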

    How can we perform all these operations in an organized way? This section is all about methodologies for planning and developing projects in data analytics.

    A brief history of methodologies for data analytics is presented first. Afterwards, two different methodologies are described:

    a methodology from academia: the KDD process

    a methodology from industry: CRISP‐DM.

    The latter is used in the cheat sheet and project chapters (Chapters 7 and 12).

    1.7.1 A Little History on Methodologies for Data Analytics

    Machine learning, knowledge discovery from data and related areas experienced strong development in the 1990s. Both in academia and industry, research on these topics was advancing quickly. Naturally, methodologies for projects in these areas, now referred to as data analytics, became a necessity. In the mid‐1990s, both in academia and industry, different methodologies were presented.

    The most successful methodology from academia came from the USA. This was the KDD process of Usama Fayyad, Gregory Piatetsky‐Shapiro and Padhraic Smyth [7]. Despite being from academia, the authors had considerable work experience in industry.

    The most successful methodology from industry was, and still is, the CRoss‐Industry Standard Process for Data Mining (CRISP‐DM) [8]. Conceived in 1996, it later got underway as a European Union project under the ESPRIT funding initiative. The project had five partners from industry: SPSS, Teradata, Daimler AG, NCR Corporation and OHRA, an insurance company. In 1999 the first version was presented. An attempt to create a new version began between 2006 and 2008, but no new version is known to have resulted from these efforts. CRISP‐DM is nowadays used by many different practitioners and by several corporations, in particular IBM. However, despite its popularity, CRISP‐DM needs new developments in order to meet the new challenges of the age of big data.

    Other methodologies exist. Some of them are tool-specific: they assume the use of a given tool for data analytics. This is not the case for SEMMA, which, despite having been created by SAS, is tool independent. Each letter of its name, SEMMA, refers to one of its five steps: Sample, Explore, Modify, Model and Assess.

    Polls performed by kdnuggets [9] over the years (2002, 2004, 2007 and 2014) show how methodologies on data analytics have been used through time (Figure 1.2).

    Next, the KDD process and the CRISP‐DM methodologies are described in detail.


    Figure 1.2 The use of different methodologies on data analytics through time.

    1.7.2 The KDD Process

    Intended to be a methodology that could cope with all the processes necessary to extract knowledge from data, the KDD process proposes a sequence of nine steps. In spite of the sequence, the KDD process considers the possibility of going back to any previous step in order to redo some part of the process. The nine steps are:

    Learning the application domain: What is expected in terms of the application domain? What are the characteristics of the problem; its specificities? A good understanding of the application domain is required.

    Creating a target dataset: What data are needed for the problem? Which attributes? How will they be collected and put in the desired format (say, a tabular data set)? Once the application domain is known, the data analyst team should be able to identify the data necessary to accomplish the project.

    Data cleaning and pre‐processing: How should missing values and/or outliers such as extreme values be handled? What data type should we choose for each attribute? It is necessary to put the data in a specific format, such as a tabular format.

    Data reduction and projection: Which features should we include to represent the data? From the available features, which ones should be discarded? Should further information be added, such as adding the day of the week to a timestamp? This can be useful in some tasks. Irrelevant attributes should be removed.

    Choosing the data mining function: Which type of methods should be used? Four types of method are: summarization, clustering, classification and regression. The first two are from the branch of descriptive analytics while
