
Fundamentals of Data Science: Theory and Practice
Ebook · 672 pages · 6 hours


About this ebook

Fundamentals of Data Science: Theory and Practice presents basic and advanced concepts in data science along with real-life applications. The book gives students, researchers, and professionals at different levels a good understanding of the concepts of data science, machine learning, data mining, and analytics. Readers will find the authors' research experiences and achievements in data science applications, along with in-depth discussions of topics essential to data science projects, including the pre-processing carried out before predictive and descriptive analysis tasks, and proximity measures for numeric, categorical, and mixed-type data.

The authors systematically present many predictive and descriptive learning algorithms, including recent developments that have successfully handled large datasets with high accuracy. A number of descriptive learning tasks are also included.

  • Presents the foundational concepts of data science along with advanced concepts and real-life applications for applied learning
  • Includes coverage of a number of key topics such as data quality and pre-processing, proximity and validation, predictive data science, descriptive data science, ensemble learning, association rule mining, Big Data analytics, as well as incremental and distributed learning
  • Provides updates on key applications of data science techniques in areas such as Computational Biology, Network Intrusion Detection, Natural Language Processing, Software Clone Detection, Financial Data Analysis, and Scientific Time Series Data Analysis
  • Covers computer program code for implementing descriptive and predictive algorithms
Language: English
Release date: Nov 17, 2023
ISBN: 9780323972635
Author

Jugal K. Kalita

Dr. Jugal Kalita received his BTech degree from the Indian Institute of Technology in Kharagpur, India, his MS degree from the University of Saskatchewan, Canada, and his MS and PhD degrees from the University of Pennsylvania. He is a Professor of Computer Science at the University of Colorado at Colorado Springs. His research interests include machine learning and its applications to areas such as natural language processing, intrusion detection, and bioinformatics. He is the author of more than 250 research articles in reputed conferences and journals and has authored four books, including Network Traffic Anomaly Detection and Prevention from Springer, Gene Expression Data Analysis: A Statistical and Machine Learning Perspective from Chapman and Hall/CRC Press, and Recent Developments in Machine Learning and Data Analytics from Springer. He has received multiple National Science Foundation (NSF) grants.


    Book preview

    Fundamentals of Data Science - Jugal K. Kalita

    1: Introduction

    The secret of business is to know something that nobody else knows.

    — Aristotle Onassis

    Abstract

    In today's data-driven era, data serves as the lifeblood of any organization and society at large. Data Science emerges as an indispensable force driving innovation and informed decision-making. This introductory chapter lays the foundation for our comprehensive exploration in the book, ‘Fundamentals of Data Science - Theory and Practice.’ The chapter discusses the foundations of Data Science, including predictive analytics, descriptive analytics, diagnostic analytics, and prescriptive analytics, as well as its distinguishing properties, methodology, and real-world applications. We highlight Data Science's broad goals, which range from uncovering hidden knowledge and forecasting future outcomes to intelligent data grouping and providing actionable insights. Furthermore, we examine the boundaries of Data Science, refuting popular myths and clarifying its symbiotic relationship with other related disciplines. We walk through the Data Science pipeline, highlighting the critical stages of data collection, preparation, learning model creation, and knowledge interpretation. Finally, we explore the enormous scope of Data Science applications, demonstrating how it has transformed industries such as healthcare, computational biology, business, smart gadgets, and transportation. Data Science is at the forefront of current innovation as each industry utilizes the power of data-driven decision-making.

    Keywords

    Data Science; Analytics; Predictive analytics; Descriptive analytics; Diagnostic analytics; Prescriptive analytics; Hidden knowledge discovery; Data Science objectives; Data Science applications

    Consumer satisfaction is a fundamental performance indicator and a key element of an enterprise's success. The success of any enterprise relies on its understanding of customer expectations and needs, buying behaviors, and levels of satisfaction. Modern giant business houses analyze customer expectations and perceptions of the quality and value of products to make effective decisions regarding product launch and update, servicing, and marketing.

    Due to the availability of fast internet technologies and low-cost storage devices, it has become convenient to capture voluminous amounts of consumer opinions and records of consumer activities. However, discovering meaningful consumer feedback in a sea of heterogeneous review sources and activity records is like finding a needle in a haystack. Data Science comes to the rescue, isolating novel and meaningful information that supports sound decision making.

    Data Science is the study of methods for programming computers to process, analyze, and summarize data from various perspectives to gain revealing and impactful insights and solve a vast array of problems. It is able to answer questions that are difficult to address through simple database queries and reporting techniques. Data Science aims to address many of the same research questions as statistics and psychology, but with differences in emphasis. Data Science is primarily concerned with the development, accuracy, and effectiveness of the resulting computer systems. Statistics seeks to understand the phenomena that generate the data, often with the goal of testing hypotheses about those phenomena. Psychological studies aspire to understand the mechanisms underlying behaviors exhibited by people, such as concept learning, skill acquisition, and strategy change.

    Google Maps is a brilliant product developed by Google, Inc., using Data Science to facilitate easy navigation. But how does it work? It continuously collects location data from reliable heterogeneous sources, including GPS locations from the mobile phones of millions of users who keep their location services on. It captures location, velocity, and itinerary-related data automatically. Efficient Data Science algorithms are applied to the collected data to predict traffic jams and road hazards, the shortest routes, and the time to reach the destination. Massive quantities of past, current, and near-current traffic data help Google predict real-time traffic patterns.
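
    The route-finding piece of such a system can be illustrated in miniature. The sketch below runs Dijkstra's shortest-path algorithm over a tiny hypothetical road graph whose edge weights are travel minutes; Google's production system is vastly more elaborate, but it answers the same underlying question.

        import heapq

        def shortest_time(graph, source, target):
            # graph maps a node to a list of (neighbor, travel_minutes) pairs.
            best = {source: 0.0}
            heap = [(0.0, source)]
            while heap:
                minutes, node = heapq.heappop(heap)
                if node == target:
                    return minutes
                if minutes > best.get(node, float("inf")):
                    continue  # stale queue entry
                for neighbor, cost in graph.get(node, []):
                    new_time = minutes + cost
                    if new_time < best.get(neighbor, float("inf")):
                        best[neighbor] = new_time
                        heapq.heappush(heap, (new_time, neighbor))
            return float("inf")

        roads = {"home": [("mall", 10), ("bridge", 4)],
                 "bridge": [("mall", 3)],
                 "mall": [("office", 7)]}
        print(shortest_time(roads, "home", "office"))  # 14, via the bridge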

    1.1 Data, information, and knowledge

    To introduce the arena of Data Science, it is of utmost importance to understand the data-processing stack. Data Science-related processing starts with a collection of raw data. Any facts about events that are unprocessed and unorganized are called data. Generally, data are received raw and hardly convey any meaning. Data, in their original form, are useless until processed further to extract their hidden meaning. Data can be (i) operational or transactional data, such as customer orders, inventory levels, and financial transactions, (ii) nonoperational data, such as market-research data, customer demographics, and financial forecasting, (iii) heterogeneous data of different structures, types, and formats, such as MR images and clinical observations, and (iv) metadata, i.e., data about the data, such as logical database designs or data dictionary definitions.

    Information is the outcome of processing raw data in a meaningful way to obtain summaries of interest. To extract information from data, one has to categorize, contextualize, and condense the data. For example, information may indicate a trend in sales for a given period of time, or it may represent a buying pattern of customers in a certain place during a season. With rapid developments in computer and communication technologies, the transformation of data into information has become easier. In a true sense, Data Science digs into raw data to uncover hidden patterns and novel insights.
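
    As a minimal sketch of this data-to-information step, assuming a hypothetical table of sales transactions, a few lines of Python with the pandas library can condense raw records into a monthly sales trend:

        import pandas as pd

        # Raw data: individual transactions (unprocessed facts about events).
        sales = pd.DataFrame({
            "date": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-11",
                                    "2023-02-25", "2023-03-08"]),
            "amount": [120.0, 80.0, 150.0, 90.0, 200.0],
        })

        # Information: categorize by month and condense into totals, revealing
        # a trend that the raw rows do not show by themselves.
        monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()
        print(monthly)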

    Knowledge represents the human understanding of a subject matter, gained through systematic analysis and experience. Knowledge results from an integration of human perspectives and processes to derive meaningful conclusions. Some researchers [5] define knowledge with reference to a subject matter from three perspectives: (i) understanding (know-why), (ii) cognition or recognition (know-what), and (iii) capacity to act (know-how). Knowledge in humans can be stored only in brains, not in any other media; the brain has the ability to interconnect it all. Unlike human beings, computers are not capable of understanding what they process, and they cannot make independent decisions. Hence, computers are not artificial brains! While building knowledge, our brain depends on two sources: data and information. To understand the relationship between data and information, consider an example. If you take a photograph of your house, the raw image is an example of data. However, details of how the house looks, in terms of attributes such as the number of stories, the colors of the walls, and its apparent size, constitute information. If you email the photograph to a friend, you are not sending your house or its description; how your friend perceives the house's appearance from the photograph is up to him or her. And if the image is corrupted or lost, your original house is still retained. Hence, even if the information is destroyed, the data source remains.

    The key concepts of data, information, and knowledge are often illustrated as a pyramid, with data as the starting point at the base (see Fig. 1.1) and knowledge generation at the top. Collecting knowledge from related concepts, domains, and processes further gives rise to wisdom. We skip discussions of wisdom, as the concept is highly abstract, controversial, and difficult to describe. Usually, the sizes of the repositories needed to store data, information, and knowledge become smaller as we move up the pyramid, while importance grows: data in their original form have lower importance than information and knowledge. It is worth mentioning that quality raw data lead to more significant information and knowledge generation. Hence, good-quality data collection is a stepping stone for effective information and knowledge mining.

    Figure 1.1 Data, Information, and Knowledge pyramid, and intermediate conversion layers. The directions of the arrowheads indicate increase in size and importance.

    1.2 Data Science: the art of data exploration

    Data Science is a multifaceted and multidisciplinary domain dedicated to extracting novel and relevant patterns hidden inside data. It encompasses mathematical and statistical models, efficient algorithms, high-performance computing systems, and systematic processes to dig inside structured or unstructured data and extract nontrivial, actionable knowledge that is ultimately useful and impactful in the real world.

    The success of Data Science depends on many factors. The bulk of the effort has concentrated on developing effective exploratory algorithms. Developing such algorithms usually involves mathematical theory, and applying the theory to large-scale raw data requires expensive computation.

    1.2.1 Brief history

    The dawn of the 21st century is known as the Age of Data. Data have become the new fuel for almost every organization as references to data have infiltrated the vernacular of various communities, both in industry and academia. Many data-driven applications have become amazingly successful, assisted by research in Data Science. Although Data Science has become a buzzword recently, its roots are more than half a century old. In 1962, John Wilder Tukey, a famous American mathematician, published an article entitled The Future of Data Analysis [8] that sought to establish a science focused on learning from data. Six years later, another pioneer, the Danish computer scientist Peter Naur, introduced the term Datalogy as the science of data and of data processes [6], followed in 1974 by the book Concise Survey of Computer Methods [7], which defined the term Data Science as the science of dealing with data. Later, in 1977, the International Association for Statistical Computing (IASC) was founded with a plan for linking traditional statistical methodology, modern computer technology, and the knowledge of domain experts in order to convert data into information and knowledge. Tukey also published a major work entitled Exploratory Data Analysis [9], which emphasized using data analysis to suggest hypotheses rather than merely to confirm preconceived ones, giving rise to the term data-driven discovery. Following this, the first Knowledge Discovery in Databases (KDD) workshop was organized in 1989, which later became the annual ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD).¹

    Later, in 1996, Fayyad et al. [2] introduced the term Data Mining as the application of specific algorithms for extracting patterns from data. By the dawn of the 2000s, many journals had started recognizing the field, and notable figures like William S. Cleveland, John Chambers, and Leo Breiman expanded the boundaries of statistical modeling, envisioning a new epoch in statistics focused on Data Science [1].

    The term Data Scientist was first introduced in 2008 by Dhanurjay Patil and Jeff Hammerbacher, then of LinkedIn and Facebook, respectively [10].

    1.2.2 General pipeline

    Data Science follows a series of systematic steps for converting data into information in the form of patterns or decisions. Data Science has evolved by borrowing concepts from statistics, machine learning, artificial intelligence, and database systems to support the automatic discovery of interesting patterns in large data sets. A Data Science pipeline comprises the following four major phases. An illustrative representation of a typical Data Science workflow [4] is depicted in Fig. 1.2.

    Figure 1.2 Major steps in the Data Science pipeline for decision making and analysis.

    1.2.2.1 Data collection and integration

    Data are initially collected, and integrated if collection involves multiple sources. For any successful Data Science and data-analysis activity, data collection is one of the most important steps. The quality of the collected data carries great weight. If the collected samples are not sufficient to describe the overall system or process under study, downstream activities are likely to become useless despite employing sophisticated computing methods. The quality of the outcome is highly dependent on the quality of data collection.

    It has been observed that dependence on a single source of data is always precarious. Integration of multifaceted and multimodal data may offer better results than working with a single source of information. In fact, information from one source may complement that from other sources when one source of data is not sufficient to understand a system or process well. However, the integration of multisource data is itself a challenging task and needs due attention. Integration should be deliberate, rather than a random mixing of sources, to deliver better results.
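
    As a minimal sketch of deliberate integration, assuming two hypothetical sources keyed by a shared customer identifier, records can be joined on that key rather than simply concatenated:

        import pandas as pd

        # Source 1: operational/transactional records.
        transactions = pd.DataFrame({
            "customer_id": [1, 2, 3],
            "total_spent": [250.0, 40.0, 310.0],
        })
        # Source 2: nonoperational demographic records.
        demographics = pd.DataFrame({
            "customer_id": [1, 2, 4],
            "age_group": ["25-34", "35-44", "18-24"],
        })

        # An inner join keeps only customers present in both sources; the key
        # and the join strategy are deliberate design choices, not random mixing.
        integrated = transactions.merge(demographics, on="customer_id", how="inner")
        print(integrated)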

    1.2.2.2 Data preparation

    Raw data collected from input sources are not always suitable for downstream exploration. The presence of noise and missing values and the prevalence of nonuniform data structures and standards may negatively affect final decision making. Hence, it is of utmost importance to prepare the raw data before downstream processing. Preprocessing also filters out uninformative or possibly misleading values, such as outliers.
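
    A minimal sketch of such preparation, assuming a hypothetical numeric column with one missing value and one outlier, might impute with the median and then filter with the standard 1.5 x IQR fence:

        import numpy as np
        import pandas as pd

        raw = pd.DataFrame({"response_time": [0.8, 1.1, np.nan, 0.9, 42.0, 1.0]})

        # Impute the missing value with the median, which the outlier cannot skew.
        raw["response_time"] = raw["response_time"].fillna(raw["response_time"].median())

        # Keep only values inside the interquartile fences; 42.0 is filtered out.
        q1, q3 = raw["response_time"].quantile([0.25, 0.75])
        fence = 1.5 * (q3 - q1)
        clean = raw[raw["response_time"].between(q1 - fence, q3 + fence)]
        print(clean)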

    1.2.2.3 Learning-model construction

    Different machine learning models are suitable for learning different types of data patterns. Iterative learning via refinement is often more successful in understanding data distributions. A plethora of models are available to a data scientist, and choices must be made judiciously. Models are typically used to explain the data, extract patterns that describe it, or predict associations.
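
    A minimal sketch of judicious model choice, assuming a synthetic dataset generated with scikit-learn, compares two candidate classifiers by cross-validation before committing to one:

        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score
        from sklearn.tree import DecisionTreeClassifier

        X, y = make_classification(n_samples=500, n_features=10, random_state=0)

        # Score each candidate with 5-fold cross-validation on the same data.
        for model in (LogisticRegression(max_iter=1000),
                      DecisionTreeClassifier(random_state=0)):
            scores = cross_val_score(model, X, y, cv=5)
            print(type(model).__name__, round(scores.mean(), 3))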

    1.2.2.4 Knowledge interpretation and presentation

    Finally, results need to be interpreted and explained by domain experts. Each step of analysis may trigger corrections or refinements that are applied to the preceding steps.
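
    As a minimal sketch of handing results to a domain expert, assuming a synthetic dataset and hypothetical feature names, a fitted linear model's coefficients can be laid out for inspection; a surprising sign or magnitude would trigger corrections to the preceding steps:

        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression

        X, y = make_classification(n_samples=300, n_features=4, random_state=0)
        model = LogisticRegression(max_iter=1000).fit(X, y)

        # Hypothetical feature names, attached purely for illustration.
        for name, coef in zip(["age", "income", "visits", "tenure"], model.coef_[0]):
            print(f"{name:>7}: {coef:+.2f}")  # sign and size hint at each feature's role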

    1.2.3 Multidisciplinary science

    Data Science is a multidisciplinary domain of study that extracts, identifies, and analyzes novel knowledge from raw data by applying computing and statistical tools, together with domain experts for the interpretation of outcomes. It involves mathematical and statistical tools for effective data analysis and modeling, pattern recognition and machine learning to assist in decision making, data and text mining for hidden pattern extraction, and database technologies for effective large-scale data storage and management (Fig. 1.3). Due to the complex nature of data elements and their relationships, understanding the data itself is most often challenging, and before the underlying distribution of the data elements is understood, it may not be very fruitful to apply any statistical or computational tools for knowledge extraction. Visualization may play a large role in deciphering the interrelationships among data elements, thereby helping decide the appropriate computational models or tools for subsequent data analysis. The presence of noise in the data may be discovered, and the noise eliminated, by looking into distribution plots. However, it is well known that visualizing multidimensional data is itself challenging and needs special attention. With the availability of low-cost data-generation devices and fast communication technologies, Big Data, or vast amounts of data, have become ubiquitous. Dealing with Big Data for Data Science needs high-performance computing platforms. The science and engineering of parallel and distributed computing are important disciplines that need to be integrated into the Data Science ecosystem. Recently, it has become convenient to integrate parallel computing due to the wide availability of relatively inexpensive Graphics Processing Units (GPUs). Last but not least, knowledge of and expertise in the domain in which Data Science approaches are applied play major roles during problem formulation and interpretation of solutions.

    Figure 1.3 Data Science joins hands with a variety of other disciplines.
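
    As a minimal sketch of this use of visualization, assuming a hypothetical measurement contaminated with a few implausible values, a simple histogram makes the noise stand out from the bulk of the distribution:

        import matplotlib.pyplot as plt
        import numpy as np

        rng = np.random.default_rng(0)
        measurements = np.concatenate([
            rng.normal(10.0, 1.0, 500),   # the bulk of the data
            rng.uniform(30.0, 40.0, 5),   # a few injected noise values
        ])

        # Isolated bars far from the main mass suggest noise or outliers.
        plt.hist(measurements, bins=50)
        plt.xlabel("measurement")
        plt.ylabel("count")
        plt.show()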

    1.3 What is not Data Science?

    In recent years, the term Data Science has become a buzzword in the world of business and intelligent computing. As usual, high demand and popularity invite misinterpretation and hype. It is important to be aware of the terms that are used as well as misused in the context of Data Science.

    Machine Learning is not a branch of Data Science. It provides the technology, or the tools, that facilitate smart decision making through software. Data Science uses Machine Learning as a tool for autonomous pattern analysis and decision making.

    There is a prevalent fallacy that techniques of Data Science are applicable only to very large amounts of data, the so-called Big Data. This is not true; smaller amounts of data can also be analyzed usefully. The quality and completeness of the data in hand are always important. It is true, however, that a Machine Learning system is likely to extract more accurate knowledge when large amounts of relevant data are available from which to draw intuitions about the underlying patterns.

    It is true that statistical techniques play a great role in effective data analysis. Statistics complements and enhances Data Science [3] for efficient and effective analysis of large collections of data. Statistics uses mathematical models to infer and analyze data patterns by studying the distributions of samples collected in the past. However, Data Science cannot be considered dependent on statistics alone. Statistics is used mostly to describe past data, whereas Data Science performs predictive learning for actionable decision making. A number of nonparametric learning models used in Data Science help understand the data well without assuming any underlying data distribution.
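
    A minimal sketch of such a nonparametric model, assuming a synthetic two-class dataset from scikit-learn, is a k-nearest-neighbors classifier, which predicts directly from stored samples without positing any form for the data distribution:

        from sklearn.datasets import make_moons
        from sklearn.model_selection import train_test_split
        from sklearn.neighbors import KNeighborsClassifier

        # Two interleaved, nonlinearly separated classes.
        X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

        # Each prediction is a majority vote among the 5 nearest training points.
        knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
        print("test accuracy:", round(knn.score(X_te, y_te), 3))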

    Last but not least, people often give more importance to scripting languages, such as Python or R, and to ready-to-use tools than to understanding the theory of Data Science. Of course, knowledge of tools greatly helps in developing intelligent systems quickly and effectively. However, many users concentrate on readily and freely available preimplemented models without understanding the underlying models and formalisms. Knowing only prescripted or programmed tools does not provide a good overall understanding of Data Science, which is what allows existing tools to be adapted and used efficiently to solve complex problems. Proficiency in data-analysis tools without deeper knowledge of data analysis does not make a good data scientist.

    1.4 Data Science tasks

    Data Science-related activities are broadly classified into predictive and descriptive tasks. The former deals with novel inferences based on acquired knowledge, and the latter describes the inherent patterns hidden inside data. With the rise in business-analysis applications, the span of Data Science tasks has extended further into two related tasks, namely diagnostic and prescriptive. Somewhat simplistic, but differentiating, views of the four tasks can be obtained by asking four different questions: What is likely to happen? (Predictive), What is happening? (Descriptive), Why is it happening? (Diagnostic), and What do I need to do? (Prescriptive).²

    1.4.1 Predictive Data Science

    Predictive tasks apply supervised Machine Learning to predict the future by learning from past experiences. Examples of predictive analysis are classification, regression, and deviation detection. Some predictive techniques are presented below.

    Classification attempts to assign a given instance to one of several prespecified classes based on the behaviors and correlations of collected and labeled samples with the target class. A classifier is designed based on patterns in existing data samples (training data). The trained model is then used to infer the classes of unknown samples. The overall objective of any good classification technique is to learn from the training samples and to build accurate descriptions for each class. For example, spam filtering separates incoming emails into safe and suspicious ones based on the signatures or attributes of each email.

    Similar to classification, prediction techniques infer a future state based on experiences from the past. The prime difference between classification and prediction models is the type of outcome they produce: classification assigns each sample to one of several prespecified classes, whereas prediction outcomes are continuous-valued prediction scores. The creation of predictive models is otherwise similar to classification. Predicting the next day's or next week's weather or temperatures is a classic example of a prediction task, based on observations of weather patterns over the last several years in addition to current conditions.

    Time-series data analysis predicts future trends in time-series data to find regularities, similar sequences or subsequences, sequential patterns, periodicities, trends, and deviations. For example, predicting trends in the stock values of a company draws on its stock history, business situation, competitor performance, and current market conditions.
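
    As a minimal sketch of the classification task, assuming a handful of hypothetical labeled emails, a bag-of-words naive Bayes filter can be trained and then applied to an unseen message:

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.pipeline import make_pipeline

        # Training data: labeled samples of each target class.
        emails = ["win a free prize now", "meeting agenda attached",
                  "claim your free reward", "lunch tomorrow at noon"]
        labels = ["spam", "safe", "spam", "safe"]

        # Learn word-frequency signatures per class, then classify a new email.
        clf = make_pipeline(CountVectorizer(), MultinomialNB()).fit(emails, labels)
        print(clf.predict(["free prize inside"]))  # expected: ['spam']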
