Data Science Fundamentals and Practical Approaches: Understand Why Data Science Is the Next
About this ebook

This book introduces the fundamental concepts of Data Science, which has proved to be a major game-changer in solving business problems.
Topics covered in the book include fundamentals of Data Science, data preprocessing, data plotting and visualization, statistical data analysis, machine learning for data analysis, time-series analysis, deep learning for Data Science, social media analytics, business analytics, and Big Data analytics. The content of the book describes the fundamentals of each of the Data Science related topics together with illustrative examples as to how various data analysis techniques can be implemented using different tools and libraries of Python programming language.
Each chapter contains numerous examples and illustrative output to explain the important basic concepts. Questions are presented at the end of each chapter for self-assessing the conceptual understanding, and the references at the end of every chapter will help readers explore each topic further.
Language: English
Release date: Jun 2, 2020
ISBN: 9789389845679

    Book preview

    Data Science Fundamentals and Practical Approaches - Rupam Kumar Sharma

    CHAPTER 1

    Fundamentals of Data Science

    The goal is to turn data into information, and information into insight

    — Carly Fiorina

    Data, in today's technology-driven world, is vital in decision making. The rate at which data is generated every day is tremendous. Every company uses data to understand its customers better. Data science and data analytics can extract meaningful insights that help companies identify possible areas of growth, streamline costs, spot better product opportunities, and make effective decisions. Data analysis can make an impact in every sector, be it healthcare, medicine, the stock market, academic institutes, and so on. Undoubtedly, data will keep growing for the next few decades, and IT jobs are steadily expanding to deal with the bulk of Big Data whose analysis has become the need of the hour.

    This chapter elaborately discusses data science, one of the most in-demand careers of the 21st century. The world of data science may comprise simple tasks such as estimating the sales of products in the coming year and viewing the trend of products in the market, or complex tasks such as predicting disease based on a complex neural network model or classifying and recommending products based on fuzzy logic theory. Josh Wills, Director of Data Engineering at Slack, defined a data scientist as a "person who is better at statistics than any software engineer and better at software engineering than any statistician." Thus, a data scientist plays a pivotal role in data analysis, a currently much-in-demand area of study being explored at an exponential rate to gain hidden insights for better decision making.

    Structure

    The next few sections in this chapter will discuss the following topics:

    Introduction to data science

    Why learn data science?

    Data analytics lifecycle

    Types of data analysis

    Types of jobs in data analytics

    Data science tools

    Fundamental areas of study in data science

    Role of SQL in data science

    Pros and cons of data science

    Conclusion

    References

    Points to remember

    Exercises

    Objectives

    After studying this chapter, you should be able to:

    Understand the concept and need for data science.

    Discuss the various phases in the data analytics lifecycle.

    Learn the various types of data analytics and the important tools applied in data science.

    Analyze the fundamental areas of study in data science.

    1.1. Introduction to data science

    Data science is the task of scrutinizing and processing raw data to reach meaningful conclusions. Data is mined and classified to detect and study behavioral patterns, and the techniques used for this may vary according to the requirements. All data that is available for analysis can be classified into four types: nominal data, ordinal data, interval data, and ratio data. A common and useful acronym for these four types is NOIR (Nominal, Ordinal, Interval, Ratio), which happens to mean 'black' in French. A detailed description of each of these types of data is provided in Chapter 2: Data Preprocessing.
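
    To make the four levels concrete, the short sketch below builds a toy pandas DataFrame (all column names and values are hypothetical) with one column per NOIR level; pandas' ordered categoricals capture the nominal/ordinal distinction directly.

```python
# A minimal sketch (hypothetical toy data) of the four NOIR measurement levels.
import pandas as pd

df = pd.DataFrame({
    # Nominal: categories with no inherent order (labels only).
    "blood_group": pd.Categorical(["A", "B", "O", "AB"]),
    # Ordinal: categories with a meaningful order but unequal spacing.
    "satisfaction": pd.Categorical(
        ["low", "high", "medium", "high"],
        categories=["low", "medium", "high"], ordered=True),
    # Interval: equal spacing but no true zero (0 C does not mean "no heat").
    "temperature_c": [20.5, 31.0, 25.2, 28.9],
    # Ratio: equal spacing and a true zero, so ratios are meaningful.
    "monthly_income": [42000, 0, 58000, 61000],
})

print(df.dtypes)
print(df["satisfaction"].min())  # ordering makes min/max meaningful
```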

    For data collection, there are two major sources of data: primary and secondary. Primary data is data that has never been collected before; it can be gathered in a variety of ways such as participatory or non-participatory observation, conducting interviews, and collecting data through questionnaires or schedules. Secondary data, on the other hand, is data that has already been gathered and can be easily accessed and used by other users. Secondary data can come from existing case studies, government reports, newspapers, journals, books, and also from many popular dedicated websites that provide datasets. A few standard websites for downloading datasets include the UCI Machine Learning Repository, the Kaggle datasets, the IMDB datasets, and the Stanford Large Network Dataset Collection. Though there are clear benefits to using readily available secondary data, its authenticity and validity must be verified.
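
    As a small illustration of working with secondary data, the sketch below loads the classic Iris dataset, which originated in the UCI Machine Learning Repository and ships with scikit-learn, so no download is needed; a CSV obtained from Kaggle or UCI would be read with pandas in the same way.

```python
# A minimal sketch of acquiring secondary data: the Iris dataset (originally
# hosted on the UCI Machine Learning Repository) ships with scikit-learn.
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)      # returns the data as pandas objects
df = iris.frame                      # features plus the 'target' column
print(df.shape)                      # (150, 5)
print(df.head())

# A CSV downloaded from Kaggle or UCI would be loaded the same way, e.g.:
# df = pd.read_csv("path/to/downloaded_dataset.csv")
```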

    It is said that we are all data analysts to varying degrees in our everyday lives. We analyze the need for and working principle of an electronic gadget before purchasing it, or we predict the demand for a particular course over the next few years in terms of job prospects before enrolling our children in it. One does not need to be an exceptional expert in analytics to do analysis. Over the years, however, the need for complex data analysis has been felt immensely in major business sectors and companies seeking to discover historical patterns that improve future business performance.

    1.2. Why learn data science?

    There has been a revolutionary change in the behavioral pattern of customers in case of online purchases, stock market investment, advertising products to other customers, and so on. Each of these activities requires an in-depth analysis of existing relevant data which makes data science a promising field of study in today’s fast-growing data-driven world.

    A few of the industry verticals where data science has found prominence and is used for operational and strategic decision making are discussed below:

    Ecommerce: Ecommerce sites rely heavily on data science to maximize revenue and profitability. These sites analyze the shopping and purchasing behavior of customers and accordingly recommend products for further online purchases.

    Finance: The finance market is an emerging field in the data industry. Financial analytics takes care of risk analysis, fraud detection, shareholders' upcoming share status, working capital management, and so on.

    Retail: Retail industries take a 360-degree view of customers and their feedback reviews. Retail analytics examines customers' purchasing trends and demands in order to offer products matched to customers' liking. Retail industries use data science for optimal pricing, personalized offers, better marketing strategies, market basket analysis, stock management, and so on.

    Healthcare: The healthcare sector nowadays relies heavily on analytics of patient data to predict diseases and health issues. Healthcare industries analyze data for patient quality care, improved patient care, classification of patients' symptoms, predicted health deficiencies, and so on.

    Education: The sources of data in education are vast, ranging from student-centric data and enrollment in various courses to scholarship and fee details, examination results, and so on. Education analytics plays a major role in academic institutions in improving admissions, empowering students toward successful examination results, and supporting all-round student performance.

    Human Resource (HR): HR analytics involves HR-related data that can be used for building strong leadership, employee acquisition, employee retention, workforce optimization, and performance management.

    Sports: Nowadays, sports analytics is often used in international tournaments to analyze the performance of players, predict scores, prevent injuries, and estimate the possibility of a particular team winning or losing a match.

    The use of data science is nowadays found in every prominent domain, a few of which have been addressed above. A few other sectors that deserve mention are telecom, sales, supply chain management, risk monitoring, manufacturing, and IT. In today's competitive environment, businesses no longer consider data science an optional requirement; they hire data analysts and data scientists to mine massive hidden data, provide meaningful results, and generate reports that drive profit-making decisions. Recent trends in the job market show that data analysts, data scientists, and data engineers are in huge demand in IT companies, and this demand will continue over the next decade. Hence, a career as a data analyst, data scientist, or data engineer can uplift your job profile, and the demand will be witnessed in many companies in the years to come.

    1.3. Data analytics lifecycle

    While the terms data science and data analytics are often used interchangeably, the two terms differ considerably in scope. Data science is an umbrella term comprising a large variety of fields, whereas data analytics is more focused and can be considered a subset of data science. Hence, to understand data science thoroughly, let us first try to understand the various phases in the data analytics lifecycle.

    Data analytics mainly involves six important phases carried out in a cycle: data discovery, data preparation, model planning, model building, communication of results, and operationalization. Figure 1.1 illustrates the six phases of the data analytics lifecycle, which are followed one after another to complete one cycle. It is interesting to note that the six phases are iterative and allow both forward and backward movement between phases. The lifecycle provides a framework for performing each phase well, from the creation of a project until its completion, and was shaped by the care and experimentation of many practicing data scientists. The key stakeholders in data science projects are business analysts, data engineers, database administrators, project managers, executive project sponsors, and data scientists.

    Figure 1.1: The Data Analytics Life Cycle

    Let us now briefly discuss all the six phases of the data analytics lifecycle followed in any data science projects:

    1.3.1. Data discovery

    In this first phase of data analytics, the stakeholders regularly perform the following tasks: examining business trends, making case studies of similar data analytics projects, and studying the domain of the business industry. The entire team assesses the in-house resources, the in-house infrastructure, the total time involved, and the technology requirements. Once these assessments and evaluations are completed, the stakeholders formulate an initial hypothesis for resolving the business challenges in terms of the current market scenario.

    1.3.2. Data preparation

    In the second phase, following data discovery, data is prepared by transforming it from a legacy system into a data analytics form using a sandbox platform. A sandbox is a scalable platform commonly used by data scientists for data preprocessing; it typically provides substantial CPU power, high-capacity storage, and high I/O capacity. The IBM Netezza 1000 is one such data sandbox platform, used by IBM for handling data marts. The stakeholders involved in this phase mostly preprocess data to obtain preliminary results using a standard sandbox platform.

    1.3.3. Model planning

    The third phase of the lifecycle is model planning, where the data analytics team plans the methods to be adopted and the workflows to be followed during the subsequent model building phase. At this stage, work is divided among the team members to clearly define each person's workload. The data prepared in the previous phase is explored further to understand the various features and their relationships, and feature selection is performed for applying to the model.

    1.3.4. Model building

    The next phase of the lifecycle is model building, in which the team develops datasets for training, testing, and production purposes. The model is then executed based on the planning made in the previous phase. The kind of environment needed for executing the model is decided and prepared, so that if a more robust environment is required, it can be provisioned accordingly.

    1.3.5. Communicate results

    Phase five of the lifecycle examines the results of the project to determine whether it is a success or a failure. The results are scrutinized by the entire team along with its stakeholders to draw inferences on the key findings and summarize the work done. Business value is also quantified, and an elaborate narrative on the key findings is prepared and discussed among the various stakeholders.

    1.3.6. Operationalization

    In phase six, a final report is prepared by the team along with the briefings, source codes, and related documents. The last phase also involves running a pilot project to implement the model and test it in a real-time environment. As data analytics helps build models that lead to better decision making, it in turn adds value to individuals, customers, business sectors, and other organizations. While proceeding through these six phases, the various stakeholders that can be involved in planning, implementation, and decision-making are data analysts, business intelligence analysts, database administrators, data engineers, executive project sponsors, project managers, and data scientists. All these stakeholders are rigorously involved in the proper planning and completion of the project, keeping in mind the various crucial factors for its success.

    1.4. Types of data analysis

    There are many different ways to analyze data. Some forms are more complex than others, and on this basis data analysis has been broadly divided into four types, namely descriptive analysis, diagnostic analysis, predictive analysis, and prescriptive analysis. Figure 1.2 shows the level of complexity of each of these four types of data analysis.

    Figure 1.2: Four types of data analysis based on the level of complexity

    Let us briefly discuss each of the four types of data analysis and find how each of these types differs from one another:

    1.4.1. Descriptive analysis

    Descriptive analysis is the simplest and the most common type of data analysis used by companies and other sectors. This type of data analysis is mostly used in businesses to generate monthly revenue reports, sales leads, and key performance indicators (KPI) dashboards. It describes the main aspects of the data being analyzed. The data dealt with are large in volume and often include the entire population. The results or reports generated are based on data that are already available.

    The main emphasis in descriptive analysis is on 'what has happened?', answered by analyzing valuable information found in available past data. For example, with descriptive analysis, a data analyst will be able to generate the statistical results of the performance of the cricket players of team India. For generating such results, the data may need to be integrated from multiple data sources to gain meaningful insights through statistical analysis.
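
    A minimal sketch of descriptive analysis is shown below, using a hypothetical table of players' runs; pandas' describe() and groupby() produce exactly the kind of summary reports described above.

```python
# A minimal sketch of descriptive analysis on hypothetical player statistics.
import pandas as pd

scores = pd.DataFrame({
    "player": ["A", "A", "B", "B", "C", "C"],
    "runs":   [45, 77, 12, 56, 98, 34],
})

print(scores["runs"].describe())                 # count, mean, std, quartiles
print(scores.groupby("player")["runs"].mean())   # average runs per player
```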

    1.4.2. Diagnostic analysis

    Diagnostic analysis differs from descriptive analysis by emphasizing not only 'what has happened?' but also 'why did it happen?' This type of data analysis tries to gain a deeper understanding of the reasons behind the patterns found in past data. Here, business intelligence comes into play by digging down to find the root cause of the pattern or nature of the data obtained. For example, with diagnostic analysis, a data analyst will be able to find why the performance of each player of the Indian cricket team has improved (or degraded) over the past six months.

    The diagnostic analysis deals with the critical task of finding the reason behind a particular change or phenomenon. This is a major task in the field of data analysis, as an analyst has to be critical and correct in identifying the cause of an occurrence in order to gain or profit in various fields. For this purpose, an analyst often uses machine learning techniques and business intelligence for a deeper understanding of the given problem.
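
    The sketch below illustrates one common diagnostic technique, a correlation drill-down, on hypothetical performance data; a strong correlation is only a candidate explanation, not proof of cause.

```python
# A minimal sketch of a diagnostic drill-down: checking which factors
# correlate with a drop in performance (all columns are hypothetical).
import pandas as pd

df = pd.DataFrame({
    "runs_scored":    [80, 65, 50, 42, 30, 25],
    "matches_played": [4, 6, 9, 11, 14, 16],
    "rest_days":      [10, 8, 6, 5, 3, 2],
})

# A strong negative correlation between workload and runs would be one
# candidate explanation for "why performance degraded".
print(df.corr())
```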

    1.4.3. Predictive analysis

    Predictive analysis, as the name suggests, deals with prediction of the future based on available current and past data. The main emphasis in predictive analysis is on 'what is likely to happen?', utilizing previous data to estimate the future outcome. For example, with predictive analysis, a data analyst will be able to predict the performance of each player of the Indian cricket team for the upcoming international cricket world cup. Such prediction can help the Board of Control for Cricket in India (BCCI) decide on players' selection for the upcoming international tournament.

    Predictive analysis is applied in many domains such as risk management, sales forecasting, weather forecasting, and prediction of team performance. Though descriptive and diagnostic analyses are more common in nature, data analysts are also hired in large numbers to predict future trends in businesses and other marketing sectors. In most cases, prediction is made by dividing the available dataset into a training set and a testing set; a machine learning algorithm is trained and then checked for its level of prediction accuracy. If the accuracy is found satisfactory, the algorithm is then used to predict future data. However, it is important to remember that the predicted solution provides an approximate forecast that may vary from the actual result, as a hundred percent accuracy is never guaranteed. This train/test workflow is sketched below.
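
    The following sketch uses scikit-learn with the Iris dataset and a logistic regression model as illustrative stand-ins; the split proportion and model choice are assumptions, not a prescription.

```python
# A minimal sketch of the train/test workflow described above.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")  # approximate, never guaranteed
```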

    1.4.4. Prescriptive analysis

    The final type of data analysis, which is the highest in terms of complexity, is called prescriptive analysis. In this type of data analysis, the insights gained from the other three types of analysis are combined to determine the kind of action to be taken to resolve a given situation. Prescriptive analysis prescribes the steps needed to avoid a future problem. It involves a high degree of responsibility, time, and complexity to reach informed decisions. Thus, prescriptive analysis makes recommendations based on the forecasting done in predictive analysis.

    To summarize the four main types of data analytics: descriptive analysis explains what has happened to date, diagnostic analysis emphasizes finding why it happened in a particular way, predictive analysis forecasts what might happen in the near future, and prescriptive analysis emphasizes recommending actions based on the forecast. All these types of analyses are usually carried out by a data analyst or data scientist to deal with the given data and produce a meaningful outcome based on the type of analysis required.

    1.5. Types of jobs in data analytics

    The various key stakeholders in any data analysis project include the data analyst, the data scientist, the data engineer, the database administrator, and the analytics manager. Each stakeholder has a clear role to play for a business problem right from understanding the essentials of the problem, proper planning, implementation of the project, analyzing the various outcomes of the project, solving the bottlenecks visible in the outcomes, and generating reports by drawing inferences about the success of the project. Figure 1.3 shows some of the key stakeholders involved in any data analytics-based project.

    Figure 1.3: Some of the key stakeholders in data analytics projects

    Though a big team may involve many other stakeholders such as analytics specialists, business intelligence consultants, chief creative officers, ETL developers, project sponsors, and many more, a few prominent roles play a pivotal part in bringing success to a project. The leader of any business project clearly defines the role of each stakeholder and the estimated timeline of each assigned task. Let us briefly discuss six such main stakeholders involved in business analytics, namely the data analyst, the data scientist, the data engineer, the database administrator, the data architect, and the analytics manager.

    1.5.1. Data analyst

    The main role of a data analyst is to extract data and interpret the information obtained from it to analyze the outcome of a given business problem. In this process, the analyst also discovers the various bottlenecks found in the results and provides possible solutions for the same. Extraction of information from existing data is done using one or more standard methodologies such as data cleaning, data transformation, data visualization, and data modeling. Using these methodologies, a data analyst is able to make careful data-driven decisions.

    The major skills required to be a data analyst are Python and/or R programming skills, Structured Query Language (SQL), Statistical Analysis Software (SAS), SAS Miner, Microsoft Excel and/or Tableau. The key areas and techniques which a data analyst should be well-versed with include the following:

    Data preprocessing, which is an important step in data analysis, involves data cleaning, data integration, data transformation, and data reduction. The task of data preprocessing is discussed elaborately in Chapter 2: Data Preprocessing.

    Data visualization, which is the graphical representation of data that can make information easy to analyze and understand. The task of data visualization is discussed elaborately in Chapter 3: Data Plotting and Visualization.

    Statistical modeling, which mainly involves two important kinds, descriptive or summary statistics and inferential statistics. The task of statistical data analysis is discussed elaborately in Chapter 4: Statistical Data Analysis.

    Programming skills, for which a data analyst may thoroughly practice and learn R and/or Python, the programming languages mainly used in data analysis.

    Communication and presentation skills, which are required for communicating with the team regarding the various reports and outcomes obtained after proper data analysis.

    To summarize, a few of the major tasks a data analyst is involved in are data acquisition, data management, data cleaning and filtering, data interpretation using statistical analysis, improving data quality and statistical efficiency, data visualization, and analytics reporting.

    1.5.2. Data scientist

    A data scientist possesses all the skills of a data analyst, with the additional skills of data wrangling, complex machine learning, Big Data tools, and software engineering. It is observed that both data analysts and data scientists use the same tools and practices; however, the scope and nature of the problems addressed by a data scientist differ from those of a data analyst. Data scientists mainly deal with large and complex data that can be of high dimension, and apply appropriate machine learning and visualization tools to convert the complex data into easily interpretable, meaningful information.

    Some of the fundamental prerequisites that a data scientist should be thorough with are as follows:

    Statistics: Statistics is the most essential prerequisite in the area of data science. Data science is largely about statistics, and to master data science, good knowledge of statistics is mandatory. The two kinds of statistics mostly used in data science are descriptive statistics and inferential statistics.

    Mathematics: To enhance one's skills in machine learning, a data scientist should have a profound knowledge of mathematics. The two most important mathematical topics used in data science are linear algebra and calculus. While linear algebra is the study of vectors and linear functions, calculus is the mathematical study of continuous change. Concepts of linear algebra such as tensors and vectors are used in many areas of machine learning; similarly, calculus is required in areas such as optimization techniques.

    Computer programming: A data scientist should love programming. Beyond basic computer application skills such as mastery of Microsoft Excel, a data scientist should be able to easily write code in Python or R for any given data science project. MS Excel can serve a beginner in data science as a basic tool, as it easily handles complex numerical calculations and allows plotting of data visualization graphs. Both Python and R are considered excellent programming tools for statistical analysis and machine learning.

    Database handling: A data scientist also often has to deal with data stored in databases. In the case of Relational Database Management Systems (RDBMS), a data scientist should be able to handle database queries using SQL commands. As data extraction is a primary task in data science, SQL is an important tool for accessing and manipulating data maintained in databases; a minimal sketch follows.
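
    The sketch below illustrates such database handling using Python's built-in sqlite3 module; the customers table and its columns are hypothetical.

```python
# A minimal sketch of SQL-based data extraction with Python's sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, city TEXT, spend REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, "Delhi", 120.0), (2, "Mumbai", 80.5), (3, "Delhi", 45.0)])

# A typical extraction query a data scientist might run before analysis.
for row in conn.execute(
        "SELECT city, COUNT(*), AVG(spend) FROM customers GROUP BY city"):
    print(row)
conn.close()
```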

    Data scientists can be engineers but are usually not involved in maintaining data architecture. The primary task of a data scientist is to use machine learning and deep learning-based techniques to make an in-depth analysis of input data. This is where a data analyst typically falls short, as an analyst may not possess much skill in machine learning or deep learning.

    1.5.3. Data engineer

    The job of a data engineer comes first; the data is then handed over to a data analyst or data scientist for analysis. Thus, the role of a data engineer is not to analyze data but rather to prepare, manage, and convert data into a form that can be readily used by a data analyst or data scientist. The advanced skills required by a data engineer are also quite different from those of the other two.

    With special training, a data engineer can design, build, integrate, and maintain data from multiple (homogeneous or heterogeneous) sources. A few of the prominent tasks a data engineer is involved in include the following:

    Developing and maintaining data architectures.

    Aligning data architectures with the business or project requirements.

    Improving data quality and raising data efficiency.

    Performing predictive and prescriptive modeling for given input data.

    Determining activities that can be automated.

    Engaging oneself with the other stakeholders to explain the details of the converted data so that it can be used by the data analyst or data scientist for further analysis.

    The major skills required to be a data engineer are Ruby, Java, C++, Python and/or R programming skills, Hive, NoSQL, MapReduce technologies, and MATLAB. Good knowledge of ETL tools and some popular APIs is an added benefit to the profile. Data engineers have a demanding role in data analytics, as they ensure that data is made available in a form that can be easily used for analysis and interpretation. If the raw data were not first handled by a data engineer, no machine learning or deep learning model would be able to handle the complex, bulky raw data initially received by the team for business analysis.

    1.5.4. Database administrator

    The Database Administrator (DBA), as the name suggests, operates and administers the database. The technical skills required by a DBA are SQL, scripting, database performance tuning, and system and network design. The backup and recovery of databases are also handled by a DBA. This job is critical, as a business functions properly only when the database is stored and managed well. A few of the prominent tasks a database administrator is involved in include the following:

    Database designing as per end-user requirements.

    Providing (or revoking) rights to (or from) database end-users.

    Enabling efficient data backup and data recovery mechanisms.

    Database related training to end-users.

    Ensuring data privacy and security.

    Managing data integrity for end-users.

    Monitoring the performances of the database.

    The proper functioning of databases is solely the responsibility of a DBA. If at any point the database fails, the DBA should be able to quickly and efficiently run data recovery mechanisms to restore its functioning. Thorough knowledge of SQL and related scripting languages equips a DBA to manage any database queries that the various end-users of a database need handled.

    1.5.5. Data architect

    The data architect provides the support of various tools and platforms required by data engineers to carry out various tests with precision. Data architects should be well equipped with knowledge of data modeling and data warehousing. The other additional skills required by a data architect are Extraction, Transformation, and Load (ETL), and knowledge of Hive, Pig, and Spark. A few of the prominent tasks a data architect is involved in include the following:

    Designing data models.

    Developing database solutions.

    Providing structural requirements for new software applications.

    Managing data migration and optimization of database systems.

    Providing Management Information System (MIS) support.

    Administering system performance by troubleshooting, testing, and assimilating new elements.

    The main task of data architects is to design and implement database systems, data models, and components of data architecture. Data architects also have wide knowledge of the various kinds of data available, both offline and in cloud environments, and possess the capability of managing data warehouses and ETL operations.

    1.5.6. Analytics manager

    The analytics manager is involved in the overall management of the various data analytics operations discussed in this section. For each of the stakeholder groups mentioned, the analytics manager deals with the team leader of each group and monitors and manages the work of each team. The major skills required to be an analytics manager are Python and/or R programming skills, Structured Query Language (SQL), and Statistical Analysis Software (SAS). An analytics manager should also have good leadership and social skills. A few of the prominent tasks an analytics manager is involved in include the following:

    Leading the data analysts’ team.

    Having a thorough understanding of the business requirements and objectives.

    Configuring and implementing data analytics solutions.

    Ensuring the quality results of the reports developed by every team.

    Keeping up to date with recent industry and business trends.

    The analytics manager should have the out-of-the-box thinking skills to lead and direct every team towards effective result generation. With leadership skills, the analytics manager skillfully manages the team and thoroughly studies the needs of a project to develop the best solutions for it.

    1.6. Data science tools

    There are many popular tools and techniques used by data scientists and data analysts. Most of these tools are user-friendly, freely available, and perform well in the field of data science. Let us discuss eight such tools that can be learned and adopted by any beginner or researcher who wants to explore the field of data science.

    1.6.1. Python programming

    Python is an open-source, object-oriented scripting language. It was created in the late 1980s by Guido van Rossum and is famous for implementing data preprocessing, statistical analysis, machine learning, and deep learning, which are the core tasks in any data science project. Python is versatile and can run on platforms such as UNIX, Windows, and Mac operating systems. It can also connect to database platforms such as SQL Server or a MongoDB database. This book contains illustrative Python code throughout the rest of the chapters for carrying out data analysis for various purposes. The rich set of libraries (more than 200,000) available for Python makes a Python programmer's life easy and interesting, and this is one of the core strengths of the language. Many types of visualization graphs can also be plotted using Python (explained in Chapter 3: Data Plotting and Visualization), which makes data interpretation easy for a data analyst.
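
    The short sketch below gives a flavor of this ecosystem, combining two of the most used third-party libraries, pandas for tabular data and matplotlib for plotting; the sales figures are made up for illustration.

```python
# A minimal sketch of the typical Python data science stack in action.
import matplotlib.pyplot as plt
import pandas as pd

sales = pd.DataFrame({"month": ["Jan", "Feb", "Mar", "Apr"],
                      "revenue": [120, 135, 128, 160]})

print(sales.describe())              # quick statistical summary
sales.plot(x="month", y="revenue", kind="bar", legend=False)
plt.ylabel("Revenue")
plt.tight_layout()
plt.show()
```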

    1.6.2. R programming

    R is also an open-source tool often used for data science. It was developed by Ross Ihaka and Robert Gentleman, both of whose first names start with the letter R, hence the name 'R'. Data handling and manipulation are easily done using R. It is versatile and can run on platforms such as UNIX, Windows, and Mac operating systems. R also has a rich collection of libraries (more than 11,556) that can be easily installed as required. This makes R popular and widely used in data analytics for major tasks such as classical statistical tests, time-series forecasting, and machine learning tasks such as classification and regression. Basic visualization graphs can also be plotted effortlessly in R, which makes data interpretation easy. While comparisons are often made between the two programming languages most used in data science, R and Python, data scientists generally conclude that neither has a clear-cut advantage over the other. Rather, a good grasp of both languages lets a data scientist switch between them based on the needs and requirements of the project.

    1.6.3. SAS

    SAS (Statistical Analysis System) is a programming environment and language used for advanced data handling in areas such as criminal investigation, business intelligence, and predictive analysis. It was initially released in 1976 and is written in C. It is supported on various operating systems such as Windows, Unix/Linux, and IBM mainframes. It is mainly used for integrating data from multiple sources and generating statistical results from the input data fed into the environment. SAS output can be generated in a wide variety of formats such as PDF, HTML, Excel, and many more. The software has more than 200 components, each dedicated to a specific task. For instance, the SAS/STAT component handles statistical analysis, the SAS/QC component deals with quality control, and the SAS/INSIGHT component manages data handling.

    1.6.4. Tableau Public

    Tableau is data visualization software whose free version is named Tableau Public. It was developed in 2003 by four founders from the United States. Its interface allows connectivity to both local and cloud-based data sources. The preparation, analysis, and presentation of input data can all be done in Tableau with drag-and-drop features and easily available menus. Tableau is well suited for big-data analytics and generates powerful data visualization graphs, which makes it very popular in the data analytics market. A very interesting functionality of Tableau is its ability to plot latitude and longitude coordinates for geospatial data and generate graphical maps from these coordinate values.

    1.6.5. Microsoft Excel

    Microsoft Excel is a data analytics tool widely used for its simplicity and easy handling of complex data analytical tasks. It was released in 1987 by Microsoft to handle numerical calculations efficiently. It is a spreadsheet application that can handle complex numerical calculations, generate pivot tables, and display graphics. An analyst who uses R, Python, SAS, or Tableau will often still use MS Excel for its simplicity and efficient data modeling capabilities. However, it is not open-source and can be used on machines running Windows, macOS, or Android.

    1.6.6. RapidMiner

    RapidMiner is a data science software platform developed by the RapidMiner Company in the year 2006. It is written in the Java language and has a GUI that is used for designing and executing workflows related to data analytics. It also has template-based frameworks that can handle several data analysis tasks such as data preprocessing, data mining, machine learning, ETL handling, and data visualization. The RapidMiner Studio Free Edition has one logical processor and can be used by a beginner who wants to master the software for data analysis.

    1.6.7. Knime

    Knime (Konstanz Information Miner) Analytics platform is an open-source data analytics and reporting platform. Knime was developed in 2004 by a team of software engineers from Germany. It is mainly used for applying statistical analysis, data mining, ETL handling, and machine learning. The Knime workbench has several components such as Knime Explorer, Workflow editor, Workflow Coach, Node Repository, Description, Outline, Knime Hub Search, and Console. Here, the individual tasks are represented as nodes which are displayed as colored boxes and also have input ports, output ports, and status. The interconnected nodes form a workflow that can be used in a data analytics project for performing various tasks such as reading a file, data transformations, and creating visualizations. The core architecture of Knime is designed in such a way that it practically has almost no limitations on the input data fed into the system. This is a big advantage of using Knime as a data science tool as large volumes of data are needed to be dealt with for analysis in data science.

    1.6.8. Apache Spark

    Apache Spark is open-source software that became a top-level Apache project in 2014. It is versatile and can run on platforms such as UNIX, Windows, and Mac operating systems. Spark has the remarkable advantage of high speed when dealing with large datasets and is found to be more efficient than the MapReduce technique used in a Hadoop framework. Apache Spark mainly consists of the Spark Core, a distributed execution engine. Many libraries built on top of the Spark Core enable data analysis tasks such as handling SQL queries, drawing visualization graphs, and machine learning. Other than the Spark Core, the components available in Apache Spark are Spark SQL, Spark Streaming, MLlib (Machine Learning Library), and GraphX.
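
    A minimal PySpark sketch is given below; it assumes the pyspark package is installed and that a hypothetical sales.csv file with region and amount columns exists.

```python
# A minimal sketch of the Spark DataFrame API built on the Spark Core
# (assumes pyspark is installed and sales.csv is a hypothetical local file).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("quick-demo").getOrCreate()

df = spark.read.csv("sales.csv", header=True, inferSchema=True)
df.groupBy("region").agg(F.sum("amount").alias("total")).show()

spark.stop()
```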

    There are several other data science tools used by data scientists. The eight tools listed above are very popular, as they are freely downloadable and can be explored to learn their utilities. In reality, a data scientist will not work with only one tool but will use a combination of analytics tools based on efficiency and the requirements of the project.

    1.7. Fundamental areas of study in data science

    Data science is a broad term that encompasses multiple disciplines. It is a rapidly growing field of study that uses scientific methods to extract meaningful insights from given input data. The rapid growth of the field has encouraged interested researchers to explore the multiple disciplines that data science encompasses. Let us discuss a few of these broad areas, which are fundamental aspects to be covered for mastering data science.

    1.7.1. Machine learning

    Both machine learning and data science are buzzwords in today's technical world. Though data science includes machine learning as one of its fundamental areas of study, machine learning in itself is a vast research area that requires good skills and experience to master. The basic idea of machine learning is to allow machines (computers) to learn independently from the wealth of data fed into them. To master machine learning, a learner needs in-depth knowledge of computer fundamentals, programming skills, data modeling and evaluation skills, probability, and statistics.

    With the advancement of new technology, machines are being trained to emulate human decision-making capability. In doing so, it is necessary to automate decisions that machines can infer through interaction with the environment and understanding of past knowledge. The field of machine learning deals with all those algorithms that help machines train themselves in this process. Machine learning techniques are broadly categorized into three types: supervised machine learning, unsupervised machine learning, and reinforcement learning. To master data science, it is good to be thorough with all the types of machine learning, which a data scientist uses extensively to extract meaningful output from input data.
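
    The sketch below contrasts two of the three categories on the Iris data: a supervised classifier that learns from labels, and an unsupervised clustering algorithm that ignores them; the specific estimators are illustrative choices.

```python
# A minimal sketch contrasting supervised learning (labels available)
# with unsupervised learning (no labels).
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Supervised: learn a mapping from features X to known labels y.
clf = KNeighborsClassifier().fit(X, y)
print("predicted class:", clf.predict(X[:1]))

# Unsupervised: discover structure in X without using y at all.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster of first sample:", km.labels_[0])
```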

    1.7.2. Deep learning

    Deep learning is often used in data science as it is computationally very capable compared to traditional machine learning methods, which require human intervention (such as manual feature engineering) before training. Big players in the market such as Google, Microsoft, and Amazon deal with large volumes of data daily for business analysis and effective decision-making. Deep learning helps analyze bulk amounts of data through a hierarchical learning process. The data generated in these companies is massive, raw, and unstructured, and deep learning approaches are used to generate meaningful results from it.

    Deep learning approaches have proven to outperform other machine learning techniques, especially in image and speech recognition. A deep learning network performs representation learning, incorporating multiple levels of representation. In a simple sense, the higher levels of the network amplify the input aspects relevant to classification while suppressing irrelevant features that do not contribute to the classification process. The interesting fact is that these layers of features in a deep network are not designed by human engineers but are learned from data using general-purpose learning procedures.
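
    A minimal sketch of such a multi-level network is shown below using the Keras API (assuming TensorFlow is installed); the synthetic data, layer sizes, and epoch count are arbitrary illustrative choices.

```python
# A minimal sketch of a small multi-layer network; each Dense layer
# learns its own level of representation from the data.
import numpy as np
import tensorflow as tf

X = np.random.rand(200, 10)                  # 200 synthetic samples
y = (X.sum(axis=1) > 5).astype("float32")    # toy binary target

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),    # deeper representation
    tf.keras.layers.Dense(1, activation="sigmoid"),  # classification output
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=5, verbose=0)
print(model.evaluate(X, y, verbose=0))       # [loss, accuracy]
```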

    1.7.3. Natural Language Processing (NLP)

    Natural Language Processing (NLP) remains a standard requirement in the field of data science. NLP is a branch of artificial intelligence, just like machine learning. NLP focuses on bridging the gap between human communication and computer understanding. Nowadays, thanks to NLP, it is possible to analyze language-based data much as humans do: reading text, understanding speech, measuring sentiment in text, and extracting valuable text from bulk material. The field of NLP is highly beneficial for resolving ambiguity in the various languages spoken worldwide and is a key area of study for text analytics as well as speech recognition.

    NLP, as an important branch of data science, plays a vital role in extracting insights from input text. Industry experts predict that the demand for NLP in data science will grow immensely in the years to come. One key area where NLP plays a pivotal role in data science is dealing with multi-channel data such as mobile data or social media data. Through NLP, these multi-channel data are assessed and evaluated to understand customer sentiments, moods, and priorities. NLP has already emerged as a game-changer in the field of data science and business analytics.
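
    The toy sketch below shows the simplest possible form of sentiment measurement, a lexicon lookup over naively tokenized text; the tiny lexicon is hypothetical, and real NLP systems use far richer linguistic processing.

```python
# A purely illustrative sketch of lexicon-based sentiment scoring;
# the lexicon below is a hypothetical toy example.
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "terrible": -2}

def sentiment(text: str) -> int:
    """Sum the scores of known words after naive whitespace tokenization."""
    tokens = text.lower().split()
    return sum(LEXICON.get(token, 0) for token in tokens)

print(sentiment("the delivery was great and I love the product"))  # positive
print(sentiment("terrible support and bad packaging"))             # negative
```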

    1.7.4. Statistical data analysis

    Statistics is a branch of mathematics that includes the collection, analysis, interpretation, and validation of stored data. Statistical data analysis allows the execution of statistical operations using quantitative approaches. A few important concepts in statistical data analysis include descriptive statistics, data distributions, conditional probability, hypothesis testing, and regression. Statistical analysis is an essential area of study in data analytics, as it provides tools and techniques for analyzing and drawing inferences from the provided data. It is an excellent discipline for handling data that needs to be analyzed, and for dealing with uncertainty by quantifying results.

    There are two main kinds of statistics: descriptive statistics and inferential statistics. While descriptive statistics are mainly used for presenting, organizing, and summarizing the data of a given dataset, inferential statistics are used to draw conclusions about a population based on data observed in a sample. Statistical data analysis also deals with data that is essentially of two types, namely continuous data and discrete data. The fundamental difference between the two is that continuous data do not have separate distinct values and cannot be counted, whereas discrete data are distinct and can be counted.
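
    The sketch below pairs the two kinds of statistics: descriptive means summarize two hypothetical samples, while SciPy's two-sample t-test draws an inference about whether their population means differ.

```python
# A minimal sketch of descriptive versus inferential statistics
# on hypothetical samples from two groups.
from scipy import stats

group_a = [23.1, 25.3, 24.8, 26.0, 24.5]
group_b = [21.0, 22.4, 20.9, 23.1, 21.7]

# Descriptive statistics summarize each sample...
print("means:", sum(group_a) / len(group_a), sum(group_b) / len(group_b))

# ...while the t-test infers whether the population means likely differ.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # small p suggests a real difference
```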

    1.7.5. Knowledge discovery and data mining

    Data mining, a major step in Knowledge Discovery from Data (KDD), has evolved into a prominent field over the years, as the demand for discovering meaningful patterns in data has given rise to meaningful outputs for data analysis. We are living in a data age where immense volumes of data are generated every second. However, we may be data-rich yet information-poor if these data are not utilized properly. Data alone makes no sense in the analysis world until it is converted and interpreted into some meaningful form, and this is done through the process of data mining in KDD.

    A few prominent applications of data mining include target marketing, customer relationship management, loan approval decision-making in banking, identifying customer behavior in retail industries, and fraud detection in financial and other sectors. KDD includes a series of clearly defined steps: data selection, data cleaning, data integration, data transformation, data mining, and pattern evaluation. Data mining tasks are either descriptive or predictive. Descriptive data mining tasks help find human-interpretable patterns that describe the data; a few examples include sequential pattern discovery, clustering, and association rule mining. Predictive data mining tasks, on the other hand, use some variables to predict unknown or future values of other variables; a few examples include classification, regression, and deviation detection.
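
    As a small taste of descriptive mining, the sketch below counts the support of item pairs over a handful of hypothetical market-basket transactions, which is the core counting step behind association rule mining.

```python
# A minimal sketch of frequent pair discovery over hypothetical transactions.
from collections import Counter
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support = fraction of transactions containing the pair.
for pair, count in pair_counts.most_common(3):
    print(pair, "support =", count / len(transactions))
```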

    1.7.6. Text mining

    Text mining is similar to text analytics and involves deriving high-quality information from text. It is a variation of data mining that derives high-quality information by discovering patterns and trends using methods such as statistical pattern learning. Some prominent text mining tasks include text clustering, document summarization, sentiment analysis, text categorization, and concept extraction. In data science, text mining broadly involves treating text as input data and then applying various analyses, such as lexical analysis or pattern recognition, to interpret the information gathered from the given text.

    Text analytics may involve statistical and machine learning techniques for mining textual sources of data. Text analytics is extensively used for research in data science, business intelligence, and exploratory data analysis. The seven main steps involved in text analytics are language identification, tokenization, sentence breaking, part-of-speech tagging, chunking, syntax parsing, and sentence chaining. While the term text mining was widely used initially in the context of data mining, the term text analytics is more often used nowadays, being a promising area in the field of data science.
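
    The sketch below shows one standard first step of text mining, converting raw documents into a TF-IDF document-term matrix with scikit-learn; the three example documents are made up.

```python
# A minimal sketch of turning raw text into mineable numeric features.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "data science extracts insight from data",
    "text mining derives information from text",
    "machine learning models learn from data",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)     # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # learned vocabulary
print(tfidf.shape)                         # (3 documents, n terms)
```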

    1.7.7. Recommender systems

    Various web services such as Amazon, YouTube, and Netflix, and e-commerce sites such as Flipkart and Snapdeal, use recommender systems to suggest new and relevant items to online users. The items suggested (such as videos, music, appliances, or books) are based on the types of items the user accesses on a particular website. This indirectly provides a pleasant user experience and drastically increases the revenue of these businesses. In a typical recommender system, a dataset containing customer and product information is fed as input to a filtering technique. There are many standard filtering techniques applied in recommender systems; four widely used ones are collaborative filtering, content-based filtering, demographic filtering, and hybrid filtering. The choice of filtering technique largely depends on the type of data the recommender system will process and the type of recommendations it needs to generate. After filtering, a memory-based or model-based recommendation method is applied to predict items for a list of users, and finally top-N recommendations are given as output for each user.
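
    A minimal sketch of user-based collaborative filtering is given below: users are compared by cosine similarity over a hypothetical rating matrix, and unseen items are scored by similarity-weighted ratings to produce a top-N style recommendation.

```python
# A minimal sketch of user-based collaborative filtering on a hypothetical
# user-item rating matrix (rows: users, columns: items, 0 = unrated).
import numpy as np

ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

target = 0  # recommend for the first user
sims = np.array([cosine(ratings[target], r) for r in ratings])
sims[target] = 0                       # ignore self-similarity

# Score unseen items by similarity-weighted ratings of the other users.
scores = sims @ ratings
scores[ratings[target] > 0] = -np.inf  # mask items already rated
print("top recommendation: item", int(np.argmax(scores)))
```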

    Nowadays, building an efficient recommender system is part and parcel of every online business, as it indirectly helps generate a huge amount of revenue and makes the business flourish relative to its competitors. There are, however, several noteworthy challenges that all recommender system techniques face in generating top recommendations for a site's customers, which can be summarized as follows:

    A recommender system may have thousands, lakhs, or even millions of distinct products as well as visiting customers on an e-commerce site, all of which have to be considered when providing recommendations.

    The cold-start problem arises for first-time customers who have never visited the e-commerce site before, so no information about previous activities can be fetched to provide recommendations.

    Older customers, on the other hand, may have an abundance of stored information based on the purchases and ratings they have made as frequent visitors.

    The most challenging task is to generate recommendations in a real-time setup, which demands that the recommender technique provide quick results in not more than half a second while also maintaining optimum accuracy of recommendations.

    Recommender systems have become essential in every industry, business and service sectors, and, hence, have received much attention in recent years. The three main phases involved
