Practical Data Science: A Guide to Building the Technology Stack for Turning Data Lakes into Business Assets
About this ebook
The data science technology stack demonstrated in Practical Data Science is built from components in general use in the industry. Data scientist Andreas Vermeulen demonstrates in detail how to build and provision a technology stack to yield repeatable results. He shows you how to apply practical methods to extract actionable business knowledge from data lakes consisting of data from a polyglot of data types and dimensions.
What You'll Learn
- Become fluent in the essential concepts and terminology of data science and data engineering
- Build and use a technology stack that meets industry criteria
- Master the methods for retrieving actionable business knowledge
- Coordinate the handling of polyglot data types in a data lake for repeatable results
Who This Book Is For
Data scientists and data engineers who are required to convert data from a data lake into actionable knowledge for their business, and students who aspire to be data scientists and data engineers
Book preview
Practical Data Science - Andreas François Vermeulen
© Andreas François Vermeulen 2018
Andreas François Vermeulen, Practical Data Science, https://doi.org/10.1007/978-1-4842-3054-1_1
1. Data Science Technology Stack
Andreas François Vermeulen¹
(1)
West Kilbride, North Ayrshire, UK
The Data Science Technology Stack covers the data processing requirements in the Rapid Information Factory ecosystem. Throughout the book, I will discuss the stack as the guiding pattern.
In this chapter, I will help you to recognize the basics of data science tools and their influence on modern data lake development. You will discover the techniques for transforming a data vault into a data warehouse bus matrix. I will explain the use of Spark, Mesos, Akka, Cassandra, and Kafka to tame your data science requirements.
I will guide you in the use of Elasticsearch and MQTT (MQ Telemetry Transport) to enhance your data science solutions. I will help you to recognize the influence of R as a creative visualization solution. I will also introduce the impact and influence on the data science ecosystem of such programming languages as R, Python, and Scala.
Rapid Information Factory Ecosystem
The Rapid Information Factory ecosystem is a convention of techniques I use for my individual processing developments. The processing route of the book will be formulated on this basis, but you are not bound to use it exclusively. The tools I discuss in this chapter are available to you without constraint. The tools can be used in any configuration or permutation that is suitable to your specific ecosystem.
I recommend that you begin to formulate an ecosystem of your own or simply adopt mine. As a prerequisite, you must become accustomed to a set of tools you know well and can deploy proficiently.
Note
Remember: Your data lake will have its own properties and features, so adapt your tools to those particular characteristics.
Data Science Storage Tools
This data science ecosystem has a series of tools that you use to build your solutions. This environment is undergoing a rapid advancement in capabilities, and new developments are occurring every day.
I will explain the tools I use in my daily work to perform practical data science. Next, I will discuss the following basic data methodologies.
Schema-on-Write and Schema-on-Read
There are two basic methodologies that are supported by the data processing tools. Following is a brief outline of each methodology and its advantages and drawbacks.
Schema-on-Write Ecosystems
A traditional relational database management system (RDBMS) requires a schema before you can load the data. You may have been running standard SQL queries against such structured data schemas for a number of years.
Benefits include the following:
In traditional data ecosystems, tools assume a schema and can only work once that schema is described, so there is only one view of the data.
The approach is extremely valuable for articulating relationships between data points, as the relationships are configured in advance.
It is an efficient way to store dense data.
All the data is in the same data store.
On the other hand, schema-on-write isn’t the answer to every data science problem. Among the downsides of this approach are that
Its schemas are typically purpose-built, which makes them hard to change and maintain.
It generally loses the raw/atomic data as a source for future analysis.
It requires considerable modeling/implementation effort before being able to work with the data.
If a specific type of data can’t be stored in the schema, you can’t effectively process it from the schema.
At present, schema-on-write is a widely adopted methodology to store data.
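As a minimal illustration of the write-time contract, the following Python sketch (using the standard-library sqlite3 module; the table, columns, and values are invented for this example) declares a schema before any row can be loaded:

```python
import sqlite3

# Schema-on-write: the table structure must be declared before any load.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE sales (
        sale_date TEXT NOT NULL,
        store_id  TEXT NOT NULL,
        amount    REAL NOT NULL
    )
""")

# Only data that fits the declared columns can be written.
con.execute("INSERT INTO sales VALUES ('2018-01-05', 'store-17', 129.99)")

# Standard SQL queries work immediately against the known schema.
total = con.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 129.99
```

A row that does not fit the declared columns is rejected at load time, which is both the rigidity and the safety that the points above describe.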
Schema-on-Read Ecosystems
This alternative data storage methodology does not require a schema before you can load the data. Fundamentally, you store the data with minimum structure. The essential schema is applied during the query phase.
Benefits include the following:
It provides flexibility to store unstructured, semi-structured, and disorganized data.
It allows for unlimited flexibility when querying data from the structure.
Leaf-level data is kept intact and untransformed for reference and use for the future.
The methodology encourages experimentation and exploration.
It increases the speed of generating fresh actionable knowledge.
It reduces the cycle time between data generation and the availability of actionable knowledge.
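To contrast with the previous section, here is a small Python sketch (the file contents and column names are invented) that stores raw text exactly as it arrives and applies names and types only at query time:

```python
import csv
import io

# Raw events land exactly as generated; no schema is enforced at write time.
raw_store = [
    "2018-01-05,store-17,sale,129.99",
    "2018-01-05,store-17,refund,-20.00",
]

def total_sales(raw_lines):
    """Apply the schema only now, at read time: name and type each column."""
    reader = csv.reader(io.StringIO("\n".join(raw_lines)))
    rows = [
        {"date": d, "store": s, "kind": k, "amount": float(a)}
        for d, s, k, a in reader
    ]
    return sum(r["amount"] for r in rows if r["kind"] == "sale")

print(total_sales(raw_store))  # 129.99
```

Because the raw lines are untouched, a later query is free to apply a completely different schema to the same stored data.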
Schema-on-read methodology is expanded on in Chapter 6.
I recommend a hybrid between schema-on-read and schema-on-write ecosystems for effective data science and engineering. I will discuss in detail why this specific ecosystem is the optimal solution when I cover the functional layer’s purpose in data science processing.
Data Lake
A data lake is a storage repository for a massive amount of raw data. It stores data in native format, in anticipation of future requirements. You will acquire insights from this book on why this is extremely important for practical data science and engineering solutions. While a schema-on-write data warehouse stores data in predefined databases, tables, and records structures, a data lake uses a less restricted schema-on-read-based architecture to store data. Each data element in the data lake is assigned a distinctive identifier and tagged with a set of comprehensive metadata tags.
A data lake is typically deployed using distributed data object storage, to enable the schema-on-read structure. This means that business analytics and data mining tools access the data without a complex schema. Using a schema-on-read methodology enables you to load your data as is and start to get value from it instantaneously.
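A toy sketch of those two ingredients, a distinctive identifier and comprehensive metadata tags per element, using only the Python standard library (the directory layout and tag names are my own invention, not a prescribed format):

```python
import json
import pathlib
import tempfile
import uuid

# A toy data lake: every element is kept in its native format under a
# distinctive identifier, with metadata tags stored alongside it.
lake = pathlib.Path(tempfile.mkdtemp())

def put(raw_bytes, tags):
    element_id = str(uuid.uuid4())                 # distinctive identifier
    (lake / element_id).write_bytes(raw_bytes)     # raw data, untouched
    (lake / (element_id + ".meta.json")).write_text(json.dumps(tags))
    return element_id

eid = put(b"date,store,amount\n2018-01-05,store-17,129.99\n",
          {"source": "point-of-sale", "format": "csv"})
meta = json.loads((lake / (eid + ".meta.json")).read_text())
print(meta["source"])  # point-of-sale
```

The tags make schema-on-read practical: a query tool can discover what an element is without the element having been forced into a predefined table.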
I will discuss and provide more details on the reasons for using a schema-on-read storage methodology in Chapters 6–11.
For deployment onto the cloud, it is a cost-effective solution to use Amazon’s Simple Storage Service (Amazon S3) to store the base data for the data lake. I will demonstrate the feasibility of using cloud technologies to provision your data science work. It is, however, not necessary to access the cloud to follow the examples in this book, as they can easily be processed using a laptop.
Data Vault
Data vault modeling, designed by Dan Linstedt, is a database modeling method intentionally structured to provide long-term historical storage of data from multiple operational systems. The data vaulting process transforms the schema-on-read data lake into a schema-on-write data vault: the data vault is designed as a schema-on-read query request and then executed against the data lake.
I have also seen the results stored in a schema-on-write format, to persist the results for future queries. The techniques for both methods are discussed in Chapter 9. At this point, I expect you to understand only the rudimentary structures required to formulate a data vault.
The structure is built from three basic data structures: hubs, links, and satellites. Let's examine these specific data structures, to clarify why they are compulsory.
Hubs
Hubs contain a list of unique business keys with low propensity to change. They contain a surrogate key for each hub item and metadata classification of the origin of the business key.
The hub is the core backbone of your data vault, and in Chapter 9, I will discuss in more detail how and why you use this structure.
Links
Associations or transactions between business keys are modeled using link tables. These tables are essentially many-to-many join tables, with specific additional metadata.
The link is a singular relationship between hubs to ensure the business relationships are accurately recorded to complete the data model for the real-life business. In Chapter 9, I will explain how and why you would require specific relationships.
Satellites
Hubs and links form the structure of the model but store no chronological or descriptive characteristics of the data. These characteristics are stored in dedicated tables identified as satellites.
Satellites are the structures that store comprehensive levels of the information on business characteristics and are normally the largest volume of the complete data vault data structure. In Chapter 9, I will explain how and why these structures work so well to model real-life business characteristics.
The appropriate combination of hubs, links, and satellites helps the data scientist to construct and store prerequisite business relationships. This is a highly in-demand skill for a data modeler.
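To make the three structures concrete, here is a small SQLite sketch (the table and column names are illustrative, not Linstedt's canonical layout) with two hubs, one link, and one satellite, joined back into a business fact:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Hubs: unique business keys, a surrogate key, and origin metadata.
con.execute("CREATE TABLE hub_customer (ck INTEGER PRIMARY KEY, "
            "business_key TEXT UNIQUE, record_source TEXT)")
con.execute("CREATE TABLE hub_product (pk INTEGER PRIMARY KEY, "
            "business_key TEXT UNIQUE, record_source TEXT)")

# Link: a many-to-many association between the two hubs.
con.execute("CREATE TABLE link_purchase (ck INTEGER, pk INTEGER, load_date TEXT)")

# Satellite: descriptive characteristics, hung off the customer hub.
con.execute("CREATE TABLE sat_customer (ck INTEGER, load_date TEXT, name TEXT)")

con.execute("INSERT INTO hub_customer VALUES (1, 'CUST-001', 'crm')")
con.execute("INSERT INTO hub_product VALUES (1, 'PROD-042', 'shop')")
con.execute("INSERT INTO link_purchase VALUES (1, 1, '2018-01-05')")
con.execute("INSERT INTO sat_customer VALUES (1, '2018-01-05', 'Hillman Ltd')")

# Recombine hub, link, and satellite into a business fact.
row = con.execute("""
    SELECT s.name, hp.business_key
    FROM link_purchase AS l
    JOIN hub_customer AS hc ON hc.ck = l.ck
    JOIN sat_customer AS s ON s.ck = hc.ck
    JOIN hub_product AS hp ON hp.pk = l.pk
""").fetchone()
print(row)  # ('Hillman Ltd', 'PROD-042')
```

Note how the hubs hold only keys, the link holds only the relationship, and every descriptive attribute lives in the satellite.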
The transformation to this schema-on-write data structure is discussed in detail in Chapter 9, to point out why a particular structure supports the processing methodology. I will explain in that chapter why you require particular hubs, links, and satellites.
Data Warehouse Bus Matrix
The Enterprise Bus Matrix is a data warehouse planning tool and model created by Ralph Kimball and used by numerous practitioners worldwide for decades. The bus matrix and architecture build upon the concept of conformed dimensions that are interlinked by facts.
The data warehouse is a major component of the solution required to transform data into actionable knowledge. This schema-on-write methodology supports business intelligence against the actionable knowledge. In Chapter 10, I provide more details on this data tool and give guidance on its use.
Data Science Processing Tools
Now that I have introduced data storage, the next step involves processing tools to transform your data lakes into data vaults and then into data warehouses. These tools are the workhorses of the data science and engineering ecosystem. Following are the recommended foundations for the data tools I use.
Spark
Apache Spark is an open source cluster computing framework. Originally developed at the AMPLab of the University of California, Berkeley, the Spark code base was donated to the Apache Software Foundation, which now maintains it as an open source project. This tool is evolving at an incredible rate.
IBM is committing more than 3,500 developers and researchers to work on Spark-related projects and formed a dedicated Spark technology center in San Francisco to pursue Spark-based innovations.
SAP, Tableau, and Talend now support Spark as part of their core software stack. Cloudera, Hortonworks, and MapR distributions support Spark as a native interface.
Spark offers an interface for programming distributed clusters with implicit data parallelism and fault tolerance. Spark is becoming a de facto standard for numerous enterprise-scale processing applications.
The following Spark modules form part of my technology toolkit.
Spark Core
Spark Core is the foundation of the overall development. It provides distributed task dispatching, scheduling, and basic I/O functionalities.
This enables you to offload the comprehensive and complex running environment to the Spark Core, which ensures that the tasks you submit are accomplished as anticipated. The distributed nature of the Spark ecosystem enables you to run the same processing request on a small Spark cluster and then on a cluster of thousands of nodes, without any code changes. In Chapter 10, I will discuss how you accomplish this.
Spark SQL
Spark SQL is a component on top of Spark Core that presents a data abstraction called DataFrames. Spark SQL makes accessible a domain-specific language (DSL) to manipulate DataFrames. This feature of Spark eases the transition from traditional SQL environments into the Spark environment. I have recognized its advantage when you want to enable legacy applications to offload data from their traditional relational-only data storage to the data lake ecosystem.
Spark Streaming
Spark Streaming leverages Spark Core’s fast scheduling capability to perform streaming analytics. Spark Streaming has built-in support to consume from Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP/IP sockets. The process of streaming is the primary technique for importing data from the data source to the data lake.
Streaming is becoming the leading technique to load from multiple data sources. I have found that there are connectors available for many data sources. There is a major drive to build even more improvements on connectors, and this will improve the ecosystem even further in the future.
In Chapters 7 and 11, I will discuss the use of streaming technology to move data through the processing layers.
MLlib Machine Learning Library
Spark MLlib is a distributed machine learning framework used on top of the Spark Core by means of the distributed memory-based Spark architecture.
In Spark 2.0, a new library, spark.ml, was introduced to replace the RDD-based data processing with a DataFrame-based model. It is planned that, by Spark 3.0, only DataFrame-based models will remain.
Common machine learning and statistical algorithms have been implemented and are shipped with MLlib, which simplifies large-scale machine learning pipelines, including
Dimensionality reduction techniques, such as singular value decomposition (SVD) and principal component analysis (PCA)
Summary statistics, correlations, stratified sampling, hypothesis testing, and random data generation
Collaborative filtering techniques, including alternating least squares (ALS)
Classification and regression: support vector machines, logistic regression, linear regression, decision trees, and naive Bayes classification
Cluster analysis methods, including k-means and latent Dirichlet allocation (LDA)
Optimization algorithms, such as stochastic gradient descent and limited-memory BFGS (L-BFGS)
Feature extraction and transformation functions
In Chapter 10, I will discuss the use of machine learning proficiency to support the automatic processing through the layers.
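As a flavor of the cluster-analysis entry in the list above, here is a minimal pure-Python k-means sketch; MLlib's implementation runs the same idea distributed across a cluster (the data points and function name here are invented for illustration):

```python
import random

def kmeans(points, k, iterations=20, seed=42):
    """A tiny k-means sketch: assign points to nearest center, re-average."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k initial centers from the data
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to its nearest center
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # recompute each center as the mean of its cluster
        centers = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = sorted(kmeans(points, 2))
print(centers)
```

For this toy data set the two centers settle near (0.33, 0.33) and (10.33, 10.33); MLlib adds the distributed bookkeeping that makes the same loop scale to billions of points.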
GraphX
GraphX is a powerful graph-processing application programming interface (API) for the Apache Spark analytics engine that can draw insights from large data sets. GraphX provides outstanding speed and capacity for running massively parallel and machine-learning algorithms.
The introduction of the graph-processing capability enables the processing of relationships between data entries with ease. In Chapters 9 and 10, I will discuss the use of a graph database to support the interactions of the processing through the layers.
Mesos
Apache Mesos is an open source cluster manager that was developed at the University of California, Berkeley. It delivers efficient resource isolation and sharing across distributed applications. The software enables resource sharing in a fine-grained manner, improving cluster utilization.
The enterprise version of Mesos is Mesosphere Enterprise DC/OS. It runs containers elastically, and its data services support Kafka, Cassandra, Spark, and Akka.
In a microservices architecture, I aim to construct services from fine-grained processing units that communicate through lightweight protocols across the layers. In Chapter 6, I will discuss the use of fine-grained microservices know-how to support data processing through the framework.
Akka
Akka is an actor-based, message-driven runtime for building concurrent, elastic, and resilient processes, and its toolkit and runtime methods shorten the development of large-scale data-centric applications. The use of high-level abstractions such as actors, streams, and futures facilitates the fine-grained processing units of data science and engineering.
The use of actors enables the data scientist to spawn a series of concurrent processes by using a simple processing model that employs a messaging technique and specific predefined actions/behaviors for each actor. This way, each actor can be controlled and limited to performing only its intended tasks. In Chapters 7–11, I will discuss the use of different fine-grained granularity processes to support data processing throughout the framework.
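The actor model described above can be sketched in plain Python with a thread and a mailbox queue (the class and message names are my own invention; Akka's real API is richer and JVM-based):

```python
import queue
import threading

class Actor:
    """A toy actor: one mailbox, one predefined behavior, one message at a time."""

    def __init__(self, behavior):
        self.mailbox = queue.Queue()
        self.state = []               # only this actor's thread touches state
        self.behavior = behavior
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        while True:
            msg = self.mailbox.get()
            if msg is None:           # "poison pill": stop the actor
                break
            self.behavior(self.state, msg)

    def tell(self, msg):              # fire-and-forget message send
        self.mailbox.put(msg)

    def stop(self):
        self.mailbox.put(None)
        self.thread.join()

# Behavior: predefined action applied to every incoming message.
stage = Actor(lambda state, msg: state.append(msg.upper()))
for step in ("load", "clean", "model"):
    stage.tell(step)
stage.stop()
print(stage.state)  # ['LOAD', 'CLEAN', 'MODEL']
```

Because messages are processed one at a time from the mailbox, the actor's state is never touched concurrently, which is the property that lets many such actors run safely in parallel.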
Cassandra
Apache Cassandra is a large-scale distributed database supporting multi–data center replication for availability, durability, and performance.
I use DataStax Enterprise (DSE) mainly to accelerate my own ability to deliver real-time value at epic scale, by providing a comprehensive and operationally simple data management layer with a unique always-on architecture built on Apache Cassandra. The standard Apache Cassandra open source version works just as well, minus some extra management features it does not offer as standard. I will just note that, for graph databases, as an alternative to GraphX, I am currently also using DataStax Enterprise Graph. In Chapters 7–11, I will discuss the use of these large-scale distributed database models to process data through data science structures.
Kafka
This is a high-scale messaging backbone that enables communication between data processing entities. The Apache Kafka streaming platform, consisting of Kafka Core, Kafka Streams, and Kafka Connect, is the foundation of the Confluent Platform.
The Confluent Platform is the main commercial supporter for Kafka (see www.confluent.io). Most of the Kafka projects I am involved with now use this platform. Kafka components empower the capture, transfer, processing, and storage of data streams in a distributed, fault-tolerant manner throughout an organization in real time.
Kafka Core
At the core of the Confluent Platform is Apache Kafka. Confluent extends that core to make configuring, deploying, and managing Kafka less complex.
Kafka Streams
Kafka Streams is an open source solution that you can integrate into your application to build and execute powerful stream-processing functions.
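The classic introductory Kafka Streams application is a word count; the following in-memory Python sketch mimics that topology's consume, flat-map, and count steps without any Kafka infrastructure (the topic contents are invented):

```python
from collections import Counter

# An in-memory sketch of a word-count stream topology: records flow in,
# are split into words, and are counted into a continuously updated table.
input_topic = [
    "the lake feeds the vault",
    "the vault feeds the warehouse",
]
counts = Counter()                    # stands in for the state store

for record in input_topic:            # consume each record from the stream
    for word in record.split():       # flatMap: one record becomes many words
        counts[word] += 1             # groupByKey + count into the state store

print(counts["the"])  # 4
```

In a real Kafka Streams application the same three steps run continuously against live topics, with the state store replicated for fault tolerance.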
Kafka Connect
Kafka Connect provides Confluent-tested, secure connectors for numerous standard data systems. Connectors make it quick and stress-free to set up consistent data pipelines. These connectors are completely integrated with the platform, via the schema registry.
Kafka Connect enables the data processing capabilities that move data from the edge of the business ecosystem into the core of the data solution. In Chapters 7–11, I will discuss the use of this messaging pipeline to stream data through the configuration.
Elasticsearch
Elasticsearch is a distributed, open source search and analytics engine designed for horizontal scalability, reliability, and stress-free management. It combines the speed of search with the power of analytics, via a sophisticated, developer-friendly query language covering structured, unstructured, and time-series data. In Chapter 11, I will discuss the use of Elasticsearch to categorize data within the framework.
R
R is a programming language and software environment for statistical computing and graphics. The R language is widely used by data scientists, statisticians, data miners, and data engineers for developing statistical software and performing data analysis.
The capabilities of R are extended through user-created packages using specialized statistical techniques and graphical procedures. A core set of packages is contained within the core installation of R, with additional packages accessible from the Comprehensive R Archive Network (CRAN).
Knowledge of the following packages is a must:
sqldf (data frames using SQL): This package reads a file into R while filtering data with an SQL statement. Only the filtered part is processed by R, so files larger than those R can natively import can be used as data sources.
forecast (forecasting of time series): This package provides forecasting functions for time series and linear models.
dplyr (data aggregation): Tools for splitting, applying, and combining data within R
stringr (string manipulation): Simple, consistent wrappers for common string operations
RODBC, RSQLite, and RCassandra database connection packages: These are used to connect to databases, manipulate data outside R, and enable interaction with the source system.
lubridate (time and date manipulation): Makes dealing with dates easier within R
ggplot2 (data visualization): Creates elegant data visualizations, using the grammar of graphics. This is a superb visualization capability.
reshape2 (data restructuring): Flexibly restructures and aggregates data, using just two functions: melt and dcast (or acast).
randomForest (random forest predictive models): Leo Breiman and Adele Cutler’s random forests for classification and regression
gbm (generalized boosted regression models): Yoav Freund and Robert Schapire’s AdaBoost algorithm and Jerome Friedman’s gradient boosting machine
I will discuss each of these packages as I guide you through the book. In Chapter 6, I will discuss the use of R to process the sample data within the sample framework. I will provide examples that demonstrate the basic ideas and engineering behind the framework and the tools.
Please note that there are many other packages in CRAN, which is growing on a daily basis. Investigating the different packages to improve your capabilities in the R environment is time well spent.
Scala
Scala is a general-purpose programming language that supports functional programming and a strong static type system. Many high-performance data science frameworks are constructed using Scala, because of its amazing concurrency capabilities; parallelizing masses of processing is a key requirement for large data sets from a data lake. Scala is emerging as the de facto programming language of data-processing tools, and it is the native language of Spark, so it is useful to master. I provide guidance on how to use it in the course of this book.
Python
Python is a high-level, general-purpose programming language created by Guido van Rossum and first released in 1991. It is an interpreted language, with a design philosophy that emphasizes code readability. Python uses a dynamic type system and automatic memory management and supports multiple programming paradigms (object-oriented, imperative, functional, and procedural).
Thanks to its worldwide success, Python has a large and comprehensive standard library. The Python Package Index (PyPI) (https://pypi.python.org/pypi) supplies thousands of third-party modules ready for use in your data science projects. I provide guidance on how to use it in the course of this book.
I suggest that you also install Anaconda, an open source distribution of Python that simplifies package management and deployment (see www.continuum.io/downloads).
MQTT (MQ Telemetry Transport)
MQTT stands for MQ Telemetry Transport. It is an extremely simple and lightweight publish/subscribe messaging protocol, intended for constrained devices and low-bandwidth, high-latency, or unreliable networks. This makes it perfect for machine-to-machine (M2M) and Internet-of-Things-connected devices.
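A small, self-contained sketch of MQTT's topic-filter matching, following the published wildcard rules ('+' matches exactly one topic level, '#' matches the remainder); the topic names are illustrative:

```python
def topic_matches(topic_filter, topic):
    """Return True if an MQTT topic filter matches a concrete topic."""
    f_parts = topic_filter.split("/")
    t_parts = topic.split("/")
    for i, f in enumerate(f_parts):
        if f == "#":                 # multi-level wildcard: matches the rest
            return True
        if i >= len(t_parts):        # filter is longer than the topic
            return False
        if f != "+" and f != t_parts[i]:   # '+' matches any single level
            return False
    return len(f_parts) == len(t_parts)

print(topic_matches("shop/+/footfall", "shop/door3/footfall"))   # True
print(topic_matches("shop/#", "shop/door3/footfall/hourly"))     # True
print(topic_matches("shop/+", "shop/door3/footfall"))            # False
```

This matching logic is what lets one subscription such as "shop/#" collect readings from every footfall counter and advertising board in a store, without subscribing to each device individually.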
MQTT-enabled devices include handheld scanners, advertising boards, footfall counters, and other machines. In Chapter 7, I will discuss how and where you can use MQTT technology and how to make use of the essential benefits it generates. The apt use of this protocol is critical in present and future data science environments. In Chapter 11, I will discuss the use of MQTT for data collection and distribution back to the business.
What’s Next?
As things change daily in the current ecosphere of ever-evolving and -increasing collections of tools and technological improvements to support data scientists, feel free to investigate technologies not included in the preceding lists. I have acquainted you with my toolbox. This Data Science Technology Stack has served me well, and I will show you how to use it to achieve success.
Note
My hard-earned advice is to practice with your tools. Make them your own! Spend time with them, cultivate your expertise.
© Andreas François Vermeulen 2018
Andreas François Vermeulen, Practical Data Science, https://doi.org/10.1007/978-1-4842-3054-1_2
2. Vermeulen-Krennwallner-Hillman-Clark
Andreas François Vermeulen¹
(1)
West Kilbride, North Ayrshire, UK
Let’s begin by constructing a customer. I have created a fictional company for which you will perform practical data science as you progress through this book. You can execute your examples in either a Windows or Linux environment. You only have to download the desired example set.
Any source code or other supplementary material referenced in this book is available to readers on GitHub, via this book’s product page, located at www.apress.com/9781484230534 .
Windows
I suggest that you create a directory called c:\VKHCG to process all the examples in this book. Next, from GitHub, download and unzip the DS_VKHCG_Windows.zip file into this directory.
Linux
I also suggest that you create a directory called ./VKHCG, to process all the examples in this book. Then, from GitHub, download and untar the DS_VKHCG_Linux.tar.gz file into this directory.
Warning
If you change this directory to a new location, you will be required to change everything in the sample scripts to this new location, to get maximum benefit from the samples.
These files are used to create the sample company’s script and data directory, which I will use to guide you through the processes and examples in the rest of the book.
It’s Now Time to Meet Your Customer
Vermeulen-Krennwallner-Hillman-Clark Group (VKHCG) is a hypothetical medium-size international company. It consists of four subcompanies: Vermeulen PLC, Krennwallner AG, Hillman Ltd, and Clark Ltd.
Vermeulen PLC
Vermeulen PLC is a data processing company that processes all the data within the group companies, as part of their responsibility to the group. The company handles all the information technology aspects of the business.
This is the company for which you have just been hired to be the data scientist. Best of luck with your future.
The company supplies
Data science
Networks, servers, and communication systems
Internal and external web sites
Data analysis business activities
Decision science
Process automation
Management reporting
For the purposes of this book, I will explain what other technologies you need to investigate at every section of the framework, but the examples will concentrate only on specific concepts under discussion, as the overall data science field is more comprehensive than the few selected examples.
By way of examples, I will assist you in building a basic Data Science Technology Stack and then advise you further with additional discussions on how to get the stack to work at scale.
The examples will show you how to process the following business data:
Customers
Products
Location
Business processes
A number of handy data science algorithms
I will explain how to
Create a network routing diagram using geospatial analysis
Build a directed acyclic graph (DAG) for the schedule of jobs, using graph theory
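As a taste of the graph-theory example, a job schedule can be sketched as a DAG using nothing more than Python’s standard library. The job names below are hypothetical, not the book’s sample data:

```python
# Minimal sketch: model a schedule of jobs as a directed acyclic
# graph (DAG) with the standard library's graphlib module.
from graphlib import TopologicalSorter

# Each key depends on the jobs in its value set (hypothetical jobs).
dag = {
    "clean-data": {"load-raw-data"},
    "build-features": {"clean-data"},
    "train-model": {"build-features"},
    "build-report": {"clean-data"},
}

# A topological sort yields a valid execution order for the jobs.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Because the graph is acyclic, the topological sort always places every job after all of its dependencies, which is exactly what a job scheduler needs.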
If you want to have a more detailed view of the company’s data, browse the data sets in the company’s sample directory (./VKHCG/01-Vermeulen/00-RawData).
Later in this chapter, I will give you a more detailed walk-through of each data set.
Krennwallner AG
Krennwallner AG is an advertising and media company that prepares advertising and media content for the customers of the group.
It supplies
Advertising on billboards
Advertising and content management for online delivery
Event management for key customers
Via a number of technologies, it records who watches which media streams. The specific requirement we will elaborate on is how to identify the groups of customers who should see specific media content. I will explain how to
Pick content for specific billboards
Understand online web site visitors’ data per country
Plan an event for top-10 customers at Neuschwanstein Castle
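As a flavor of the visitor analysis, here is a minimal sketch, using only the Python standard library and hypothetical visitor records (not the sample data), of counting web site visitors per country:

```python
# Minimal sketch: summarize web site visitors per country.
from collections import Counter

# Hypothetical visitor records, for illustration only.
visits = [
    {"visitor": "V1", "country": "DE"},
    {"visitor": "V2", "country": "DE"},
    {"visitor": "V3", "country": "GB"},
    {"visitor": "V4", "country": "AT"},
    {"visitor": "V5", "country": "DE"},
]

# Count visits per country and rank them, busiest first.
per_country = Counter(v["country"] for v in visits)
print(per_country.most_common())
```

The same grouping pattern scales up naturally to Pandas once the data no longer fits comfortably in plain Python structures.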
If you want to have a more in-depth view of the company’s data, have a glance at the sample data sets in the company’s sample directory (./VKHCG/02-Krennwallner/00-RawData).
Hillman Ltd
The Hillman company is a supply chain and logistics company. It provisions a worldwide supply chain solution to the businesses, including
Third-party warehousing
International shipping
Door-to-door logistics
The principal requirement that I will expand on through examples is how you design the distribution of a customer’s products purchased online.
Through the examples, I will follow the product from factory to warehouse and warehouse to customer’s door.
I will explain how to
Plan the locations of the warehouses within the United Kingdom
Plan shipping rules for best-fit international logistics
Choose what the best packing option is for shipping containers for a given set of products
Create an optimal delivery route for a set of customers in Scotland
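The delivery-route example can be approximated with a simple nearest-neighbour heuristic. This is a minimal sketch with hypothetical coordinates and straight-line distances; the book’s examples use the real sample data and proper geospatial distances:

```python
# Minimal sketch: nearest-neighbour heuristic for a delivery route.
import math

# Hypothetical (latitude, longitude) coordinates for illustration.
customers = {
    "Edinburgh": (55.95, -3.19),
    "Glasgow": (55.86, -4.25),
    "Aberdeen": (57.15, -2.09),
    "Inverness": (57.48, -4.22),
}

def distance(a, b):
    # Straight-line distance; fine for a sketch, not for real routing.
    return math.hypot(a[0] - b[0], a[1] - b[1])

def nearest_neighbour_route(start, stops):
    # Repeatedly visit the closest customer not yet on the route.
    route, remaining = [start], set(stops) - {start}
    while remaining:
        here = customers[route[-1]]
        nxt = min(remaining, key=lambda c: distance(here, customers[c]))
        route.append(nxt)
        remaining.remove(nxt)
    return route

route = nearest_neighbour_route("Edinburgh", customers)
print(route)
```

Nearest-neighbour is a greedy heuristic: it is fast and easy to explain but does not guarantee the optimal tour, which is why real logistics planning uses stronger solvers.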
If you want to have a more detailed view of the company’s data, browse the data sets in the company’s sample directory (./VKHCG/03-Hillman/00-RawData).
Clark Ltd
The Clark company is a venture capitalist and accounting company that processes the following financial responsibilities of the group:
Financial insights
Venture capital management
Investments planning
Forex (foreign exchange) trading
I will use the financial aspects of the group companies to explain how to apply practical data science and data engineering to common problems, using the hypothetical financial data.
I will explain to you how to prepare
A simple forex trading planner
Accounting ratios
Profitability
Gross profit for sales
Gross profit after tax for sales
Return on capital employed (ROCE)
Asset turnover
Inventory turnover
Accounts receivable days
Accounts payable days
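As a preview, two of these ratios can be computed directly from hypothetical figures (all values below are illustrative, not from the sample data):

```python
# Minimal sketch: two accounting ratios from illustrative figures.
revenue = 1_000_000.0
cost_of_sales = 650_000.0
operating_profit = 210_000.0
capital_employed = 1_400_000.0

# Gross profit for sales: revenue left after direct costs.
gross_profit = revenue - cost_of_sales
gross_margin = gross_profit / revenue

# Return on capital employed (ROCE): operating profit relative to
# the capital tied up in the business.
roce = operating_profit / capital_employed

print(f"Gross margin: {gross_margin:.1%}, ROCE: {roce:.1%}")
```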
Processing Ecosystem
Five years ago, VKHCG consolidated its processing capability by transferring the concentrated processing requirements to Vermeulen PLC, to perform data science as a group service. As a result, the other group companies retained only 20% of the group’s business-activity processing; 90% of the data processing for the combined group’s business activities was reassigned to the core team. Vermeulen has since consolidated Spark, Python, Mesos, Akka, Cassandra, Kafka, Elasticsearch, and MQTT (MQ Telemetry Transport) processing into a group service provider and processing entity.
I will use R or Python for the data processing in the examples. I will also discuss the complementary technologies and advise you on what to consider and request for your own environment.
Note
The complementary technologies are used regularly in the data science environment. Although I cover them briefly, that does not make them any less significant.
VKHCG uses the R processing engine to perform data processing in 80% of the company business activities, and the other 20% is done by Python. Therefore, we will prepare an R and a Python environment to perform the examples. I will quickly advise you on how to obtain these additional environments, if you require them for your own specific business requirements.
I will cover briefly the technologies that we are not using in the examples but that are known to be beneficial.
Scala
Scala is popular in the data science community, as it supports massively parallel processing at scale. You can install the language from the core site: www.scala-lang.org/download/. Cheat sheets and references are available to guide you to resources that will help you master this programming language.
Note
Many of my larger clients are using Scala as their strategic development language.
Apache Spark
Apache Spark is a fast and general engine for large-scale data processing that is at present the fastest-growing processing engine for large-scale data science projects. You can install the engine from the following core site: http://spark.apache.org/ . For large-scale projects, I use the Spark environment within DataStax Enterprise ( www.datastax.com ), Hortonworks ( https://hortonworks.com/ ), Cloudera ( www.cloudera.com/ ), and MapR ( https://mapr.com/ ).
Note
Spark is now the most sought-after common processing engine for at-scale data processing, with support increasing by the day. I recommend that you master this engine, if you want to advance your career in data science at-scale.
Apache Mesos
Apache Mesos abstracts CPU, memory, storage, and additional computation resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to be built and run effectively. It is industry proven to scale to tens of thousands of nodes. This empowers the data scientist to run massively parallel analysis and processing in an efficient manner. The processing environment is available from the core site: http://mesos.apache.org/. I want to give Mesosphere Enterprise DC/OS an honorable mention, as I use it for many projects. See https://mesosphere.com for more details.
Note
Mesos is a cost-effective processing approach supporting growing dynamic processing requirements in an at-scale processing environment.
Akka
Akka supports building powerful concurrent and distributed applications to perform massive parallel processing, while sharing the common processing platform at-scale. You can install the engine from the following core site: http://akka.io/ . I use Akka processing within the Mesosphere Enterprise DC/OS environment.
Apache Cassandra
The Apache Cassandra database offers scalability and high availability without compromising performance. It provides linear scalability and proven fault tolerance and is widely used by numerous big companies. You can install the engine from the core site: http://cassandra.apache.org/.
I use Cassandra processing within the Mesosphere Enterprise DC/OS environment and DataStax Enterprise for my Cassandra installations.
Note
I recommend that you consider Cassandra as an at-scale database, as it supports the data science environment with stable data processing capability.
Kafka
Kafka is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, and impressively fast. You can install the engine from the following core site: http://kafka.apache.org/ . I use Kafka processing within the Mesosphere Enterprise DC/OS environment, to handle the ingress of data into my data science environments.
Note
I advise that you look at Kafka as a data transport, as it supports the data science environment with a robust data-collection facility.
Message Queue Telemetry Transport
Message Queue Telemetry Transport (MQTT) is a machine-to-machine (M2M) and Internet of Things (IoT) connectivity protocol. It is an especially lightweight publish/subscribe messaging transport that enables connections in locations where a small code footprint is essential and network bandwidth is limited. See http://mqtt.org/ for details.
Note
This protocol is common in sensor environments, as it provides the small code footprint and low bandwidth usage that sensors demand.
Now that I have covered the items you should know about but are not going to use in the examples, let’s look at what you will use.
Example Ecosystem
The examples require two setups within VKHCG’s environment: Python and R.
Python
Python is a high-level programming language created by Guido van Rossum and first released in 1991. Its popularity is growing; today, various training institutes cover the language as part of their data science curricula.
I suggest you install Anaconda, to enhance your Python development environment. It is an open source distribution of Python that simplifies package management and deployment (see www.continuum.io/downloads).
Ubuntu
I use an Ubuntu desktop and server installation to perform my data science (see www.ubuntu.com/ ), as follows:
sudo apt-get install python3 python3-pip python3-setuptools
CentOS/RHEL
If you want to use CentOS/RHEL, I suggest you employ the following install process:
sudo yum install python3 python3-pip python3-setuptools
Windows
If you want to use Windows, I suggest you employ the following install process.
Download the software from www.python.org/downloads/windows/ .
Is Python3 Ready?
Once installation is completed, you must test your environment as follows:
python3 --version
On success, you should see a response like this
Python 3.4.3+
Congratulations, Python is now ready.
Python Libraries
One of the most important features of Python is its libraries, which are widely available and make it straightforward to include proven data science processes in your environment.
To investigate extra packages, I suggest you review PyPI, the Python Package Index ( https://pypi.python.org/ ).
You have to set up a limited set of Python libraries to enable you to complete the examples.
Warning
Please ensure that you have verified all the packages you use. Remember: Open source is just that—open. Be vigilant!
Pandas
This provides a high-performance set of data structures and data-analysis tools for use in your data science.
Ubuntu
Install this by using
sudo apt-get install python-pandas
Centos/RHEL
Install this by using
sudo yum install python-pandas
PIP
Install this by using
pip install pandas
More information on Pandas development is available at http://pandas.pydata.org/ . I suggest following the cheat sheet ( https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf ), to guide you through the basics of using Pandas.
I will explain, via examples, how to use these Pandas tools.
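As a first taste, here is a minimal Pandas sketch with hypothetical customer records (not the book’s sample data):

```python
# Minimal sketch: load a small customer table into a Pandas
# DataFrame and summarize spend per country.
import pandas as pd

# Hypothetical records, for illustration only.
customers = pd.DataFrame({
    "customer": ["C001", "C002", "C003", "C004"],
    "country": ["DE", "GB", "DE", "AT"],
    "spend": [120.0, 80.0, 200.0, 55.0],
})

# Total spend per country, highest first.
spend_by_country = (customers.groupby("country")["spend"]
                             .sum()
                             .sort_values(ascending=False))
print(spend_by_country)
```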
Note
I suggest that you master this package, as it will support many of your data loading and storing processes, enabling overall data science processing.
Matplotlib
Matplotlib is a Python 2D and 3D plotting library that can produce various plots, histograms, power spectra, bar charts, error charts, scatterplots, and many advanced visualizations of your data science results.
Ubuntu
Install this by using
sudo apt-get install python-matplotlib
CentOS/RHEL
Install this by using
sudo yum install python-matplotlib
PIP
Install this by using
pip install matplotlib
Explore http://matplotlib.org/ for more details on the visualizations that you can accomplish with this package.
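A minimal sketch of a Matplotlib bar chart, using the non-interactive Agg backend so it runs without a display (the sales figures are illustrative):

```python
# Minimal sketch: save a bar chart of hypothetical sales to disk.
import matplotlib
matplotlib.use("Agg")  # render without a display (e.g., on a server)
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar"]
sales = [120, 150, 90]  # illustrative values

fig, ax = plt.subplots()
ax.bar(months, sales)
ax.set_xlabel("Month")
ax.set_ylabel("Sales")
ax.set_title("Hypothetical monthly sales")
fig.savefig("sales.png")  # write the chart as a PNG file
plt.close(fig)
```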
Note
I recommend that you spend time mastering your visualization skills. Without these skills, it is nearly impossible to communicate your data science results.
NumPy
NumPy is the fundamental package for scientific computing, based on a general homogeneous multidimensional array structure with processing tools. Explore www.numpy.org/ for further details.
I will use some of the tools in the examples but suggest you practice with the general tools, to assist you with your future in data science.
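A minimal sketch of the homogeneous multidimensional array and its vectorized processing tools (illustrative values):

```python
# Minimal sketch: a 2-D NumPy array with vectorized processing.
import numpy as np

# A homogeneous 2 x 3 array of illustrative readings.
readings = np.array([[1.0, 2.0, 3.0],
                     [4.0, 5.0, 6.0]])

# Vectorized operations replace explicit Python loops:
# mean over axis 0 gives the mean of each column.
col_means = readings.mean(axis=0)
print(readings.shape, col_means)
```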
SymPy
SymPy is a Python library for symbolic mathematics. It assists you in simplifying complex algebra formulas before including them in your code. Explore www.sympy.org for details on this package’s capabilities.
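A minimal sketch of symbolic simplification, using an illustrative formula:

```python
# Minimal sketch: simplify an algebraic expression with SymPy
# before hard-coding it into your processing.
import sympy as sp

x = sp.symbols("x")
expr = (x**2 - 1) / (x - 1)

# SymPy reduces the expression symbolically to x + 1.
simplified = sp.simplify(expr)
print(simplified)
```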
Scikit-Learn
Scikit-Learn is an efficient set of tools for data mining and data analysis packages. It provides support for data science classification, regression, clustering, dimensionality reduction, and preprocessing for feature extraction and normalization. This tool supports both supervised learning and unsupervised learning processes.
I will use many of the processes from this package in the examples. Explore http://scikit-learn.org for more details on this wide-ranging package.
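A minimal sketch of unsupervised clustering with Scikit-Learn’s KMeans, on a tiny hypothetical data set:

```python
# Minimal sketch: cluster a tiny 2-D data set with KMeans.
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of hypothetical points.
points = np.array([[1.0, 1.0], [1.2, 0.8],
                   [8.0, 8.0], [8.2, 7.9]])

# Fit two clusters; random_state makes the run reproducible.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = model.labels_
print(labels)
```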
Congratulations. You are now ready to execute the Python examples. Now, I will guide you through the second setup for the R environment.
R
R is the core processing engine for statistical computing and graphics. Download the software from www.r-project.org/ and follow the installation guidance for the specific R installation you require.
Ubuntu
Install this by using
sudo apt-get install r-base
CentOS/RHEL
Install this by using
sudo yum install R
Windows
From https://cran.r-project.org/bin/windows/base/ , install the software that matches your environment.
Development Environment
VKHCG uses the RStudio development environment for its data science and engineering within the group.
RStudio
RStudio produces a stress-free R ecosystem containing a code editor, debugging, and a visualization toolset. Download the relevant software from www.rstudio.com/ and follow the installation guidance for the specific installation you require.
Ubuntu
Install this by using
wget https://download1.rstudio.org/rstudio-1.0.143-amd64.deb
sudo dpkg -i *.deb
rm *.deb
CentOS/RHEL
Install this by using
wget https://download1.rstudio.org/rstudio-1.0.143-x86_64.rpm
sudo yum install --nogpgcheck rstudio-1.0.143-x86_64.rpm
Windows
Install https://download1.rstudio.org/RStudio-1.0.143.exe .
R Packages
I suggest the following additional R packages to enhance the default R environment.
Data.Table Package
Data.Table enables you to work with data files more effectively. I suggest that you practice using Data.Table processing, to enable you to process data quickly in the R environment and empower you to handle data sets that are up to 100GB in size.
The documentation is available at https://cran.r-project.org/web/packages/data.table/data.table.pdf . See https://CRAN.R-project.org/package=data.table for up-to-date information on the package.
To install the package, I suggest that you open your RStudio IDE and use the following command:
install.packages("data.table")
ReadR Package
The ReadR package enables the quick loading of text data into the R environment. The documentation is available at https://cran.r-project.org/web/packages/readr/readr.pdf . See https://CRAN.R-project.org/package=readr for up-to-date information on the package.
To install the package, I advise you to open your RStudio IDE and use the following command:
install.packages("readr")
I suggest that you practice by importing and exporting different formats of files, to understand the workings of this package and master the process. I also suggest that you investigate the following functions in depth in the ReadR package:
spec_delim(): Supports retrieving the specification of a delimited file without reading it into memory
read_delim(): Supports reading of delimited files into the R environment
write_delim(): Exports data from an R environment to a file on disk
JSONLite Package
This package enables you to process JSON files easily, as it is an optimized JSON parser and generator specifically for statistical data.
The documentation is at https://cran.r-project.org/web/packages/jsonlite/jsonlite.pdf . See https://CRAN.R-project.org/package=jsonlite for up-to-date information on the package. To install the package, I suggest that you open your RStudio IDE and use the following command:
install.packages("jsonlite")
I also suggest that you investigate the following functions in the package:
fromJSON(): Imports data from a JSON source directly into the R environment
prettify(): Adds indentation to JSON output, so that a human can read it more easily
minify(): Removes all JSON indentation/whitespace, making the JSON compact and machine readable
toJSON(): Converts R data into JSON-formatted data
read_json(): Reads JSON from a file on disk
write_json(): Writes JSON to a file on disk
Ggplot2 Package
Visualization of data is a significant skill for the data scientist. This package provides an environment in which to build complex graphics from your data. It is so successful at creating detailed graphics because it implements what is known as The Grammar of Graphics.
The documentation is located at https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf. See https://CRAN.R-project.org/package=ggplot2 for up-to-date information on the package.
To install the package, I suggest that you open your RStudio IDE and use