SQL for Data Scientists: A Beginner's Guide for Building Datasets for Analysis

Ebook, 572 pages, 3 hours

Name: SQL for Data Scientists: A Beginner's Guide for Building Datasets for Analysis
Author: Renee M. P. Teate
ISBN: 9781119669395

About this ebook

Jump-start your career as a data scientist—learn to develop datasets for exploration, analysis, and machine learning

SQL for Data Scientists: A Beginner's Guide for Building Datasets for Analysis is a resource dedicated to the Structured Query Language (SQL) and dataset design skills that data scientists use most. Aspiring data scientists will learn how to construct datasets for exploration, analysis, and machine learning. You will also discover how to approach query design and develop SQL code to extract data insights while avoiding common pitfalls.

You may be one of the many people entering the field of data science from a range of professions and educational backgrounds, such as business analytics, social science, physics, economics, and computer science. Like many of them, you may have conducted analyses using spreadsheets as data sources, but never retrieved and engineered datasets from a relational database using SQL, a programming language designed for managing databases and extracting data.

This guide for data scientists differs from other instructional guides on the subject. It doesn’t cover SQL broadly. Instead, you’ll learn the subset of SQL skills that data analysts and data scientists use frequently. You’ll also gain practical advice and direction on "how to think about constructing your dataset."

Gain an understanding of relational database structure, query design, and SQL syntax
Develop queries to construct datasets for use in applications like interactive reports and machine learning algorithms
Review strategies and approaches so you can design analytical datasets
Practice your techniques with the provided database and SQL code

In this book, author Renee Teate shares knowledge gained during a 15-year career working with data, in roles ranging from database developer to data analyst to data scientist. She guides you through SQL code and dataset design concepts from an industry practitioner’s perspective, helping you move your data science career forward.

Language: English

Publisher: Wiley

Release date: Aug 17, 2021

Author

Renee M. P. Teate

Related authors

Skip carousel

Related to SQL for Data Scientists

Related ebooks

Skip carousel

Professional ASP.NET MVC 5
Ebook
Professional ASP.NET MVC 5
byJon Galloway
Rating: 0 out of 5 stars
0 ratings
Machine Learning in the AWS Cloud: Add Intelligence to Applications with Amazon SageMaker and Amazon Rekognition
Ebook
Machine Learning in the AWS Cloud: Add Intelligence to Applications with Amazon SageMaker and Amazon Rekognition
byAbhishek Mishra
Rating: 0 out of 5 stars
0 ratings
Machine Learning: Hands-On for Developers and Technical Professionals
Ebook
Machine Learning: Hands-On for Developers and Technical Professionals
byJason Bell
Rating: 0 out of 5 stars
0 ratings
NoSQL For Dummies
Ebook
NoSQL For Dummies
byAdam Fowler
Rating: 0 out of 5 stars
0 ratings
Algorithms For Dummies
Ebook
Algorithms For Dummies
byJohn Paul Mueller
Rating: 4 out of 5 stars
4/5
Excel: A Comprehensive Guide to the Basics, Formulas, Functions, Charts, and Tables in Excel with Step-by-Step Instructions and Practical Examples
Ebook
Excel: A Comprehensive Guide to the Basics, Formulas, Functions, Charts, and Tables in Excel with Step-by-Step Instructions and Practical Examples
byAdam K. Grubb
Rating: 0 out of 5 stars
0 ratings
The Informed Company: How to Build Modern Agile Data Stacks that Drive Winning Insights
Ebook
The Informed Company: How to Build Modern Agile Data Stacks that Drive Winning Insights
byDave Fowler
Rating: 0 out of 5 stars
0 ratings
Python for Data Science For Dummies
Ebook
Python for Data Science For Dummies
byJohn Paul Mueller
Rating: 0 out of 5 stars
0 ratings
Visualizing Financial Data
Ebook
Visualizing Financial Data
byJulie Rodriguez
Rating: 0 out of 5 stars
0 ratings
P-AI-R Programming: How AI Tools Like GitHub Copilot and ChatGPT Can Radically Transform Your Development Workflow: P-AI-R Programming, #1
Ebook
P-AI-R Programming: How AI Tools Like GitHub Copilot and ChatGPT Can Radically Transform Your Development Workflow: P-AI-R Programming, #1
byMichael D Callaghan
Rating: 4 out of 5 stars
4/5
PyTorch Cookbook
Ebook
PyTorch Cookbook
byMatthew Rosch
Rating: 0 out of 5 stars
0 ratings
Learning Bing Maps API
Ebook
Learning Bing Maps API
byArtan Sinani
Rating: 0 out of 5 stars
0 ratings
Managing Machine Learning Projects: From design to deployment
Ebook
Managing Machine Learning Projects: From design to deployment
bySimon Thompson
Rating: 0 out of 5 stars
0 ratings
Mastering Visual Studio Code: Navigating the Future of Development
Ebook
Mastering Visual Studio Code: Navigating the Future of Development
byKameron Hussain
Rating: 0 out of 5 stars
0 ratings
Tech Trends in Practice: The 25 Technologies that are Driving the 4th Industrial Revolution
Ebook
Tech Trends in Practice: The 25 Technologies that are Driving the 4th Industrial Revolution
byBernard Marr
Rating: 0 out of 5 stars
0 ratings
Geometry for Programmers
Ebook
Geometry for Programmers
byOleksandr Kaleniuk
Rating: 0 out of 5 stars
0 ratings
Mastering Postman: A Comprehensive Guide to Building End-to-End APIs with Testing, Integration and Automation
Ebook
Mastering Postman: A Comprehensive Guide to Building End-to-End APIs with Testing, Integration and Automation
byOliver James
Rating: 0 out of 5 stars
0 ratings
Applying Data Modeling A Complete Guide
Ebook
Applying Data Modeling A Complete Guide
byGerardus Blokdyk
Rating: 0 out of 5 stars
0 ratings
Scaling AI For Businesses A Complete Guide - 2020 Edition
Ebook
Scaling AI For Businesses A Complete Guide - 2020 Edition
byGerardus Blokdyk
Rating: 0 out of 5 stars
0 ratings
App Design Basics for Professionals
Ebook
App Design Basics for Professionals
byJennifer Carrington
Rating: 0 out of 5 stars
0 ratings
D3: Modern Web Visualization: Exploratory Visualizations, Interactive Charts, 2D Web Graphics, and Data-Driven Visual Representations (English Edition)
Ebook
D3: Modern Web Visualization: Exploratory Visualizations, Interactive Charts, 2D Web Graphics, and Data-Driven Visual Representations (English Edition)
byVictor M Garcia Sanabria
Rating: 0 out of 5 stars
0 ratings
Exploring the Python Library Ecosystem: A Comprehensive Guide
Ebook
Exploring the Python Library Ecosystem: A Comprehensive Guide
byKameron Hussain
Rating: 0 out of 5 stars
0 ratings
The Jamstack Book: Beyond static sites with JavaScript, APIs, and markup
Ebook
The Jamstack Book: Beyond static sites with JavaScript, APIs, and markup
byRaymond Camden
Rating: 0 out of 5 stars
0 ratings
Byte-Sized Learning Series
Ebook series
Byte-Sized Learning Series
byI. Almeida
Julia as a Second Language
Ebook
Julia as a Second Language
byErik Engheim
Rating: 0 out of 5 stars
0 ratings
.NET 7 Design Patterns In-Depth: Enhance code efficiency and maintainability with .NET Design Patterns (English Edition)
Ebook
.NET 7 Design Patterns In-Depth: Enhance code efficiency and maintainability with .NET Design Patterns (English Edition)
byVahid Farahmandian
Rating: 0 out of 5 stars
0 ratings
Learn Java with Math: Using Fun Projects and Games
Ebook
Learn Java with Math: Using Fun Projects and Games
byRon Dai
Rating: 0 out of 5 stars
0 ratings
Mastering Swift
Ebook
Mastering Swift
byJon Hoffman
Rating: 0 out of 5 stars
0 ratings
iOS 17 App Development Essentials: Developing iOS 17 Apps with Xcode 15, Swift, and SwiftUI
Ebook
iOS 17 App Development Essentials: Developing iOS 17 Apps with Xcode 15, Swift, and SwiftUI
byNeil Smyth
Rating: 0 out of 5 stars
0 ratings
Learn AI-assisted Python Programming: With GitHub Copilot and ChatGPT
Ebook
Learn AI-assisted Python Programming: With GitHub Copilot and ChatGPT
byLeo Porter
Rating: 0 out of 5 stars
0 ratings

Programming For You

Skip carousel

PYTHON: Practical Python Programming For Beginners & Experts With Hands-on Project
Ebook
PYTHON: Practical Python Programming For Beginners & Experts With Hands-on Project
byMark Chan
Rating: 5 out of 5 stars
5/5
Data Science from Scratch: The #1 Data Science Guide for Everything A Data Scientist Needs to Know: Python, Linear Algebra, Statistics, Coding, Applications, Neural Networks, and Decision Trees
Ebook
Data Science from Scratch: The #1 Data Science Guide for Everything A Data Scientist Needs to Know: Python, Linear Algebra, Statistics, Coding, Applications, Neural Networks, and Decision Trees
bySteven Cooper
Rating: 4 out of 5 stars
4/5
Python Programming : How to Code Python Fast In Just 24 Hours With 7 Simple Steps
Ebook
Python Programming : How to Code Python Fast In Just 24 Hours With 7 Simple Steps
byJason Scotts
Rating: 4 out of 5 stars
4/5
Coding All-in-One For Dummies
Ebook
Coding All-in-One For Dummies
byNikhil Abraham
Rating: 4 out of 5 stars
4/5
Excel Essentials: A Step-by-Step Guide with Pictures for Absolute Beginners to Master the Basics and Start Using Excel with Confidence
Ebook
Excel Essentials: A Step-by-Step Guide with Pictures for Absolute Beginners to Master the Basics and Start Using Excel with Confidence
byNigel Tillery
Rating: 0 out of 5 stars
0 ratings
SQL QuickStart Guide: The Simplified Beginner's Guide to Managing, Analyzing, and Manipulating Data With SQL
Ebook
SQL QuickStart Guide: The Simplified Beginner's Guide to Managing, Analyzing, and Manipulating Data With SQL
byWalter Shields
Rating: 4 out of 5 stars
4/5
C++ Learn in 24 Hours
Ebook
C++ Learn in 24 Hours
byAlex Nordeen
Rating: 0 out of 5 stars
0 ratings
Excel : The Ultimate Comprehensive Step-By-Step Guide to the Basics of Excel Programming: 1
Ebook
Excel : The Ultimate Comprehensive Step-By-Step Guide to the Basics of Excel Programming: 1
byKevin Clark
Rating: 5 out of 5 stars
5/5
Learn to Code. Get a Job. The Ultimate Guide to Learning and Getting Hired as a Developer.
Ebook
Learn to Code. Get a Job. The Ultimate Guide to Learning and Getting Hired as a Developer.
byGwendolyn Faraday
Rating: 5 out of 5 stars
5/5
HTML & CSS: Learn the Fundaments in 7 Days
Ebook
HTML & CSS: Learn the Fundaments in 7 Days
byMichael Knapp
Rating: 4 out of 5 stars
4/5
Learn PowerShell in a Month of Lunches, Fourth Edition: Covers Windows, Linux, and macOS
Ebook
Learn PowerShell in a Month of Lunches, Fourth Edition: Covers Windows, Linux, and macOS
byTravis Plunk
Rating: 0 out of 5 stars
0 ratings
Python Programming For Beginners: Learn The Basics Of Python Programming (Python Crash Course, Programming for Dummies)
Ebook
Python Programming For Beginners: Learn The Basics Of Python Programming (Python Crash Course, Programming for Dummies)
byJames Tudor
Rating: 5 out of 5 stars
5/5
C# 7.0 All-in-One For Dummies
Ebook
C# 7.0 All-in-One For Dummies
byBill Sempf
Rating: 0 out of 5 stars
0 ratings
Grokking Algorithms: An illustrated guide for programmers and other curious people
Ebook
Grokking Algorithms: An illustrated guide for programmers and other curious people
byAditya Bhargava
Rating: 4 out of 5 stars
4/5
Python for Beginners. A Smarter Way to Learn Python in 5 Days and Remember it Longer. With Easy Step by Step Guidance and Hands on Examples. (Python Crash Course-Programming for Beginners)
Ebook
Python for Beginners. A Smarter Way to Learn Python in 5 Days and Remember it Longer. With Easy Step by Step Guidance and Hands on Examples. (Python Crash Course-Programming for Beginners)
byArthur T. Brooks
Rating: 0 out of 5 stars
0 ratings
Hacking: Ultimate Beginner's Guide for Computer Hacking in 2018 and Beyond: Hacking in 2018, #1
Ebook
Hacking: Ultimate Beginner's Guide for Computer Hacking in 2018 and Beyond: Hacking in 2018, #1
byDexter Jackson
Rating: 4 out of 5 stars
4/5
Beginning Programming with Python For Dummies
Ebook
Beginning Programming with Python For Dummies
byJohn Paul Mueller
Rating: 3 out of 5 stars
3/5
Java for Beginners: A Crash Course to Learn Java Programming in 1 Week
Ebook
Java for Beginners: A Crash Course to Learn Java Programming in 1 Week
byBrady Ellison
Rating: 5 out of 5 stars
5/5
Learn SQL in 24 Hours
Ebook
Learn SQL in 24 Hours
byAlex Nordeen
Rating: 5 out of 5 stars
5/5
The Advanced Roblox Coding Book: An Unofficial Guide, Updated Edition: Learn How to Script Games, Code Objects and Settings, and Create Your Own World!
Ebook
The Advanced Roblox Coding Book: An Unofficial Guide, Updated Edition: Learn How to Script Games, Code Objects and Settings, and Create Your Own World!
byHeath Haskins
Rating: 5 out of 5 stars
5/5
Python: For Beginners A Crash Course Guide To Learn Python in 1 Week
Ebook
Python: For Beginners A Crash Course Guide To Learn Python in 1 Week
byTimothy C. Needham
Rating: 4 out of 5 stars
4/5
Linux: Learn in 24 Hours
Ebook
Linux: Learn in 24 Hours
byAlex Nordeen
Rating: 5 out of 5 stars
5/5
Game Development with Unreal Engine 5: Learn the Basics of Game Development in Unreal Engine 5 (English Edition)
Ebook
Game Development with Unreal Engine 5: Learn the Basics of Game Development in Unreal Engine 5 (English Edition)
byMitchell Lynn
Rating: 0 out of 5 stars
0 ratings
Python: Learn Python in 24 Hours
Ebook
Python: Learn Python in 24 Hours
byAlex Nordeen
Rating: 4 out of 5 stars
4/5
Data Structures and Algorithm Analysis in Java, Third Edition
Ebook
Data Structures and Algorithm Analysis in Java, Third Edition
byClifford A. Shaffer
Rating: 4 out of 5 stars
4/5
The JavaScript Workshop: Learn to develop interactive web applications with clean and maintainable JavaScript code
Ebook
The JavaScript Workshop: Learn to develop interactive web applications with clean and maintainable JavaScript code
byJoseph Labrecque
Rating: 5 out of 5 stars
5/5
Python Programming for Beginners: A Comprehensive Crash Course With Practical Exercises to Quickly Learn Coding and Programming for Data Analysis and Machine Learning
Ebook
Python Programming for Beginners: A Comprehensive Crash Course With Practical Exercises to Quickly Learn Coding and Programming for Data Analysis and Machine Learning
byAnthony Adams
Rating: 4 out of 5 stars
4/5
SQL: For Beginners: Your Guide To Easily Learn SQL Programming in 7 Days
Ebook
SQL: For Beginners: Your Guide To Easily Learn SQL Programming in 7 Days
byi Code Academy
Rating: 5 out of 5 stars
5/5
LINUX Beginner's Crash Course: Linux for Beginner's Guide to Linux Command Line, Linux System & Linux Commands
Ebook
LINUX Beginner's Crash Course: Linux for Beginner's Guide to Linux Command Line, Linux System & Linux Commands
byQuick Start Guides
Rating: 4 out of 5 stars
4/5
SQL All-in-One For Dummies
Ebook
SQL All-in-One For Dummies
byAllen G. Taylor
Rating: 3 out of 5 stars
3/5

Related podcast episodes

Skip carousel

Revisit The Fundamental Principles Of Working With Data To Avoid Getting Caught In The Hype Cycle: The data ecosystem has seen a constant flurry of activity for the past several years, and it shows no signs of slowing down. With all of the products, techniques, and buzzwords being discussed it can be easy to be overcome by the hype. In this episode Juan Sequeda and Tim Gasper from data.world share their views on the core principles that you can use to ground your work and avoid getting caught in the hype cycles.
Podcast episode
Revisit The Fundamental Principles Of Working With Data To Avoid Getting Caught In The Hype Cycle: The data ecosystem has seen a constant flurry of activity for the past several years, and it shows no signs of slowing down. With all of the products, techniques, and buzzwords being discussed it can be easy to be overcome by the hype. In this episode Juan Sequeda and Tim Gasper from data.world share their views on the core principles that you can use to ground your work and avoid getting caught in the hype cycles.
byData Engineering Podcast
0 ratings
0% found this document useful
Investing In Understanding The Customer Journey At American Express: An interview with Purvi Shah about the Customer 360 project at American Express and their journey into the cloud for enterprise data management
Podcast episode
Investing In Understanding The Customer Journey At American Express: An interview with Purvi Shah about the Customer 360 project at American Express and their journey into the cloud for enterprise data management
byData Engineering Podcast
0 ratings
0% found this document useful
[DataFramed Careers Series #2] What Makes a Great Data Science Portfolio
Podcast episode
[DataFramed Careers Series #2] What Makes a Great Data Science Portfolio
byDataFramed
0 ratings
0% found this document useful
Generative models: exploration to deployment: get Fully-Connected with Chris & Daniel
Podcast episode
Generative models: exploration to deployment: get Fully-Connected with Chris & Daniel
byPractical AI: Machine Learning, Data Science
100%
100% found this document useful
Reflections On Designing A Data Platform From Scratch: A monologue by Tobias Macey, the host of the show, about the design considerations involved in building a data platform and how the lessons learned from running the Data Engineering Podcast are influencing the choices made.
Podcast episode
Reflections On Designing A Data Platform From Scratch: A monologue by Tobias Macey, the host of the show, about the design considerations involved in building a data platform and how the lessons learned from running the Data Engineering Podcast are influencing the choices made.
byData Engineering Podcast
100%
100% found this document useful
LLMs Everywhere: Running 70B models in browsers and iPhones using MLC — with Tianqi Chen of CMU / OctoML
Podcast episode
LLMs Everywhere: Running 70B models in browsers and iPhones using MLC — with Tianqi Chen of CMU / OctoML
byLatent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0
0 ratings
0% found this document useful
Models for Human-Robot Collaboration with Julie Shah - #538
Podcast episode
Models for Human-Robot Collaboration with Julie Shah - #538
byThe TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
0 ratings
0% found this document useful
82: How to Get Started with Advanced Analytics R-Python w/ Ryan Wade: Ryan Wade joins us on AOF today to talk about how to use advanced analytics in your organization! Ryan has been in the analytics game for the last 20 years and is now a Senior Solution Consultant at Blue Granite, based in Indianapolis, Indiana. He...
Podcast episode
82: How to Get Started with Advanced Analytics R-Python w/ Ryan Wade: Ryan Wade joins us on AOF today to talk about how to use advanced analytics in your organization! Ryan has been in the analytics game for the last 20 years and is now a Senior Solution Consultant at Blue Granite, based in Indianapolis, Indiana. He...
byAnalytics on Fire
0 ratings
0% found this document useful
Exploring K-means Clustering and Building a Gradebook With Pandas
Podcast episode
Exploring K-means Clustering and Building a Gradebook With Pandas
byThe Real Python Podcast
0 ratings
0% found this document useful
Unlocking The Power of Data Lineage In Your Platform with OpenLineage: An interview with Julien Le Dem about the OpenLineage specification and the opportunity that it offers for simplifying the tracking and analysis of data lineage across your data platform.
Podcast episode
Unlocking The Power of Data Lineage In Your Platform with OpenLineage: An interview with Julien Le Dem about the OpenLineage specification and the opportunity that it offers for simplifying the tracking and analysis of data lineage across your data platform.
byData Engineering Podcast
0 ratings
0% found this document useful
039. Day in the Life of a Salesforce Developer: If you can relate to Brad's 12-year developing free streak, you might be interested but hesitant about programming-heavy career options. Sure, the other Salesforce roles involve some coding, but Developers are in a completely different...
Podcast episode
039. Day in the Life of a Salesforce Developer: If you can relate to Brad's 12-year developing free streak, you might be interested but hesitant about programming-heavy career options. Sure, the other Salesforce roles involve some coding, but Developers are in a completely different...
bySalesforce for Everyone by Talent Stacker
0 ratings
0% found this document useful
462: Refactoring a Design System: This week, Marshall breaks down his process, tips, and tricks to redesign a design system in Figma.
Podcast episode
462: Refactoring a Design System: This week, Marshall breaks down his process, tips, and tricks to redesign a design system in Figma.
byDesign Details
0 ratings
0% found this document useful
[DataFramed Careers Series #1] Launching a Data Career in 2022
Podcast episode
[DataFramed Careers Series #1] Launching a Data Career in 2022
byDataFramed
0 ratings
0% found this document useful
How LLMs and Generative AI are Revolutionizing AI for Science with Anima Anandkumar - #614
Podcast episode
How LLMs and Generative AI are Revolutionizing AI for Science with Anima Anandkumar - #614
byThe TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
0 ratings
0% found this document useful
Automate all the UIs!: with Dominik Klotz, Co-Founder & CTO of AskUI
Podcast episode
Automate all the UIs!: with Dominik Klotz, Co-Founder & CTO of AskUI
byPractical AI: Machine Learning, Data Science
0 ratings
0% found this document useful
2155: Databricks - The Story Behind the Lakehouse Company: Many are citing open source as the future. The UK Government's National Data Strategy even talks about the importance of opening public sector datasets to form the backbone of innovation, efficiency, and growth. This is a trend that Databricks...
Podcast episode
2155: Databricks - The Story Behind the Lakehouse Company: Many are citing open source as the future. The UK Government's National Data Strategy even talks about the importance of opening public sector datasets to form the backbone of innovation, efficiency, and growth. This is a trend that Databricks...
byThe Tech Talks Daily Podcast
0 ratings
0% found this document useful
? Feature stores and CI/CD for machine learning with Qwak.ai VP Engineering, Ran Romano
Podcast episode
? Feature stores and CI/CD for machine learning with Qwak.ai VP Engineering, Ran Romano
byThe MLOps Podcast
0 ratings
0% found this document useful
Episode 339: Marin Todorov: Marin Todorov joins Tim to discuss his work on Swift Concurrency and Apple's DocC. He has just finished contributing to the RayWenderlich book, Combine: Asynchronous Programming with Swift and has an upcoming book on Modern Concurrency in Swift. He is also one of the original contributors on Apple's open source DocC.
Podcast episode
Episode 339: Marin Todorov: Marin Todorov joins Tim to discuss his work on Swift Concurrency and Apple's DocC. He has just finished contributing to the RayWenderlich book, Combine: Asynchronous Programming with Swift and has an upcoming book on Modern Concurrency in Swift. He is also one of the original contributors on Apple's open source DocC.
byMore Than Just Code podcast - iOS and Swift development, news and advice
0 ratings
0% found this document useful
Dataprep with Eric Anderson: Eric Anderson joins the podcast to talk about how Dataprep is simplifying data wrangling!
Podcast episode
Dataprep with Eric Anderson: Eric Anderson joins the podcast to talk about how Dataprep is simplifying data wrangling!
byGoogle Cloud Platform Podcast
0 ratings
0% found this document useful
Powering your Copilot for Data – with Artem Keydunov of Cube.dev
Podcast episode
Powering your Copilot for Data – with Artem Keydunov of Cube.dev
byLatent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0
0 ratings
0% found this document useful
How Data Platforms Affect ML & AI // Jake Watson // #207
Podcast episode
How Data Platforms Affect ML & AI // Jake Watson // #207
byMLOps.community
0 ratings
0% found this document useful
416: Multi-Dimensional Numbers: Joël discusses the challenges he encountered while optimizing slow SQL queries in a non-Rails application. Stephanie shares her experience with canary deploys in a Rails upgrade. Together, Stephanie and Joël address a listener's question about replacing the wkhtml2pdf tool, which is no longer maintained. The episode's main topic revolves around the concept of multidimensional numbers and their applications in software development. Joël introduces the idea of treating objects containing multiple numbers as single entities, using the example of 2D points in space to illustrate how custom classes can define mathematical operations like addition and subtraction for complex data types. They explore how this approach can simplify operations on data structures, such as inventories of T-shirt sizes, by treating them as mathematical objects.
Podcast episode
416: Multi-Dimensional Numbers: Joël discusses the challenges he encountered while optimizing slow SQL queries in a non-Rails application. Stephanie shares her experience with canary deploys in a Rails upgrade. Together, Stephanie and Joël address a listener's question about replacing the wkhtml2pdf tool, which is no longer maintained. The episode's main topic revolves around the concept of multidimensional numbers and their applications in software development. Joël introduces the idea of treating objects containing multiple numbers as single entities, using the example of 2D points in space to illustrate how custom classes can define mathematical operations like addition and subtraction for complex data types. They explore how this approach can simplify operations on data structures, such as inventories of T-shirt sizes, by treating them as mathematical objects.
byThe Bike Shed
0 ratings
0% found this document useful
Eliminate The Overhead In Your Data Integration With The Open Source dlt Library: Cloud data warehouses and the introduction of the ELT paradigm has led to the creation of multiple options for flexible data integration, with a roughly equal distribution of commercial and open source options. The challenge is that most of those options are complex to operate and exist in their own silo. The dlt project was created to eliminate overhead and bring data integration into your full control as a library component of your overall data system. In this episode Adrian Brudaru explains how it works, the benefits that it provides over other data integration solutions, and how you can start building pipelines today.
Podcast episode
Eliminate The Overhead In Your Data Integration With The Open Source dlt Library: Cloud data warehouses and the introduction of the ELT paradigm has led to the creation of multiple options for flexible data integration, with a roughly equal distribution of commercial and open source options. The challenge is that most of those options are complex to operate and exist in their own silo. The dlt project was created to eliminate overhead and bring data integration into your full control as a library component of your overall data system. In this episode Adrian Brudaru explains how it works, the benefits that it provides over other data integration solutions, and how you can start building pipelines today.
byData Engineering Podcast
0 ratings
0% found this document useful
Putting machine learning into a database: Most data scientists bounce back and forth regula…
Podcast episode
Putting machine learning into a database: Most data scientists bounce back and forth regula…
byLinear Digressions
0 ratings
0% found this document useful
Building An Internal Database As A Service Platform At Cloudflare: Data persistence is one of the most challenging aspects of computer systems. In the era of the cloud most developers rely on hosted services to manage their databases, but what if you are a cloud service? In this episode Vignesh Ravichandran explains how his team at Cloudflare provides PostgreSQL as a service to their developers for low latency and high uptime services at global scale. This is an interesting and insightful look at pragmatic engineering for reliability and scale.
Podcast episode
Building An Internal Database As A Service Platform At Cloudflare: Data persistence is one of the most challenging aspects of computer systems. In the era of the cloud most developers rely on hosted services to manage their databases, but what if you are a cloud service? In this episode Vignesh Ravichandran explains how his team at Cloudflare provides PostgreSQL as a service to their developers for low latency and high uptime services at global scale. This is an interesting and insightful look at pragmatic engineering for reliability and scale.
byData Engineering Podcast
0 ratings
0% found this document useful
Scalable Databases on Kubernetes
Podcast episode
Scalable Databases on Kubernetes
byThe Cloudcast
0 ratings
0% found this document useful
Automate Your Pipeline Creation For Streaming Data Transformations With SQLake: Managing end-to-end data flows becomes complex and unwieldy as the scale of data and its variety of applications in an organization grows. Part of this complexity is due to the transformation and orchestration of data living in disparate systems. The team at Upsolver is taking aim at this problem with the latest iteration of their platform in the form of SQLake. In this episode Ori Rafael explains how they are automating the creation and scheduling of orchestration flows and their related transforations in a unified SQL interface.
Podcast episode



SQL for Data Scientists - Renee M. P. Teate

Introduction

Who I Am and Why I'm Writing About This Topic

When I was first brainstorming topics for this book, I used two questions to narrow down my list: "Who is my audience?" and "What topic do I know well enough to write a book that would be worth publishing for that audience?"

The first question had an easy initial answer: I already have an audience of data-science-learning Twitter followers with whom I share resources and advice on Becoming a Data Scientist that I could keep in mind while narrowing down the topics.

So then I was left to figure out what I know that I could teach to people who want to become data scientists.

I have been designing and querying relational databases professionally for about 17 years: first as a database and web developer, then as a data analyst, and for the last 5 years, as a data scientist. SQL (Structured Query Language) has been a key tool for me throughout—whether I was working with MS Access, MS SQL Server, MySQL, Oracle, or Redshift databases, and whether I was summarizing data into reporting views in a data mart, extracting data to use in a data visualization tool like Tableau, or preparing a dataset for a machine learning project.

Since SQL is a tool I have used throughout my career, and because creating and retrieving datasets for analysis has been such an integral part of my job as a data scientist, I was surprised to learn that some data scientists don't know SQL or don't regularly write SQL code. But in an informal Twitter poll I conducted, which received responses from 979 data scientists, 19% of them reported wanting to learn, or learn more, SQL (74% reported already using SQL professionally). Additionally, 55% of 713 respondents who were working toward becoming data scientists said they wanted to learn, or learn more, SQL. So, my target audience had an interest in this topic.

According to an analysis of online job postings conducted by Jeff Hale of Towards Data Science, SQL is in the top three technology skills that data scientist jobs require. (See towardsdatascience.com/the-most-in-demand-skills-for-data-scientists-4a4a8db896db.) In an Indeed BeSeen article, Joy Garza lists SQL as one of the top-five in-demand tech skills for data scientists. (See https://web.archive.org/web/20200624031802/https://www.beseen.com/blog/talent/data-scientist-skills/.)

After learning how many working and prospective data scientists wanted to learn SQL, and how much of a need there is in the industry for people who know how to use it, SQL dataset development started to move to the top of the list of topics I could share my knowledge of with others.

There are many SQL books on the market that can be used to learn query syntax and advanced SQL functions—after all, the language has been around for 45 years and has been standardized since the late 1980s—but I hadn't found any definitive resources to refer people to when they asked me if I knew of any books that taught how to use SQL to construct datasets for machine learning, so I decided to write this book to cover SQL from a data scientist's point of view.

So, my goal in writing this book is not only to teach you how to write SQL code but to teach you how to think about summarizing data into analytical datasets that can be used for reports and machine learning: to use SQL like a data scientist does. Like I do.

Who This Book Is For

SQL for Data Scientists is designed to be a learning resource for anyone who wants to become (or who already is) a data analyst or data scientist, and wants to be able to pull data from databases to build their own datasets without having to rely on others in the organization to query the source system and transform the data into flat files (or spreadsheets) for them.

There are plenty of SQL books out there, but many are either written as syntax references or written for people in other roles that create, query from, and maintain databases. However, this book is written from the perspective of a data scientist and is aimed at those who will primarily be extracting data from existing databases in order to generate datasets for analysis.

I won't assume that you've ever written SQL queries before, and we'll start with the basics, but I do assume that you have some basic understanding of what databases are and a general idea of how data might be used in reports, analyses, and machine learning algorithms. This book is meant to fill in the steps between finding a database that contains the data you need and starting the analysis. I aim to teach you how to think about structuring datasets for analysis and how to use SQL to extract the data from the database and get it into that form.

Why You Should Learn SQL if You Want to Be a Data Scientist

If you can use SQL to pull your own datasets, you don't have to rely on others in your organization to pull them for you, enabling you to work more efficiently. Requesting datasets usually involves a process of filling out a form or ticket describing in detail what data you need, waiting for your request to be fulfilled, then often clarifying your request after seeing the initial results, and then waiting again for modifications. If you can edit your own queries, you can not only design and retrieve your own datasets but then also adjust calculations or add fields as needed.

Additionally, running a SQL query that writes to a database table or exports to a file—effectively snapshotting the data in the form you need it in for your analysis—means you don't have to retrieve and reprocess the data in your machine learning script every time you run your code, speeding up the usually iterative model development process.

Some summaries and calculations can be done more efficiently in SQL than in other types of code, as well, so even if you are running the queries live each time you run your script, you may be able to lower the computational cost of your code by doing some of the transformations in SQL.
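As a minimal sketch of the two points above, the following script summarizes raw rows once in SQL and stores the result as a snapshot table that later runs can read directly. It makes several assumptions that are not from the book: SQLite (from Python's standard library) stands in for MySQL so the example runs without a server, and the purchases and customer_summary tables and their columns are hypothetical.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for MySQL here.
# The table and column names are hypothetical, chosen for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE purchases (
        customer_id INTEGER,
        purchase_date TEXT,
        amount REAL
    );
    INSERT INTO purchases VALUES
        (1, '2020-03-01', 10.0),
        (1, '2020-03-08', 5.5),
        (2, '2020-03-01', 20.0);

    -- Summarize once in SQL and store the result as a snapshot table.
    CREATE TABLE customer_summary AS
    SELECT
        customer_id,
        COUNT(*) AS purchase_count,
        SUM(amount) AS total_spent
    FROM purchases
    GROUP BY customer_id;
""")

# Later runs of a modeling script can read the small snapshot table
# instead of re-aggregating the raw rows each time.
snapshot = conn.execute(
    "SELECT customer_id, purchase_count, total_spent "
    "FROM customer_summary ORDER BY customer_id"
).fetchall()
print(snapshot)
```

The snapshot holds one row per customer, so a script that only needs the per-customer totals never has to load the individual purchase rows.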

Finally, because it is a high-demand tech skill in data scientist job postings, learning SQL will increase your marketability and value to employers.

What I Hope You Gain from This Book

My goal is that by the time you finish reading this book and practicing the queries within (ideally both on the provided example database and on another database of your choosing, so you have to modify the example queries and apply them in another context), you will be able to think through the process of creating an analytical dataset and develop the SQL code necessary to generate your intended output.

I hope that even if you end up needing to use a SQL function that's not covered in this book, you will have gained enough baseline knowledge from the book to go look it up online and determine how to best use it in the query you are developing.

I also hope that this book will help you feel confident that you can pull your own data at work and get it into the form you need it in for your report or model without having to wait on others to do it for you.

Conventions

This book uses MySQL version 8.0–style SQL. No matter what type of database system you use (MS SQL Server, Redshift, PostgreSQL, Oracle, etc.), the query design concepts and syntax are very similar, when not identical, across platforms. So, if you work with a database system other than MySQL, you might have to search for the equivalent code syntax for a few functions in the book, but the overall dataset design concepts are platform-independent, and the SQL keywords are cross-platform standards.

When you see code displayed in the following style:

SELECT * FROM Product

that means it is a complete SQL query that you can use to select data from the Farmer's Market database described in Chapter 1, Data Sources. If you're reading the printed version of this book, you can go to the book's website to get digital versions of the queries that you can copy and paste to try them out yourself.

Reserved SQL keywords like SELECT will appear in all-uppercase throughout the book, and column names will appear in all-lowercase. This isn't a requirement of SQL syntax (neither are line breaks), but is a convention used for readability.

Be aware that the Farmer's Market database will continue to evolve, and I will likely continue adding rows to its tables after this book goes to print, so the data values you see in the output when you run the queries yourself may not exactly match the screenshots included in the printed book.

Reader Support for This Book

Companion Download Files

As you work through the examples in this book, you may choose either to type in all the code manually or to use the source code files that accompany the book. All the source code used in this book, along with the Farmer's Market database, is available for download from both sqlfordatascientists.com and www.wiley.com/go/sqlfordatascientists.

How to Contact the Publisher

If you believe you've found a mistake in this book, please bring it to our attention. At John Wiley & Sons, we understand how important it is to provide our customers with accurate content, but even with our best efforts an error may occur.

In order to submit your possible errata, please email it to our Customer Service Team at wileysupport@wiley.com with the subject line "Possible Book Errata Submission."

How to Contact the Author

I'm known as Data Science Renee on Twitter, and my username is @becomingdatasci. I'm happy to interact with readers via social media, so feel free to tweet me your questions and suggestions.

Thank you for giving me the chance to help guide you through the topic of SQL for Data Scientists. Let's dive in!

CHAPTER 1

Data Sources

As a data analyst or data scientist, you will encounter data from many sources—from databases to spreadsheets to Application Programming Interfaces (APIs)—which you are expected to use for predictive modeling. Understanding the source system your data comes from, how it was initially gathered and stored, and how frequently it is updated, will take you a long way toward an effective analysis. In my experience, issues with a predictive model can often be traced back all the way to the source data or the query that first pulls the data from the source. Exploring the data available for your analysis starts with exploring the structure of the source database.

Data Sources

Data can be stored in many forms and structures. Examples of unstructured data include text documents or images stored as individual files in a computer's file system. In this book, we'll be focusing on structured data, which is typically organized into a tabular format, like a spreadsheet or database table containing limited-length text or numeric values.

Many software applications enable the organization of data into structured forms. One example you are likely familiar with is Microsoft Excel, for creating and maintaining spreadsheets. Excel also includes some analysis capabilities, such as pivot tables for summarizing spreadsheets and data visualization tools for plotting data points from a spreadsheet. Some functions in Excel allow you to connect data in one spreadsheet to another, but in order to create a true relational database model and define rules for how the data tables are interconnected, Microsoft offers a relational database application called Access.

My first experiences with relational database design were in MS Access, and the basic Structured Query Language (SQL) concepts I learned in order to query data from an Access database are the same concepts I have used throughout my career—in increasingly complex ways. I have since extracted data from other Relational Database Management Systems (RDBMSs) such as MS SQL Server, Oracle Database, MySQL, and Amazon Redshift. Though the syntax for each can differ slightly, the general concepts, many of which you will learn in this book, are consistent across products.

SQL-style RDBMSs were first developed in the 1970s, and the basic database design concepts have stood the test of time; many of the database systems that originated then are still in use today. The longevity of these tools is another reason that SQL is so ubiquitous and so valuable to learn.

As a professional who works with data, you will likely encounter several of the following popular Relational Database Management Systems:

Oracle

MySQL

MS SQL Server

PostgreSQL

Amazon Redshift

IBM DB2

MS Access

SQLite

Snowflake

You will also likely work with data retrieved from other types of files and data stores at some point, such as CSV text files, JSON retrieved via API, XML in a NoSQL database, graph databases with special query languages, key-value stores, and so on. However, relational SQL databases still dominate the industry for structured data storage and are the most likely database systems you will encounter on the job.

Tools for Connecting to Data Sources and Editing SQL

When you start an analysis project, the first step is often connecting to a database on a server. This is generally done through a SQL Integrated Development Environment (IDE) or with code that connects to the database without a graphical user interface (GUI) to run queries that extract the data and store it in a structure that you can work with downstream in your analysis, such as a dataframe.

The IDE referenced for demonstration purposes throughout this book is MySQL Workbench Community Edition, which was chosen because we'll be querying a MySQL database in the examples. MySQL is open source under the GPL license, and MySQL Workbench CE is free to download.

Many other IDEs will allow you to connect to databases and will perform syntax-highlighting of SQL (highlighting keywords to make it easier to read and to spot errors). All major database systems support Open Database Connectivity (ODBC), which uses drivers to standardize the interfaces between software applications and databases. Whoever has granted you permission to access a database should give you documentation on how to securely connect to it via your selected IDE.

You can also connect to a database directly from code such as Python or R. Search for your preferred language and the type of database (for example, "R SQL Server" or "Python Redshift") and you will find packages or add-ons that enable you to embed SQL queries in your code and return results in the form of a dataframe or other data structure. The database system's official documentation will also provide information about connecting to it from other software and from within your code. Searching "MySQL connector" brings up a list of drivers for use with different languages, for example.
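The general pattern of embedding a query in code and getting rows back in a data structure can be sketched as follows. This is an illustration under stated assumptions rather than the book's own setup: Python's built-in sqlite3 module stands in for a MySQL driver, the fetch_rows helper is a hypothetical name, and the product table's columns are invented for the example. With a real MySQL server you would install a driver and follow its documentation for connection details.

```python
import sqlite3


def fetch_rows(conn, query, params=()):
    """Run a query and return the result as a list of dicts, one per row.

    Hypothetical helper: values are passed separately from the SQL string
    through placeholders instead of being pasted into the query text.
    """
    cursor = conn.execute(query, params)
    columns = [description[0] for description in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Assumption: an in-memory SQLite database with a made-up product table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (product_id INTEGER, product_name TEXT)")
conn.executemany(
    "INSERT INTO product VALUES (?, ?)",
    [(1, "Carrots"), (2, "Honey")],
)

rows = fetch_rows(
    conn,
    "SELECT product_id, product_name FROM product WHERE product_id = ?",
    (2,),
)
print(rows)
```

A list of dicts keyed by column name is close to what dataframe libraries expect as input, so the same result could be handed to one of them if it is installed.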

If you are writing code in a language like Python and will be passing a SQL statement to a function as a string, where it won't be syntax highlighted, you can write SQL in a free text tool that performs SQL syntax highlighting, such as Notepad++, or in a SQL IDE, and then paste the final result into your code.

Relational Databases

If you have never explored a database, you can think of a database table like a well-defined spreadsheet, with row identifiers and named column headers. Each table may store different subsets and types of data at different levels of detail.

An entity is the thing (object or concept) that the table represents and captures data for. If there is a table that contains data about books, the entity is Books, and the Book table is the data structure that contains information about the Book entity. Some people use the terms entity and table interchangeably.

You may see me using the terms row and record interchangeably in this book: a record in a database is like a row in a table and displayed the same way. Some people call a database row a tuple.

You may also see me using the terms column, field, and attribute as synonyms. A column header in a spreadsheet is the equivalent of an attribute name in a table. Each column in a database table stores data about an attribute of the entity.

For example, as illustrated in Figure 1.1, in a table of Books there would be a row for each book, with an ISBN number column to identify each book. The ISBN is an attribute of the book entity. The Author column in the row in the Books table representing this book would have my name in it, so you could say that "the value in the Author field in the SQL for Data Scientists record in the Books table is ‘Renée M. P. Teate’." Or, "In the Books table, the row representing the book SQL for Data Scientists contains the value ‘Renée M. P. Teate’ in the Author column."

Table presented with three rows and three columns. It records ISBN, title, and author.

Figure 1.1
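To make the row, column, and value vocabulary concrete, here is a small sketch of a Books table and a query that reads the Author value from one record. It is an assumption-laden illustration: SQLite stands in for MySQL, the lowercase table and column names are chosen for the example, and the second row is a placeholder, since the other rows in Figure 1.1 are not reproduced here.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (
        isbn TEXT PRIMARY KEY,  -- one ISBN per row identifies each book
        title TEXT,
        author TEXT
    );
    INSERT INTO books (isbn, title, author) VALUES
        ('9781119669395', 'SQL for Data Scientists', 'Renée M. P. Teate'),
        ('0000000000000', 'A Placeholder Title', 'A. N. Author');
""")

# "The value in the author field of the SQL for Data Scientists record."
author = conn.execute(
    "SELECT author FROM books WHERE title = ?",
    ("SQL for Data Scientists",),
).fetchone()[0]
print(author)
```

The ISBN is stored as text rather than as a number because it is an identifier, not a quantity to do arithmetic on.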

A database is a collection of related tables, and a database schema stores information about the tables (and other database objects), as well as the relationships between them, defining the structure of the database.

To illustrate an example of a relationship between database tables, imagine that one table in a database contains a record (row) for every patient that's ever scheduled an appointment at a doctor's office, with each patient's name, birthdate, and phone number, like a directory. Another table contains a record of every appointment, with the patient's name, appointment time, reason for the visit, and the name of the doctor the patient has an appointment with. The connection between these two tables could be the patient's name. (In reality, a unique identifier would be assigned to each patient, since two people can have the same name, but for this illustration, the name will suffice.) In order to create a report of every patient who has an appointment scheduled in the next week along with their contact information, there would have to be an established connection between the patient directory table and the appointment table, enabling someone to pull data from both tables simultaneously. See Figure 1.2.

A set of two tables titled, patients and appointments.

Figure 1.2
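The report described above, every patient with an appointment in the next week along with their contact information, could be sketched like this. The sketch assumes SQLite in place of MySQL, invented sample rows, hypothetical column names, and a fixed one-week window, and it joins on the patient's name only because that is the connection used in this illustration.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; all rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (
        patient_name TEXT,
        birthdate TEXT,
        phone TEXT
    );
    CREATE TABLE appointments (
        patient_name TEXT,
        appointment_time TEXT,
        reason TEXT,
        doctor_name TEXT
    );
    INSERT INTO patients VALUES
        ('Ana Lopez', '1980-05-01', '555-0100'),
        ('Ben Smith', '1975-11-23', '555-0101');
    INSERT INTO appointments VALUES
        ('Ana Lopez', '2020-06-01 09:00', 'Checkup', 'Dr. Patel'),
        ('Ana Lopez', '2020-06-03 14:30', 'Follow-up', 'Dr. Patel'),
        ('Ben Smith', '2020-07-15 10:00', 'Checkup', 'Dr. Chen');
""")

# Pull columns from both tables at once by matching on the shared name.
upcoming = conn.execute(
    """
    SELECT a.appointment_time, p.patient_name, p.phone
    FROM appointments AS a
    INNER JOIN patients AS p
        ON a.patient_name = p.patient_name
    WHERE a.appointment_time >= ? AND a.appointment_time < ?
    ORDER BY a.appointment_time
    """,
    ("2020-06-01", "2020-06-08"),
).fetchall()
print(upcoming)
```

The same patient appears twice in the result because one row in the patient table matches two rows in the appointment table.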

The relationship between the entities just described is called a one-to-many relationship. Each patient only appears in the patient directory table one time but can have many appointments in the related appointment-tracking table. Each appointment only has one patient's name associated with it.

Database relationships like this one are depicted in what's called an entity-relationship diagram (ERD). The ERD for these two tables is shown in Figure 1.3.

An illustration showing the relationship between patients and appointments.

Figure 1.3

NOTE

In an ERD, an infinity symbol, N, or crow's feet on the end of a line connecting two tables indicates that it is the many side of a one-to-many relationship. You can see the infinity symbol next to the Appointments table in Figure 1.3.

The primary key in a table is a column or combination of columns that uniquely identifies a row. The combination of values in the primary key columns must be unique per record, and in most database systems, including MySQL, none of the primary key columns can be NULL (empty). The primary key can be made of values that occur in the data that are unique per record (such as a Student ID Card number in a table of students at a university) or it can be generated by the database and not carry meaning elsewhere in real life, like an integer value that increments automatically every time a new record is created. The primary key in a table can be used to identify the records in other tables that relate to each of its records. When a table's primary key is referenced by a column in another table, that referencing column is called a foreign key.

NOTE

Notice that the NULL value is described in this section as empty and not as blank. In database terms, NULL and blank aren't necessarily the same thing. For example, a single space can be considered a blank value in a string field, but is not NULL, because there is a space character stored there. A NULL is the absence of any value, a totally empty field. NULLs are treated differently than blanks in SQL.
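The difference between NULL and blank can be seen in a short experiment. The sketch below assumes SQLite in place of MySQL and a hypothetical contacts table holding one ordinary value, one single space, one empty string, and one NULL.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; the table is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contacts (contact_id INTEGER, phone TEXT);
    INSERT INTO contacts VALUES
        (1, '555-0100'),  -- an ordinary value
        (2, ' '),         -- a single space: blank, but not NULL
        (3, ''),          -- an empty string: blank, but not NULL
        (4, NULL);        -- no value stored at all
""")


def matching_ids(condition):
    """Return the contact_id values whose rows satisfy a fixed condition.

    The condition is pasted into the query only because it is a
    hard-coded literal in this sketch, never user input.
    """
    query = f"SELECT contact_id FROM contacts WHERE {condition} ORDER BY contact_id"
    return [row[0] for row in conn.execute(query)]


null_ids = matching_ids("phone IS NULL")        # only the row with no value
equals_null_ids = matching_ids("phone = NULL")  # a comparison with NULL is never true
blank_ids = matching_ids("TRIM(phone) = ''")    # blank strings, but not the NULL row
print(null_ids, equals_null_ids, blank_ids)
```

Testing for NULL requires IS NULL, since an equality comparison against NULL does not evaluate to true for any row, including the row that actually holds NULL.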

As mentioned, using the Patient Name in the previous example is a poor selection of primary key, because two patients can have the same name, so your primary key won't necessarily end up uniquely identifying patients. One option that is common practice in the industry is to create a field that generates an auto-incrementing integer to serve as a unique identifier for each new row, so as not to rely on other values unique to a record that may be a privacy concern or unavailable at the time the record is created, such as Social Security numbers.

So, let's say that instead, the doctor's office database assigned an auto-incrementing integer value to serve as the primary key for each patient record in the Patients table and for each appointment record in the Appointments table. Then, the appointment-tracking table can use that generated Patient ID value to link each appointment to each patient, and the patient's name doesn't even need to be stored in the Appointments table. In Figure 1.4, you can see a database design where the Patient ID is serving as a primary key in the Patients table, and as a foreign key in the Appointments table.

An illustration showing the relationship between patients and appointments and the corresponding tables.

Figure 1.4
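A design along the lines of Figure 1.4 might be sketched as follows. The assumptions are worth stating: SQLite stands in for MySQL (which spells the keyword AUTO_INCREMENT rather than AUTOINCREMENT), the column names are hypothetical, and SQLite only enforces foreign keys after they are switched on for the connection.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
    CREATE TABLE patients (
        patient_id INTEGER PRIMARY KEY AUTOINCREMENT,
        patient_name TEXT,
        birthdate TEXT,
        phone TEXT
    );
    CREATE TABLE appointments (
        appointment_id INTEGER PRIMARY KEY AUTOINCREMENT,
        patient_id INTEGER NOT NULL REFERENCES patients (patient_id),
        appointment_time TEXT,
        reason TEXT,
        doctor_name TEXT
    );
""")

insert_patient = (
    "INSERT INTO patients (patient_name, birthdate, phone) VALUES (?, ?, ?)"
)
insert_appointment = (
    "INSERT INTO appointments (patient_id, appointment_time, reason, doctor_name) "
    "VALUES (?, ?, ?, ?)"
)

# Two patients who share a name still receive distinct generated keys.
first_id = conn.execute(insert_patient, ("Ana Lopez", "1980-05-01", "555-0100")).lastrowid
second_id = conn.execute(insert_patient, ("Ana Lopez", "1992-02-14", "555-0199")).lastrowid

# The appointment stores only the patient's key, not the patient's name.
conn.execute(insert_appointment, (first_id, "2020-06-01 09:00", "Checkup", "Dr. Patel"))

# An appointment pointing at a patient_id that does not exist is rejected.
try:
    conn.execute(insert_appointment, (999, "2020-06-02 10:00", "Checkup", "Dr. Chen"))
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(first_id, second_id, orphan_rejected)
```

Because the generated key is the link between the tables, a patient's name can change, or be shared with another patient, without breaking the connection to their appointments.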

Another type of relationship found in RDBMSs is called many-to-many. As you might guess, it's a connection between entities where the records on each side of the relationship can connect to multiple records on the other side. Using our Books example, if we had a table of Authors, there would be a many-to-many relationship between books and authors, because each author can write multiple books, and each book can have multiple authors. In order to create this
