SQL Server Big Data Clusters: Data Virtualization, Data Lake, and AI Platform
Ebook · 356 pages · 2 hours

About this ebook

Use this guide to one of SQL Server 2019’s most impactful features—Big Data Clusters. You will learn about data virtualization and data lakes for this complete artificial intelligence (AI) and machine learning (ML) platform within the SQL Server database engine. You will know how to use Big Data Clusters to combine large volumes of streaming data for analysis along with data stored in a traditional database. For example, you can stream large volumes of data from Apache Spark in real time while executing Transact-SQL queries to bring in relevant additional data from your corporate SQL Server database.
Filled with clear examples and use cases, this book provides everything necessary to get started working with Big Data Clusters in SQL Server 2019. You will learn about the architectural foundations that are made up of Kubernetes, Spark, HDFS, and SQL Server on Linux. You are then shown how to configure and deploy Big Data Clusters in on-premises environments or in the cloud. Next, you are taught about querying. You will learn to write queries in Transact-SQL—taking advantage of skills you have honed for years—and with those queries you will be able to examine and analyze data from a wide variety of sources such as Apache Spark.
Through the theoretical foundation provided in this book and easy-to-follow example scripts and notebooks, you will be ready to use and unveil the full potential of SQL Server 2019: combining different types of data spread across widely disparate sources into a single view that is useful for business intelligence and machine learning analysis. 

What You Will Learn
  • Install, manage, and troubleshoot Big Data Clusters in cloud or on-premises environments
  • Analyze large volumes of data directly from SQL Server and/or Apache Spark
  • Manage data stored in HDFS from SQL Server as if it were relational data
  • Implement advanced analytics solutions through machine learning and AI
  • Expose different data sources as a single logical source using data virtualization

Who This Book Is For

Data engineers, data scientists, data architects, and database administrators who want to employ data virtualization and big data analytics in their environments
Language: English
Publisher: Apress
Release date: May 23, 2020
ISBN: 9781484259856

    Book preview

    SQL Server Big Data Clusters - Benjamin Weissman

    © Benjamin Weissman and Enrico van de Laar 2020

    B. Weissman, E. van de Laar, SQL Server Big Data Clusters, https://doi.org/10.1007/978-1-4842-5985-6_1

    1. What Are Big Data Clusters?

    Benjamin Weissman¹ and Enrico van de Laar²

    (1) Nurnberg, Germany

    (2) Drachten, The Netherlands

    SQL Server 2019 Big Data Clusters – or just Big Data Clusters – are a new feature set within SQL Server 2019 with a broad range of functionality around data virtualization, data mart scale out, and artificial intelligence (AI).

    SQL Server 2019 Big Data Clusters are only available as part of the box-product SQL Server. This is despite Microsoft’s cloud-first strategy of releasing new features and functionality to Azure first and rolling them out to the on-premises versions later (if at all).

    Major parts of Big Data Clusters run only on Linux. Let that sink in and travel back a few years in time. If somebody had told you in early 2016 that you would be able to run SQL Server on Linux, you probably would not have believed them. Then SQL Server on Linux was announced, but it delivered only a subset of what its big brother – SQL Server on Windows – actually contained. And now we have a feature that actually requires us to run SQL Server on Linux.

    Oh, and by the way, the name is a bit misleading. Some parts of SQL Server Big Data Clusters don’t really form a cluster – but more on that later.

    Speaking of parts, Big Data Clusters is not a single feature but a huge feature set serving a whole lot of different purposes, so it is unlikely that you will be embracing every single piece of it. Depending on your role, specific parts may be more useful to you than others. Over the course of this book, we will guide you through all capabilities to allow you to pick those functions that will help you and ignore those that wouldn’t add any value for you.

    What Is a SQL Server 2019 Big Data Cluster Really?

    SQL Server 2019 Big Data Clusters are essentially a combination of SQL Server, Apache Spark, and the HDFS filesystem running in a Kubernetes environment. As mentioned before, Big Data Clusters is not a single feature. Figure 1-1 categorizes the different parts of the feature set into different groups to help you better understand what is being provided. The overall idea is that, through virtualization and scale-out, SQL Server 2019 becomes the hub for all your data, even if that data is not physically sitting in SQL Server.

    Figure 1-1 Feature overview of SQL Server 2019 Big Data Clusters

    The major aspects of Big Data Clusters are shown from left to right in Figure 1-1. You have support for data virtualization, then a managed data platform, and finally an artificial intelligence (AI) platform. Each of these aspects is described in more detail in the remainder of this chapter.

    Data Virtualization

    The first feature within a SQL Server 2019 Big Data Cluster is data virtualization. Data virtualization – unlike data integration – retains your data at the source instead of duplicating it. Figure 1-2 illustrates this distinction between data integration and data virtualization. The dotted rectangles in the data virtualization target represent virtual data sources that always resolve back to a single instance of the data at the original source. In the world of Microsoft, this resolution of data to its original source is done via a SQL Server feature named PolyBase, allowing you to virtualize all or parts of your data mart.

    Figure 1-2 Data virtualization vs. data integration

    One obvious upside to data virtualization is that you get rid of redundant data: you don’t copy it from the source but read it directly from there. Especially in cases where you only read a big flat file once to aggregate it, there may be little to no use in keeping a duplicate, redundant copy. Also, with PolyBase, your query runs in real time, whereas integrated data will always carry some lag.

    On the other hand, you can’t put indexes on an external table. Thus, if you frequently query the data with workloads that differ from those on the original source – meaning you need a different indexing strategy – it might still make sense to integrate the data rather than virtualize it. That decision may also be driven by whether your source can accept the added workload that more frequent reporting queries and the like would bring.

    Note

    While data virtualization solves a couple of issues that come with data integration, it won’t be able to replace data integration. This is NOT the end of SSIS or ETL.

    Technically, PolyBase has been around since SQL Server 2016, but until now it supported only a very limited set of data sources. In SQL Server 2019, PolyBase has been greatly enhanced with support for multiple relational data sources such as SQL Server or Oracle, NoSQL sources like MongoDB, HDFS, and all other kinds of data, as we illustrate in Figure 1-3.

    Figure 1-3 PolyBase sources and capabilities in SQL Server 2019

    Effectively, you can query a table in another database or even on a completely different machine as if it were a local table.

    The use of PolyBase for virtualization may remind you of a linked server, and there definitely are some similarities. One big difference is that a query against a linked server tends to be longer and more involved than a PolyBase query. For example, here is a typical query against a remote table:

    SELECT * FROM MyOtherServer.MyDatabase.DBO.MyTable

    Using PolyBase, you would write the same query more simply, as if the table were in your local database. For example:

    SELECT * FROM MyTable

    PolyBase will know that the table is in a different database because you will have created a definition in PolyBase indicating where the table can be found.
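
    As a rough sketch – all object names, columns, and credentials below are purely illustrative and not taken from the book’s sample scripts – such a PolyBase definition could look like this:

    -- A database master key is assumed to already exist; the credential holds the remote login.
    CREATE DATABASE SCOPED CREDENTIAL MyOtherServerCredential
        WITH IDENTITY = 'remote_user', SECRET = 'remote_password';

    -- Tell PolyBase where the remote SQL Server lives.
    CREATE EXTERNAL DATA SOURCE MyOtherServerSource
        WITH (LOCATION = 'sqlserver://MyOtherServer', CREDENTIAL = MyOtherServerCredential);

    -- Map the remote table into the local database; the column list must match the remote table.
    CREATE EXTERNAL TABLE dbo.MyTable
    (
        MyID    INT,
        MyValue NVARCHAR(100)
    )
    WITH (DATA_SOURCE = MyOtherServerSource, LOCATION = 'MyDatabase.dbo.MyTable');

    -- From here on, the remote table can be queried as if it were local.
    SELECT * FROM MyTable;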

    An advantage of using PolyBase is that you can move MyDatabase to another server without having to rewrite your queries. Simply change your PolyBase data source definition to redirect to the new data source. You can do that easily, without harming or affecting your existing queries or views.
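
    One way to sketch that redirect – again with invented names, and dropping and re-creating the data source works just as well – is to alter the existing data source:

    -- Repoint the data source after MyDatabase has moved to a new server.
    ALTER EXTERNAL DATA SOURCE MyOtherServerSource
        SET LOCATION = 'sqlserver://MyNewServer';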

    There are more differences between the use of linked servers and PolyBase. Table 1-1 describes some that you should be aware of.

    Table 1-1 Comparison of linked servers and PolyBase

    Outsource Your Data

    You may have heard of Stretch Database,¹ a feature introduced in SQL Server 2016, which allows you to offload parts of your data to Azure. The idea is to use the feature for cold data – meaning data that you don’t access as frequently because it’s either old (but still needed for some queries) or simply for business areas that require less attention.

    The rationale behind cold data is that it should be cheaper to store that data in Azure than on premises. Unfortunately, the service may not be right for everyone, as even its entry-level tier provides significant storage performance, which obviously comes at a cost.

    With PolyBase, you can now, for example, offload data to an Azure SQL Database and build your own very low-level outsourcing functionality.
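
    To make the idea more concrete, here is a minimal, hypothetical sketch (server, credential, and table names are invented, and dbo.Orders_Hot is assumed to be an existing local table): cold orders live in an Azure SQL Database, are exposed through an external table, and are combined with the local hot data in a view.

    -- External data source pointing at the Azure SQL Database that holds the cold data.
    -- ColdStorageCredential is assumed to be an existing database scoped credential.
    CREATE EXTERNAL DATA SOURCE ColdStorageSource
        WITH (LOCATION = 'sqlserver://myserver.database.windows.net',
              CREDENTIAL = ColdStorageCredential);

    -- External table mapping the offloaded (cold) orders.
    CREATE EXTERNAL TABLE dbo.Orders_Cold
    (
        OrderID     INT,
        OrderDate   DATE,
        OrderAmount DECIMAL(10, 2)
    )
    WITH (DATA_SOURCE = ColdStorageSource, LOCATION = 'ColdDB.dbo.Orders_Cold');
    GO

    -- One logical view over hot (local) and cold (offloaded) data.
    CREATE VIEW dbo.Orders_All AS
        SELECT OrderID, OrderDate, OrderAmount FROM dbo.Orders_Hot
        UNION ALL
        SELECT OrderID, OrderDate, OrderAmount FROM dbo.Orders_Cold;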

    Reduce Data Redundancy and Development Time

    Besides offloading data, the other reason to virtualize data instead of integrating it is the potentially tremendous reduction of data redundancy. As data virtualization keeps the data at its original source and the data is therefore not persisted at the destination, you basically cut your storage needs in half compared to a traditional ETL-based staging process.

    Note

    Our cut-in-half assertion may not be entirely accurate, as you may not have staged the full dataset anyway (reducing the savings) or you may have used different datatypes (potentially increasing the savings even more).

    Think of this: you want to track the number of page requests on your website per hour, and the website logs to text files. In a traditional environment, you would have written a SQL Server Integration Services (SSIS) package to load the text file into a table, then run a query on it to group the data, and then store or use its result. With the virtualization approach, you would still run the query to group the data, but you would run it directly on the flat file. You save the time it would have taken to develop the SSIS package as well as the storage for the staging table holding the log data, which would otherwise have existed both in the file and in the staging table in SQL Server.
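
    A hedged sketch of that virtualization approach in T-SQL could look as follows; it assumes the web logs have been landed in the cluster’s HDFS storage pool, that an external data source over that pool (here called SqlStoragePool) already exists, and that all file paths and column names are invented:

    -- Describe how the flat files are delimited.
    CREATE EXTERNAL FILE FORMAT CsvFormat
        WITH (FORMAT_TYPE = DELIMITEDTEXT,
              FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

    -- External table directly over the log files in HDFS - no staging table involved.
    CREATE EXTERNAL TABLE dbo.WebRequests
    (
        RequestTime DATETIME2,
        Url         NVARCHAR(400)
    )
    WITH (DATA_SOURCE = SqlStoragePool,
          LOCATION = '/weblogs/',
          FILE_FORMAT = CsvFormat);

    -- Page requests per hour, aggregated straight from the flat files.
    SELECT CAST(RequestTime AS DATE)   AS RequestDate,
           DATEPART(HOUR, RequestTime) AS RequestHour,
           COUNT(*)                    AS Requests
    FROM dbo.WebRequests
    GROUP BY CAST(RequestTime AS DATE), DATEPART(HOUR, RequestTime);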

    A Combined Data Platform Environment

    One of the big use cases of SQL Server Big Data Clusters is the ability to create an environment that stores, manages, and analyzes data in different formats, types, and sizes. Most notably, you get the ability to store both relational data inside the SQL Server component and nonrelational data inside the HDFS storage subsystem. Using Big Data Clusters allows you to create a data lake environment that can answer all your data needs without a huge layer of complexity that comes with managing, updating, and configuring various parts that make up a data lake.

    Big Data Clusters completely take care of the installation and management of your Big Data Cluster straight from the setup of the product. Since Big Data Clusters is being positioned as a stand-alone product with full support from Microsoft, Microsoft is going to handle updates for all the technologies that make up Big Data Clusters through service packs and updates.

    So why would you be interested in a data lake? As it turns out, many organizations have a wide variety of data stored in different formats. In many situations, a large portion of data comes from the use of applications that store their data inside relational databases like SQL Server. By using a relational database, we can easily query the data inside of it and use it for all kinds of things like dashboards, KPIs, or even machine learning tasks to predict future sales, for instance.

    A relational database must follow a number of rules, and one of the most important of those rules is that a relational database always stores data in a schema-on-write manner. This means that if you want to insert data into a relational database, you have to make sure the data complies with the structure of the table being written to. Figure 1-4 illustrates schema-on-write.

    For instance, a table with the columns OrderID, OrderCustomer, and OrderAmount dictates that data you are inserting into that table will also need to contain those same columns. This means that when you want to write a new row in this table, you will have to define an OrderID, OrderCustomer, and OrderAmount for the insert to be successful. There is no room for adding additional columns on the fly, and in many cases, the data you are inserting needs to be of the same datatype as specified in the table (for instance, integers for numbers and strings for text).

    Figure 1-4 Schema-on-write
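
    As a small illustrative sketch (the datatypes here are our assumption, not a prescribed design), the Orders table and an insert against it could look like this:

    -- The schema is fixed at table creation time - schema-on-write.
    CREATE TABLE dbo.Orders
    (
        OrderID       INT            NOT NULL,
        OrderCustomer NVARCHAR(50)   NOT NULL,
        OrderAmount   DECIMAL(10, 2) NOT NULL
    );

    -- Succeeds: the row matches the schema.
    INSERT INTO dbo.Orders (OrderID, OrderCustomer, OrderAmount)
    VALUES (1, N'Contoso', 199.90);

    -- Would fail: there is no way to add an OrderDiscount column on the fly.
    -- INSERT INTO dbo.Orders (OrderID, OrderCustomer, OrderAmount, OrderDiscount)
    -- VALUES (2, N'Fabrikam', 89.50, 0.10);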

    Now in many situations the schema-on-write approach is perfectly fine. You make sure all your data is formatted in the way the relational databases expect it to be, and you can store all your data inside of it. But what happens when you decide to add new datasets that do not necessarily have a fixed schema? Or, you want to process data that is very large (multiple terabytes) in terms of size? In those situations, it is frequently advised to look for another technology to store and process your data since a relational database has difficulties handling data with those characteristics.

    Solutions like Hadoop and HDFS were created to solve some of the limitations around relational databases. Big Data platforms are able to process large volumes of data in a distributed manner by spreading the data across different machines (called nodes) that make up a cluster architecture. Using a technology like Hadoop or, as we will use in this book, Spark allows you to store and process data in any format. This means we can store huge CSV (comma-separated values) files, video files, Word documents, PDFs, or whatever we please without having to worry about complying with a predefined schema as we would have to when storing data inside a relational database.

    Apache’s Spark technology makes sure our data is cut up into smaller blocks and stored on the filesystem of the nodes that make up a Spark cluster. We only have to worry about the schema when we are going to read in and process the data, something that is called schema-on-read. When we load in our CSV file to check its contents, we have to define what type of data it is and, in the case of a CSV file, what the columns of the data are. Specifying these details on read allows us a lot of flexibility when dealing with this data, since we can add or remove columns or transform datatypes without having to worry about a schema before we write the data back again. Because a technology like Spark has a distributed architecture, we can perform all these data manipulation and querying steps very quickly on large datasets, something we explain in more detail in Chapter 2.

    What you see in the real world is that in many situations organizations have both relational databases and a Hadoop/Spark cluster to store and process their data. These solutions are implemented separately from each other and, in many cases, do not talk to each other. Is the data relational? Store it in the database! Is it nonrelational like CSV, IoT data, or other formats? Throw it on the Hadoop/Spark cluster! One reason why we are so excited over the release of SQL Server Big Data Clusters
