Hands-on Data Virtualization with Polybase: Administer Big Data, SQL Queries and Data Accessibility Across Hadoop, Azure, Spark, Cassandra, MongoDB, CosmosDB, MySQL and PostgreSQL (English Edition)
Ebook, 803 pages, 5 hours

About this ebook

This book covers establishing and managing data virtualization with PolyBase. It teaches you how to configure PolyBase against most relational and non-relational databases, how to set up a test environment for any of the tools or software involved quickly and without hassle, and how to design and build high-performing data warehousing solutions in a matter of minutes.
You will become proficient at connecting to many data sources, including Hadoop, Cassandra, MySQL, PostgreSQL, MariaDB, and Oracle Database. The book also covers building big data clusters on Azure and working with Azure Synapse Analytics. By the end of this book, you will not only administer PolyBase for managing big data clusters but also know how to optimize and boost its performance to enable data analytics and easier data access.
Language: English
Release date: Apr 2, 2021
ISBN: 9789390684427

    Book preview

    Hands-on Data Virtualization with Polybase - Pablo Alejandro Echeverria Barrios

    CHAPTER 1

    Data Virtualization

    Imagine you have a list of information. If the list is relatively small, you can read it from start to end and summarize it quickly and without difficulty. Even if it is the size of a small book, it is still doable, although it will require more time and probably some summarizing techniques you already know. But as the size of the list grows, the amount of effort and time you must put into it also increases … until you reach a point where your brain can't process all of the information. This is exactly the case with state-of-the-art Big Data, the Internet of Things (IoT), data mining … you name it: there are massive data sets that need to be analyzed as fast as possible, but with a reasonable amount of resources to keep the costs low. How can you access the information contained in these massive data sets using the tools and languages you already know, that is, without having to put a lot of effort into learning new ones, without having to build complex structures and processes, and without moving vast amounts of data, which would take an insane amount of time? And will you be able to do it several times a day?

    Structure

    In this chapter, you will learn the following topics:

    Filtering the information

    Link relational data with storage/file system data

    What you would have to do without data virtualization

    How data virtualization simplifies querying external data

    How learning PolyBase can help you irrespective of your role

    Objectives

    After studying this chapter, you will be able to do the following:

    Identify on which side of a computer communication network the information should be filtered

    Understand the importance of relational data

    Understand the importance of storage and file system data

    Understand the benefits of data virtualization

    Understand how PolyBase can help different roles

    Filtering the information

    You have two computers, A and B. Computer A holds 1,000,000 entries and computer B holds 1,000 entries, and the entries are somehow related to each other. If you move all the entries from computer A to B, that data must travel through the network, consuming your bandwidth and leaving little or no capacity for other information to be transmitted between the other computers on the network. It also means computer B must have enough memory to store these entries, not to mention that you have now duplicated your information. Finally, computer B needs to use its CPU, memory, and disk to process the entries and link them with its local entries. So which computer should do the linking? The answer is not necessarily computer A, as you may be thinking. Computer B can process the data faster:

    If it has more memory,

    If it has additional or faster CPUs,

    If it has additional or faster disks, or

    If it is a distributed system.

    So, you must achieve a balance between moving massive amounts of data through the network and processing the data in a computer with additional resources.

    But what if you can move the 1,000 entries from computer B to A, do the filtering on computer A, and return only the matching entries to computer B? In that case, you will not saturate the network, duplicate information, or need more storage. Thus, it's not enough to compare the computing resources between two environments; it's also crucial to consider how to process the data efficiently and effectively, even testing different setups to find the one that provides the most benefits.
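
    To make the trade-off concrete, here is a back-of-envelope comparison you can run in T-SQL. The entry size (1 KB) and the effective network throughput (roughly 125 MB/s for a 1 Gbps link) are illustrative assumptions, not figures from this chapter:

```sql
-- Rough transfer-time comparison for the two strategies (assumed sizes and bandwidth).
DECLARE @entry_bytes      BIGINT = 1024;        -- assumed average size of one entry
DECLARE @bytes_per_second BIGINT = 125000000;   -- ~1 Gbps effective throughput

SELECT
    CAST(1000000 * @entry_bytes AS FLOAT) / @bytes_per_second AS seconds_moving_A_to_B,  -- ~8 s
    CAST(1000    * @entry_bytes AS FLOAT) / @bytes_per_second AS seconds_moving_B_to_A;  -- ~0.008 s
```

    The exact numbers don't matter; what matters is the three orders of magnitude between the two directions.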

    Link relational data with storage/file system data

    You may ask why you need external data when you already have a relational database. Using a relational database, you can process daily business operations such as modifying the stored information through updates (for example, a customer who has moved to another city) and deletes (for example, a customer who has cancelled a pre-order). However, you also need to consider insert operations (for example, new orders) and read operations (for example, reports), and how to guarantee data integrity between concurrent reads and writes; fast reads and fast writes cannot both be satisfied at the same time.
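
    As a quick illustration of these day-to-day operations, here is a minimal T-SQL sketch; the Customers, PreOrders, and Orders tables and their columns are hypothetical, not a schema defined in this book:

```sql
-- Typical OLTP workload against a hypothetical schema.
UPDATE dbo.Customers SET City = N'Seattle' WHERE CustomerID = 42;            -- customer moved
DELETE FROM dbo.PreOrders WHERE PreOrderID = 1001;                           -- cancelled pre-order
INSERT INTO dbo.Orders (CustomerID, OrderDate) VALUES (42, SYSDATETIME());   -- new order
SELECT CustomerID, COUNT(*) AS OrderCount                                    -- reporting read
FROM dbo.Orders
GROUP BY CustomerID;
```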

    Despite these benefits, you must consider that there are several different types of database management systems available, each with its own advantages over the others. Moreover, you may not be able to migrate all of your data onto a single one, or the data may come from third-party software that is unsupported in a different database. And you may not have enough time and money to switch to another system or develop your own.

    You may also ask why you need a relational database when you have storage and file system. The storage and file system offer the advantage of fast reads and writes, but at the expense of data integrity. Further, updating the information of a customer means updating hundreds of records, which is a slow and costly operation. Therefore, it is better suited for information generated sequentially (which won't contain customer information) and for archiving purposes (which may never need to be updated).

    The storage and file system is used because it can be optimized for parallel processing, provide cost-effective distributed and scalable processing, allow unstructured information storage and retrieval, provide real-time analysis mechanisms, and support deep learning and streaming workloads.

    While working in the field with real businesses, you will use components that are already purchased and licensed, and therefore knowing how to interconnect them is a must. Your customer has heterogeneous database management systems as well as storage and file system data; trying to change this is an enormous and costly operation that won't generate any value. It is possible that you store the customer information in a relational database, which must be transactional and concurrent, and documents, pre-orders, orders, and so on in storage and file system data, where you get fast storage and retrieval. The only way to know the relationship between both, and to extract information from one into the other, is by linking them. While doing so, you want to keep the benefits each of these technologies provides.

    What you would have to do without data virtualization

    Let's say your relational data is stored in SQL Server (a relational database management system), and your storage and file system data is stored in HDFS (the Hadoop Distributed File System). If you're familiar with SQL Server, you know you can create a linked server (a data connection) between both. Here is an article that describes how to create it: https://runops.wordpress.com/2015/10/17/create-sql-server-linked-server-to-hadoop/. Once that is established, you can retrieve the information from Hadoop into SQL Server to have it in the same format and link it with your relational data, but you won't be using the benefits of Hadoop; so it will probably end up being a long-running operation, consuming network bandwidth and memory.
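
    For reference, a query against such a linked server typically looks like the following sketch. It assumes a linked server named HADOOP_HIVE has already been created over the Hive ODBC driver as in the article above; the table and column names are purely illustrative:

```sql
-- Every Hive row returned by the inner query travels to SQL Server before the join,
-- so Hadoop's distributed processing is not used for the join itself.
SELECT c.CustomerID, c.City, h.event_type, h.event_time
FROM dbo.Customers AS c
JOIN OPENQUERY(HADOOP_HIVE,
     'SELECT customer_id, event_type, event_time FROM web_events') AS h
    ON h.customer_id = c.CustomerID;
```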

    Another possible way is to load the SQL Server data into Hadoop to have it in the same format, and then link the information and get the insight you wanted. However, this means you will need to learn about the distributed architecture (nodes) and the communication model between these nodes, how to load SQL Server data into it, how to write HiveQL (which is similar to T-SQL), how to write MapReduce jobs for summarizing information, and how to export this back into SQL Server. This will be a several months' project for each team member, and for new hires as well.

    If your data is stored in another database, like Oracle, you could create a link to SQL Server and link the information within Oracle. Here is an article that describes how to create it: https://www.sqlservercentral.com/articles/perform-data-filtering-in-oracle-link-to-sql-server. However, this means you will need to learn about Oracle, how to connect to it, and how to write PL/SQL queries, and you will need elevated permissions to create the link. Further, if you want to process the information back in SQL Server, you will need a way to bring it back, write custom logic to link and integrate the data at an application server, or create a complex setup for this.

    Wouldn't it be great to be able to query any external information within SQL Server (a tool you already know) using T-SQL (a language you're familiar with), while utilizing the characteristics each external system offers, like parallel processing and fast storage and retrieval?

    How data virtualization simplifies querying external data

    PolyBase enables your SQL Server instance to read data from external sources through T-SQL statements: first, you specify the details when creating the external table (for example, how the external data is structured), and then you query the external source like a normal table, irrespective of whether it is a database management system or a storage or file system. Because the data from these external sources comes back in the form of tables, you can easily link it to your SQL Server tables and combine both. And because PolyBase uses T-SQL for this purpose, you don't need any knowledge of the external source, or of how to configure or query it in its own language.
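
    The following sketch shows what this looks like in T-SQL for a Hadoop source. The data source location, file format, and table definition are illustrative assumptions, and the exact options vary by SQL Server version:

```sql
-- 1) Tell SQL Server where the external data lives (hypothetical name node address).
CREATE EXTERNAL DATA SOURCE HadoopCluster
WITH (TYPE = HADOOP, LOCATION = 'hdfs://namenode:8020');

-- 2) Describe how the external files are structured.
CREATE EXTERNAL FILE FORMAT PipeDelimitedText
WITH (FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR = '|'));

-- 3) Expose the files as a table; the data itself stays in HDFS.
CREATE EXTERNAL TABLE dbo.WebEvents
(
    CustomerID INT,
    EventType  NVARCHAR(50),
    EventTime  DATETIME2
)
WITH (LOCATION = '/data/web_events/',
      DATA_SOURCE = HadoopCluster,
      FILE_FORMAT = PipeDelimitedText);

-- 4) Query and join it like any local table, in plain T-SQL.
SELECT c.CustomerID, c.City, e.EventType, e.EventTime
FROM dbo.Customers AS c
JOIN dbo.WebEvents AS e ON e.CustomerID = c.CustomerID;
```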

    With PolyBase, you're also not required to install additional software in your external environment, and you won't need a separate ETL or import tool to link the data. And that's what data virtualization means: allowing the data to stay in its original location while virtually (not physically) having it available in your SQL Server instance.

    In the specific case of Hadoop, you can query unstructured information, and you can push the computation to run remotely on the Hadoop cluster when that helps optimize the overall performance. The decision to do the processing on Hadoop is based on statistics kept in SQL Server about the external table, and if the computation is chosen to run in Hadoop, PolyBase automatically creates the MapReduce jobs for the task, without you having to know how to create them, and leverages the distributed computational resources of Hadoop.
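
    If you want to override the optimizer's cost-based decision, T-SQL exposes query hints for this purpose; here is a minimal sketch against the hypothetical external table from the previous example:

```sql
-- Force the filter and aggregation to run on the Hadoop cluster as MapReduce jobs.
SELECT EventType, COUNT(*) AS Events
FROM dbo.WebEvents
WHERE EventTime >= '2021-01-01'
GROUP BY EventType
OPTION (FORCE EXTERNALPUSHDOWN);

-- OPTION (DISABLE EXTERNALPUSHDOWN) does the opposite: all rows are streamed back
-- and the work is done inside SQL Server.
```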

    Moreover, if you need enhanced performance owing to the nature of your data, you can create SQL Server scale-out groups that enable parallel data transfer between each Hadoop node and each SQL Server instance, and that also allow you to operate on this external data using each instance's computing resources.
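
    Joining an instance to a scale-out group is done with a system stored procedure; the sketch below assumes a head node machine named HEADNODE running the default instance and the default PolyBase data movement control port of 16450 (both are configuration-specific values, not ones prescribed here):

```sql
-- Run on each compute node you want to add to the head node's scale-out group:
-- head node machine name, DMS control channel port, head node SQL Server instance name.
EXEC sp_polybase_join_group 'HEADNODE', 16450, 'MSSQLSERVER';
-- Restart the PolyBase Engine and Data Movement services afterwards for the change to take effect.
```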

    How learning PolyBase can help you irrespective of your role

    Now that you know the benefits of PolyBase, you may be wondering why you need to learn it and how it can help you in your current role. The use cases are diverse, and I'm sure there's one that fits your organization. I'm citing only a few use cases here, but with these I hope to give you enough insight into how important this technology is and what it enables you to do, so that it helps you do your job faster and more easily and allows you to propose it within your organization for a situation where it fits well.

    As a database administrator (DBA), you have long-running processes that move information from one place to another, and that information is critical for the business decision support systems. When there is delay, or the process fails, your customer starts losing money and won't be willing to wait for the process to be restarted or lose a whole day of work. PolyBase can accelerate this process thanks to the parallel processing it offers.

    As a data engineer, you divide and sample the information from all data stores, which requires you to learn each data store system's basics and then gather the required information. PolyBase doesn't require you to know anything other than SQL Server and T-SQL, thus simplifying your job.

    As a data scientist, you perform exploratory data analysis before working on the whole data set, which requires you to work on large amounts of information using a large number of resources. PolyBase allows you to easily work on subsets of data using only SQL Server.

    As a developer, your main goal is to develop fast and efficient programs irrespective of where the data is located. PolyBase allows you to avoid using a linked server, which is slow.

    In a business intelligence (BI) role, you're more interested in the external data to be available than the details about how it works. PolyBase allows you to query the external data without moving all of it, and before all of it has been moved from one point to the other.

    In a machine learning (ML) role, you're more interested in pre-processing the data than learning where the data comes from and how to query it. PolyBase does exactly that.

    As a systems architect, you provision the components that bring the most value to your customer and simplify the existing ones, thus reducing costs. This decision is driven by the fact that these components can be interconnected easily, which PolyBase allows you to do.

    As an entrepreneur, you try different setups and configurations before deciding on which ones to use, and so you need to know if you can interconnect them. PolyBase allows you to easily interconnect them.

    In a financial role, you reduce costs, including those associated with training personnel on old and new tools and languages, switching from paid to open-source third-party software, or deciding when to do internal development to cut expenses. PolyBase only requires knowledge of SQL Server and T-SQL, allows the use of open-source storage and file systems, and reduces dependence on ETL tools and paid software, as anyone with access to the external tables can get the data they need.

    As a customer, you want your data to be processed in one pass so it is available for you to perform your job faster. Learning PolyBase can turn this into a reality.

    As a support technician or incident response member, you must do troubleshooting before calling the appropriate team: development, network, storage, database administration, and so on. If PolyBase is used, you need to know how it works and how to troubleshoot it.

    As a data migration specialist, you transfer data between different systems with different collations and encodings. PolyBase facilitates this and reduces the costs incurred for specialized tool licenses.

    Conclusion

    It is a fact that you will end up working with heterogeneous database systems and storage solutions, so it is important to know the most efficient way to perform your computations. You must do this to obtain insight from your information, so you have to do it efficiently and, if possible, quickly, using the tools and languages you're most familiar with. Without data virtualization, you either have to do a lot of work or do it inefficiently; with PolyBase, you can virtualize your data and consume it as if it were local. This is a technology everyone should be aware of, as it can help you achieve your business goals.

    In the next chapter, we will see the detailed history of PolyBase.

    Points to remember

    The process of linking data needs to be considered and tested to ensure it is efficient.

    The relational data is as important as storage and file system data.

    Data can get virtualized using a technology available to you.

    PolyBase facilitates data virtualization and accelerates computation.

    Multiple choice questions

    1. Which resource do you consider most important when deciding where the computation needs to be done between a pair of computers?

       a. Network bandwidth
       b. CPU
       c. Memory
       d. Storage

    2. Which are the benefits of a relational database?

       a. Data integrity
       b. Scalable processing
       c. Concurrency
       d. Fast storage and retrieval

    3. What are the benefits of storage and file system data?

       a. Concurrency
       b. Scalable processing
       c. Data integrity
       d. Fast storage and retrieval

    4. How does PolyBase help?

       a. Learn the details and the language of the external source
       b. Decide where to perform the computation
       c. Perform parallel data transfer and operate on external data
       d. Install additional software on the external source

    Answers

    1. a
    2. a, c
    3. b, c
    4. b, c

    Questions

    Between a pair of computers, which do you think is more suitable for filtering the information?

    What are your reasons to link relational data with storage and file system data?

    What is data virtualization in your own words?

    How do you think PolyBase can help you in your current role?

    CHAPTER 2

    History of PolyBase

    PolyBase was first announced on November 07, 2012, during the SQL PASS Summit in Seattle, and it was presented along with other new technologies such as Hekaton and updateable columnstore indexes. During the three days of the summit, representatives from Microsoft talked about the opportunity the new technologies offer for reintegrating and rewiring the economy around the changing value of information in all businesses, enabling us to gain insights from any data, of any size, and from anywhere. However, PolyBase would not have been possible without Parallel Data Warehouse (PDW), released in 2010.

    PolyBase was expected to be released in 2013 with the new version, SQL Server PDW 2012 (v2), and when released it was only able to connect to Microsoft's HDInsight implementation (Hadoop on Windows Server and Windows Azure), requiring only the Oracle Java Runtime Environment (JRE) as a third-party component. Knowing its origins and how it has developed over time, you can see where it is headed and what you can expect it to achieve.

    Structure

    In this chapter, you will learn the following topics:

    The data warehousing market

    November 2010, the basis: Parallel Data Warehouse (PDW)

    November 2012: PolyBase official announcement at the SQL PASS session

    July 12, 2013: PDW 2012 (v2) release

    May 1, 2014: PDW v2 AU1 also known as Analytics Platform System (APS)

    Other APS AU releases (2, 3, 4, and 5)

    SQL Server 2016

    SQL Server 2017

    Objectives

    After studying this chapter, you will be able to do the following:

    Familiarize yourself with the terminology required to read technical papers from Microsoft regarding their Big Data solutions and decide which one better suits your needs

    Have a better understanding of the underlying technology of PolyBase and where it came from

    Understand the importance of Parallel Data Warehouse (PDW) 2010 with respect to PolyBase

    Understand the importance of PolyBase as described in SQL PASS 2012

    Identify the characteristics of PolyBase in PDW 2012 (v2) release

    Identify the characteristics of PolyBase in Analytics Platform System (APS)

    Identify the characteristics of PolyBase in SQL Server 2016

    Identify the characteristics of PolyBase in SQL Server 2017

    Understand the limitations of PolyBase

    The data warehousing market

    It was 2008, and Teradata was leading the data warehousing and analytics market with around thirty years of experience. They were pioneers of high scalability with an implementation that didn’t rely on hardware for its parallelism, scalability, reliability, or availability, and that ran over Linux or Windows. They had a high-performance decision-support engine, a truly parallel implementation that automatically distributed data and balanced workload without replication, serialization, or merging. And they also had a technology called Teradata Virtual Storage, which moved hot data to faster disks or faster blocks within a disk.

    It was also during this year that several startups specialized in data warehousing and analytics had already consolidated and were harvesting the results of their good work, while major enterprises, foreseeing the Big Data and IoT future, were trying to enter this market. This can be inferred because the same functionality PolyBase provides had already been addressed by others to ingest data stored in Hadoop (the first value in parentheses is the year the product was created or when the functionality was first provided):

    Netezza (1999, acquired in 2010 by IBM): It reads data through its HTTP interface; so unfortunately, multiple nodes can’t be read in parallel.

    Greenplum (2003, acquired in 2010 by EMC Corporation): It allows files to be queried as relational tables, with syntax similar to that of PolyBase.

    Aster (2005, acquired in 2010 by Teradata): It is able to parallelize the work on each worker node, allowing each node to extract a different part of the data in parallel and insert it into a partitioned temporary table.

    Oracle (version 9i in 2003, 10g in 2006): It allows the information to be queried without having to be pre-loaded. Version 10g external tables are read-write; previously these were read-only.

    Vertica (2005, acquired in 2011 by Hewlett Packard): It is similar to Greenplum, Aster, and Oracle.

    Sqoop (2009, part of Apache since 2012): It is able to move tables in and out of Hadoop, and generate Java classes that allow MapReduce to interact with relational data.

    Hadapt (2010, purchased in 2014 by Teradata): It enables the execution of SQL-like queries across unstructured and structured data using split query processing that creates MapReduce jobs executing in parallel.

    Oracle entered this market with Exadata Database Machine, offering a hardware and software combination capable of running OLTP simultaneously with analytics, and providing extreme performance and scalability, the ability to perform up to 1 million input/output operations per second (IOPS), and running the most important database applications ten times faster or more. This is possible because all disks can operate in parallel and all processing is moved to storage (including decryption), thus reducing CPU and network consumption.

    SQL Server introduced Predicate Pushdown to the storage engine in SQL Server 2016, about eight years later. Also, in this version, you can find multiple improvements to support OLTP simultaneously with analytics.

    According to its technical specifications, SQL Server was capable of addressing hundreds of petabytes of information, but processing slowed down as the size increased. So, in order to maintain its competitive edge in the market, in July 2008 Microsoft announced the purchase of DATAllegro, a company specializing in data warehousing since 2003. Its main features were an architecture implemented on commodity hardware and an open-source software stack: the Ingres DBMS running on Linux. Microsoft had to work on merging this technology with SQL Server, which was not an easy task, so they started a project codenamed Madison, which was expected to be released in the second quarter of 2010.

    November 2010, the basis: Parallel Data Warehouse (PDW)

    It was not until the Professional Association for SQL Server (PASS) Community Summit on November 09, 2010, that Ted Kummert, the then senior vice-president of the Business Platform Division at Microsoft, announced the availability of Microsoft SQL Server 2008 R2 PDW, also known by Hewlett-Packard as Enterprise Data Warehouse (EDW), targeted at high-end businesses at a lower price than its competitors. It was sold for about $2 million without licenses and support, with the software starting at $841,610 and the hardware at $900,000.

    It consisted of multiple SQL Server 2008 R2 instances running on Windows Server 2008 R2 and on specific pre-configured Hewlett-Packard hardware, offering high scalability when needed. Its architecture was a control node, the brain that managed query execution and metadata for what is stored and what is processed on each node, and multiple compute nodes that performed the actual storage and computations in parallel, either by having tables replicated across nodes (execute bits of the same request simultaneously) or distributed across nodes (determine which node contains the data and therefore needs to do the actual processing). Part of the technology incorporated into it included a parallel database copy that enabled rapid data movement and consistency between PDW and data marts used by SQL Server Analysis Services (SSAS). The user interface was Nexus Chameleon, as SSMS had not been reworked to connect to the control node.
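
    To make the replicated-versus-distributed distinction concrete, here is a sketch in the PDW-family T-SQL dialect (the same table options later carried over to APS and Azure Synapse); the table names and distribution column are illustrative assumptions:

```sql
-- A small dimension table copied to every compute node, so joins against it stay local.
CREATE TABLE dbo.DimRegion
(
    RegionID   INT,
    RegionName NVARCHAR(50)
)
WITH (DISTRIBUTION = REPLICATE);

-- A large fact table spread across compute nodes by a hash of the distribution column,
-- so each node stores and processes only its own slice of the data.
CREATE TABLE dbo.FactSales
(
    SaleID   BIGINT,
    RegionID INT,
    Amount   DECIMAL(18, 2)
)
WITH (DISTRIBUTION = HASH(SaleID));
```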

    It offered 200 times faster queries and 10 times more scalability than traditional deployments, handling up to 600 TB of data. A working use case was Information Security Consolidated Event Management (ICE), which was migrated to PDW: query performance improved to an average of fifteen to twenty times faster, SQL Server Integration Services (SSIS) data loads reached throughput of up to 285 GB/hour with minimal query performance impact, and up to 12 TB/day in throughput was supported.

    It was sold with the premise that it required less DBA maintenance and monitoring, so DBAs could spend more time architecting and not babysitting the database, thus preventing blocks, locks, or waits, and not requiring indexes, archiving, deleting, query hints, IO tuning, query optimization, partitioning, managing filegroups, or shrinking databases. Because of all the new development to integrate SQL Server with DATAllegro, not all features of SQL Server were supported in this version. It didn't include PolyBase, but it included a component that would later be critical in enabling this technology: the Data Movement Service (DMS). This was the interface between the Massively Parallel Processing (MPP) engine and the actual data in the control, compute, and landing zone nodes. It was responsible for moving data around nodes as needed, and it enabled parallel operations among the compute nodes (queries, loads, and so on).

    DMS is still used in SQL Server 2019.

    Hadoop was the leader in Big Data and IoT storage, being a distributed and cost-effective solution, and because Microsoft was going to implement its own version of Hadoop (which occurred in 2013 with Azure HDInsight), they considered supporting querying HDFS data directly from the next version of PDW. This was not possible on another Microsoft product because big data requires big processing, and PDW was the only one capable of handling such huge amounts of data.

    November 2012: PolyBase official announcement at the SQL PASS session

    It was during the PASS Community Summit in November 2012 that Microsoft announced the next version of PDW containing PolyBase.

    The first session (November 07, 2012) was presented by Ted Kummert, who was the corporate vice-president of the data platform at Microsoft at that time. He talked about how business intelligence solutions help people be better at their jobs and businesses move forward, and how Hadoop and MapReduce had matured and could be applied to a broader set of problems, like machine learning and large-scale web applications, where you had to store vast amounts of unstructured data. For him, big data was about new insights, the latent value within your current data, and adding new sources of data (any data, anywhere, any size), and the main target was business acceleration through faster time-to-insight, making it easier for every end user not familiar with technology to gain insights and for the business to operationalize them. He mentioned how they went to David DeWitt and his team and asked how to gain value out of multiple different forms of storage, processing engines, and capabilities, and how David's team unified this into the T-SQL query processor as a base to support other types of data in the future.

    Then, with the help of Christian Kleinerman, the then general manager at Microsoft, they demoed loading a file containing forum comments about Microsoft products into Hadoop HDFS. Then, in PDW v2, he created an external table, giving it a name, a schema, and the location of the Hadoop cluster. He performed a T-SQL SELECT on that table to view ten records, and then he joined it to a relational table. He then mentioned that there are moments when you have a question you know is answered by the data out there, and when you don't have that data, you realize you need business intelligence. They ended the demo with a 1 PB data warehouse query finishing in less than two seconds.

    The second session (November 08, 2012) was presented by Quentin Clark, the then corporate vice-president at Microsoft; during it, he related that, since 2010, he had been demoing the new SQL Server 2012 functionalities for the next release.

    His first use case referred to the business of running an election, where your job was to look at multiple signals (large-scale data) from multiple data sources, with information you produced as well as that coming from outside, which turned this into a big data problem and changed how a business worked.

    His second use case referred to a large hotel chain that used RFID not only for the room doors, but throughout the hotels to see who goes where, what drives people to take an action … basically observe the behavior inside the hotel and merge it with customer information to get a better profile of their guests. But they wanted to join it to social media information in order to reach into your preferences and see your interests and activities to customize and tailor your experience in the hotel.

    His third use case referred to retail chains and how to do things in real time, like changing the music playing based on who’s in the store, the demographics, and purchase history. This required interacting with their music provider service based on analytics in real-time. Furthermore, they wanted to install digital displays where their providers could advertise or provide coupons based on who was standing next to it; in his own words, that’s a very different plumbing of the economy.

    His fourth use case referred to a package shipping company, which sold information to financial companies. Although this was not their core business, it was valuable for other companies, so it could become a revenue stream.

    It was a lot of data that needed to be shaped into something understandable, something to derive value from, and provided in a way and with tools that engaged people, so they could listen to it and hear what it was trying to tell them. Only then could they collaborate, share those stories, operationalize those insights for a business process to back that up and be able
