Hands-on Data Virtualization with Polybase: Administer Big Data, SQL Queries and Data Accessibility Across Hadoop, Azure, Spark, Cassandra, MongoDB, CosmosDB, MySQL and PostgreSQL (English Edition)
Ebook, 803 pages, 5 hours

About this ebook

This book covers establishing and managing data virtualization with PolyBase. It teaches you how to configure PolyBase against most relational and non-relational databases, how to set up a test environment for any of the tools or software involved quickly and without hassle, and how to design and build high-performing data warehousing solutions in a matter of minutes.
You will become proficient at connecting to many data sources, including Hadoop, Cassandra, MySQL, PostgreSQL, MariaDB, and Oracle Database. The book also covers building big data clusters on Azure and working with Azure Synapse Analytics. By the end of this book, you will not only administer PolyBase for managing big data clusters but also know how to optimize and boost its performance to enable data analytics and easier data access.
Language: English
Release date: Apr 2, 2021
ISBN: 9789390684427

    Book preview

    Hands-on Data Virtualization with Polybase - Pablo Alejandro Echeverria Barrios

    CHAPTER 1

    Data Virtualization

    Imagine you have a list of information. If the list is relatively small, you can read it from start to end and summarize it quickly and without difficulty. Even if it is the size of a small book, it is still doable, although it will require more time and probably some summarizing techniques you already know. But as the size of the list grows, the amount of effort and time you must put into it also increases … until you reach a point where your brain can't process all of the information. This is exactly the case with state-of-the-art Big Data, the Internet of Things (IoT), data mining … you name it: there are massive data sets that need to be analyzed as fast as possible, but with a reasonable amount of resources to keep the costs low. How can you access the information contained in these massive data sets using the tools and languages you already know, that is, without having to put a lot of effort into learning new ones, without having to build complex structures and processes, and without moving vast amounts of data, which would take an insane amount of time? And will you be able to do it several times a day?

    Structure

    In this chapter, you will learn the following topics:

    Filtering the information

    Link relational data with storage/file system data

    What you would have to do without data virtualization

    How data virtualization simplifies querying external data

    How learning PolyBase can help you irrespective of your role

    Objectives

    After studying this chapter, you will be able to do the following:

    Identify on which side of a computer communication network the information should be filtered

    Understand the importance of relational data

    Understand the importance of storage and file system data

    Understand the benefits of data virtualization

    Understand how PolyBase can help different roles

    Filtering the information

    You have two computers, A and B. Computer A holds 1,000,000 entries and computer B holds 1,000 entries, and the entries are somehow related to each other. If you move all the entries from computer A to B, that data must travel through the network, consuming your bandwidth and leaving little or no capacity for other information to be transmitted between the other computers on the network. It also means computer B must have enough memory to store these entries, not to mention that you have now duplicated your information. Finally, computer B needs to use its CPU, memory, and disk to process the entries and link them with its local entries. So which computer should do the linking? The answer is not necessarily computer A, as you may be thinking. Computer B can process the data faster:

    If it has more memory,

    If it has additional or faster CPUs,

    If it has additional or faster disks, or

    If it is a distributed system.

    So, you must achieve a balance between moving massive amounts of data through the network and processing the data in a computer with additional resources.

    But what if you can move the 1,000 entries from computer B to A, do the filtering on computer A, and return only the matching entries to computer B? In that case, you will not saturate the network, duplicate information, or need more storage. Thus, it's not enough to compare the computing resources between two environments; it's also crucial to consider how to process the data efficiently and effectively, even testing different setups to find the one that provides the most benefits.
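
    To make the trade-off concrete, here is a back-of-envelope comparison you can run in T-SQL. The entry size (1 KB) and the effective network throughput (roughly 125 MB/s for a 1 Gbps link) are illustrative assumptions, not figures from this chapter:

```sql
-- Rough transfer-time comparison for the two strategies (assumed sizes and bandwidth).
DECLARE @entry_bytes      BIGINT = 1024;        -- assumed average size of one entry
DECLARE @bytes_per_second BIGINT = 125000000;   -- ~1 Gbps effective throughput

SELECT
    CAST(1000000 * @entry_bytes AS FLOAT) / @bytes_per_second AS seconds_moving_A_to_B,  -- ~8 s
    CAST(1000    * @entry_bytes AS FLOAT) / @bytes_per_second AS seconds_moving_B_to_A;  -- ~0.008 s
```

    The exact numbers don't matter; what matters is the three orders of magnitude between the two directions.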

    Link relational data with storage/file system data

    You may ask why you need external data when you already have a relational database. Using a relational database, you can process daily business operations such as modifying the stored information through updates (for example, a customer who has moved to another city) and deletes (for example, a customer who has cancelled a pre-order). However, you also need to consider insert operations (for example, new orders) and read operations (for example, reports), and how to guarantee data integrity between concurrent reads and writes; fast reads and fast writes cannot both be satisfied at the same time.
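
    As a quick illustration of these day-to-day operations, here is a minimal T-SQL sketch; the Customers, PreOrders, and Orders tables and their columns are hypothetical, not a schema defined in this book:

```sql
-- Typical OLTP workload against a hypothetical schema.
UPDATE dbo.Customers SET City = N'Seattle' WHERE CustomerID = 42;            -- customer moved
DELETE FROM dbo.PreOrders WHERE PreOrderID = 1001;                           -- cancelled pre-order
INSERT INTO dbo.Orders (CustomerID, OrderDate) VALUES (42, SYSDATETIME());   -- new order
SELECT CustomerID, COUNT(*) AS OrderCount                                    -- reporting read
FROM dbo.Orders
GROUP BY CustomerID;
```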

    Despite these benefits, you must consider that there are several different types of database management systems available, each with its own advantages over the others. Moreover, you may not be able to migrate all of your data onto a single one, or the data may come from third-party software that is unsupported in a different database. And you may not have enough time and money to switch to another system or develop your own.

    You may also ask why you need a relational database when you have storage and file system. The storage and file system offer the advantage of fast reads and writes, but at the expense of data integrity. Further, updating the information of a customer means updating hundreds of records, which is a slow and costly operation. Therefore, it is better suited for information generated sequentially (which won't contain customer information) and for archiving purposes (which may never need to be updated).

    The storage and file system is used because it can be optimized for parallel processing, provide cost-effective distributed and scalable processing, allow unstructured information storage and retrieval, provide real-time analysis mechanisms, and support deep learning and streaming workloads.

    While working in the field with real businesses, you will use components that are already purchased and licensed, and therefore knowing how to interconnect them is a must. Your customer has heterogeneous database management systems as well as storage and file system data; trying to change this is an enormous and costly operation that won't generate any value. It is possible that you store the customer information in a relational database, which must be transactional and concurrent, and documents, pre-orders, orders, and so on in storage and file system data, where you get fast storage and retrieval. The only way to know the relationship between both, and to extract information from one into the other, is by linking them. While doing so, you want to keep the benefits each of these technologies provides.

    What you would have to do without data virtualization

    Let's say your relational data is stored in SQL Server (a relational database management system), and your storage and file system data is stored in HDFS (the Hadoop Distributed File System). If you're familiar with SQL Server, you know you can create a linked server (a data connection) between both. Here is an article that describes how to create it: https://runops.wordpress.com/2015/10/17/create-sql-server-linked-server-to-hadoop/. Once that is established, you can retrieve the information from Hadoop into SQL Server to have it in the same format and link it with your relational data, but you won't be using the benefits of Hadoop; so it will probably end up being a long-running operation, consuming network bandwidth and memory.
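
    For reference, a query against such a linked server typically looks like the following sketch. It assumes a linked server named HADOOP_HIVE has already been created over the Hive ODBC driver as in the article above; the table and column names are purely illustrative:

```sql
-- Every Hive row returned by the inner query travels to SQL Server before the join,
-- so Hadoop's distributed processing is not used for the join itself.
SELECT c.CustomerID, c.City, h.event_type, h.event_time
FROM dbo.Customers AS c
JOIN OPENQUERY(HADOOP_HIVE,
     'SELECT customer_id, event_type, event_time FROM web_events') AS h
    ON h.customer_id = c.CustomerID;
```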

    Another possible way is to load the SQL Server data into Hadoop to have it in the same format, and then link the information and get the insight you wanted. However, this means you will need to learn about the distributed architecture (nodes) and the communication model between these nodes, how to load SQL Server data into it, how to write HiveQL (which is similar to T-SQL), how to write MapReduce jobs for summarizing information, and how to export this back into SQL Server. This will be a several months' project for each team member, and for new hires as well.

    If your data is stored in another database, like Oracle, you could create a link to SQL Server and link the information within Oracle. Here is an article that describes how to create it: https://www.sqlservercentral.com/articles/perform-data-filtering-in-oracle-link-to-sql-server. However, this means you will need to learn about Oracle, how to connect to it, and how to write PL/SQL queries, and you will need elevated permissions to create the link. Further, if you want to process the information back in SQL Server, you will need a way to bring it back, write custom logic to link and integrate the data at an application server, or create a complex setup for this.

    Wouldn't it be great to be able to query any external information within SQL Server (a tool you already know) using T-SQL (a language you're familiar with), while utilizing the characteristics each external system offers, like parallel processing and fast storage and retrieval?

    How data virtualization simplifies querying external data

    PolyBase enables your SQL Server instance to read data from external sources through T-SQL statements: first, you specify the details when creating the external table (for example, how the external data is structured), and then you query the external source like a normal table, irrespective of whether it is a database management system or a storage or file system. Because the data from these external sources comes back in the form of tables, you can easily link it to your SQL Server tables and combine both. And because PolyBase uses T-SQL for this purpose, you don't need any knowledge of the external source, or of how to configure or query it in its own language.
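
    The following sketch shows what this looks like in T-SQL for a Hadoop source. The data source location, file format, and table definition are illustrative assumptions, and the exact options vary by SQL Server version:

```sql
-- 1) Tell SQL Server where the external data lives (hypothetical name node address).
CREATE EXTERNAL DATA SOURCE HadoopCluster
WITH (TYPE = HADOOP, LOCATION = 'hdfs://namenode:8020');

-- 2) Describe how the external files are structured.
CREATE EXTERNAL FILE FORMAT PipeDelimitedText
WITH (FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR = '|'));

-- 3) Expose the files as a table; the data itself stays in HDFS.
CREATE EXTERNAL TABLE dbo.WebEvents
(
    CustomerID INT,
    EventType  NVARCHAR(50),
    EventTime  DATETIME2
)
WITH (LOCATION = '/data/web_events/',
      DATA_SOURCE = HadoopCluster,
      FILE_FORMAT = PipeDelimitedText);

-- 4) Query and join it like any local table, in plain T-SQL.
SELECT c.CustomerID, c.City, e.EventType, e.EventTime
FROM dbo.Customers AS c
JOIN dbo.WebEvents AS e ON e.CustomerID = c.CustomerID;
```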

    With PolyBase, you're also not required to install additional software in your external environment, and you won't need a separate ETL or import tool to link the data. And that's what data virtualization means: allowing the data to stay in its original location while virtually (not physically) having it available in your SQL Server instance.

    In the specific case of Hadoop, you can query unstructured information, and you can push the computation to run remotely on the Hadoop cluster when that helps optimize the overall performance. The decision to do the processing on Hadoop is based on statistics kept in SQL Server about the external table, and if the computation is chosen to run in Hadoop, PolyBase automatically creates the MapReduce jobs for the task, without you having to know how to create them, and leverages the distributed computational resources of Hadoop.
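
    If you want to override the optimizer's cost-based decision, T-SQL exposes query hints for this purpose; here is a minimal sketch against the hypothetical external table from the previous example:

```sql
-- Force the filter and aggregation to run on the Hadoop cluster as MapReduce jobs.
SELECT EventType, COUNT(*) AS Events
FROM dbo.WebEvents
WHERE EventTime >= '2021-01-01'
GROUP BY EventType
OPTION (FORCE EXTERNALPUSHDOWN);

-- OPTION (DISABLE EXTERNALPUSHDOWN) does the opposite: all rows are streamed back
-- and the work is done inside SQL Server.
```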

    Moreover, if you need enhanced performance owing to the nature of your data, you can create SQL Server scale-out groups that enable parallel data transfer between each Hadoop node and each SQL Server instance, and that also allow you to operate on this external data using each instance's computing resources.
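
    Joining an instance to a scale-out group is done with a system stored procedure; the sketch below assumes a head node machine named HEADNODE running the default instance and the default PolyBase data movement control port of 16450 (both are configuration-specific values, not ones prescribed here):

```sql
-- Run on each compute node you want to add to the head node's scale-out group:
-- head node machine name, DMS control channel port, head node SQL Server instance name.
EXEC sp_polybase_join_group 'HEADNODE', 16450, 'MSSQLSERVER';
-- Restart the PolyBase Engine and Data Movement services afterwards for the change to take effect.
```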

    How learning PolyBase can help you irrespective of your role

    Now that you know the benefits of PolyBase, you may be wondering why you need to learn it and how it can help you in your current role. The use cases are diverse, and I'm sure there's one that fits your organization. I'm citing only a few use cases here, but with these I hope to give you enough insight into how important this technology is and what it enables you to do, so that it helps you do your job faster and more easily and allows you to propose it within your organization for a situation where it fits well.

    As a database administrator (DBA), you have long-running processes that move information from one place to another, and that information is critical for the business decision support systems. When there is delay, or the process fails, your customer starts losing money and won't be willing to wait for the process to be restarted or lose a whole day of work. PolyBase can accelerate this process thanks to the parallel processing it offers.

    As a data engineer, you divide and sample the information from all data stores, which requires you to learn each data store system's basics and then gather the required information. PolyBase doesn't require you to know anything other than SQL Server and T-SQL, thus simplifying your job.

    As a data scientist, you perform exploratory data analysis before working on the whole data set, which requires you to work on large amounts of information using a large number of resources. PolyBase allows you to easily work on subsets of data using only SQL Server.

    As a developer, your main goal is to develop fast and efficient programs irrespective of where the data is located. PolyBase allows you to avoid using a linked server, which is slow.

    In a business intelligence (BI) role, you're more interested in the external data to be available than the details about how it works. PolyBase allows you to query the external data without moving all of it, and before all of it has been moved from one point to the other.

    In a machine learning (ML) role, you're more interested in pre-processing the data than learning where the data comes from and how to query it. PolyBase does exactly that.

    As a systems architect, you provision the components that bring the most value to your customer and simplify the existing ones, thus reducing costs. This decision is driven by the fact that these components can be interconnected easily, which PolyBase allows you to do.

    As an entrepreneur, you try different setups and configurations before deciding on which ones to use, and so you need to know if you can interconnect them. PolyBase allows you to easily interconnect them.

    In a financial role, you reduce costs, including those associated with training personnel on old and new tools and languages, switching from paid to open-source third-party software, or deciding when to do internal development to cut expenses. PolyBase only requires knowledge of SQL Server and T-SQL, allows the use of open-source storage and file systems, and reduces dependence on ETL tools and paid software, as anyone with access to the external tables can get the data they need.

    As a customer, you want your data to be processed in one pass so it is available for you to perform your job faster. Learning PolyBase can turn this into a reality.

    As a support technician or incident response member, you must do troubleshooting before calling the appropriate team: development, network, storage, database administration, and so on. If PolyBase is used, you need to know how it works and how to troubleshoot it.

    As a data migration specialist, you transfer data between different systems with different collations and encodings. PolyBase facilitates this and reduces the costs incurred for specialized tool licenses.

    Conclusion

    It is a fact that you will end up working with heterogeneous database systems and storage solutions, so it is important to know the most efficient way to perform your computations. You must do this to obtain insight from your information, so you have to do it efficiently and, if possible, quickly, using the tools and languages you're most familiar with. Without data virtualization, you either have to do a lot of work or do it inefficiently; with PolyBase, you can virtualize your data and consume it as if it were local. This is a technology everyone should be aware of, as it can help you achieve your business goals.

    In the next chapter, we will see the detailed history of PolyBase.

    Points to remember

    The process of linking data needs to be considered and tested to ensure it is efficient.

    The relational data is as important as storage and file system data.

    Data can get virtualized using a technology available to you.

    PolyBase facilitates data virtualization and accelerates computation.

    Multiple choice questions

    1. Which resource do you consider most important when deciding where the computation needs to be done between a pair of computers?

       a. Network bandwidth
       b. CPU
       c. Memory
       d. Storage

    2. Which are the benefits of a relational database?

       a. Data integrity
       b. Scalable processing
       c. Concurrency
       d. Fast storage and retrieval

    3. What are the benefits of storage and file system data?

       a. Concurrency
       b. Scalable processing
       c. Data integrity
       d. Fast storage and retrieval

    4. How does PolyBase help?

       a. Learn the details and the language of the external source
       b. Decide where to perform the computation
       c. Perform parallel data transfer and operate on external data
       d. Install additional software on the external source

    Answers

    1. a
    2. a, c
    3. b, c
    4. b, c

    Questions

    Between a pair of computers, which do you think is more suitable for filtering the information?

    What are your reasons to link relational data with storage and file system data?

    What is data virtualization in your own words?

    How do you think PolyBase can help you in your current role?

    CHAPTER 2

    History of PolyBase

    PolyBase was first announced on November 07, 2012, during the SQL PASS Summit in Seattle, and it was presented along with other new technologies such as Hekaton and updateable columnstore indexes. During the three days of the summit, representatives from Microsoft talked about the opportunity the new technologies offer for reintegrating and rewiring the economy around the changing value of information in all businesses, enabling us to gain insights from any data, of any size, and from anywhere. However, PolyBase would not have been possible without Parallel Data Warehouse (PDW), released in 2010.

    PolyBase was expected to be released in 2013 with the new version, SQL Server PDW 2012 (v2), and when released it was only able to connect to Microsoft's HDInsight implementation (Hadoop on Windows Server and Windows Azure), requiring only the Oracle Java Runtime Environment (JRE) as a third-party component. Knowing its origins and how it has developed over time, you can see where it is headed and what you can expect it to achieve.

    Structure

    In this chapter, you will learn the following topics:

    The data warehousing market

    November 2010, the basis: Parallel Data Warehouse (PDW)

    November 2012: PolyBase official announcement at the SQL PASS session

    July 12, 2013: PDW 2012 (v2) release

    May 1, 2014: PDW v2 AU1 also known as Analytics Platform System (APS)

    Other APS AU releases (2, 3, 4, and 5)

    SQL Server 2016

    SQL Server 2017

    Objectives

    After studying this chapter, you will be able to do the following:

    Familiarize yourself with the terminology required to read technical papers from Microsoft regarding their Big Data solutions and decide which one better suits your needs

    Have a better understanding of the underlying technology of PolyBase and where it came from

    Understand the importance of Parallel Data Warehouse (PDW) 2010 with respect to PolyBase

    Understand the importance of PolyBase as described in SQL PASS 2012

    Identify the characteristics of PolyBase in PDW 2012 (v2) release

    Identify the characteristics of PolyBase in Analytics Platform System (APS)

    Identify the characteristics of PolyBase in SQL Server 2016

    Identify the characteristics of PolyBase in SQL Server 2017

    Understand the limitations of PolyBase

    The data warehousing market

    It was 2008, and Teradata was leading the data warehousing and analytics market with around thirty years of experience. They were pioneers of high scalability with an implementation that didn’t rely on hardware for its parallelism, scalability, reliability, or availability, and that ran over Linux or Windows. They had a high-performance decision-support engine, a truly parallel implementation that automatically distributed data and balanced workload without replication, serialization, or merging. And they also had a technology called Teradata Virtual Storage, which moved hot data to faster disks or faster blocks within a disk.

    It was also during this year that several startups specialized in data warehousing and analytics had already consolidated and were harvesting the results of their good work, while major enterprises, foreseeing the Big Data and IoT future, were trying to enter this market. This can be inferred because the same functionality PolyBase provides had already been addressed by others to ingest data stored in Hadoop (the first value in parentheses is the year the product was created or when the functionality was first provided):

    Netezza (1999, acquired in 2010 by IBM): It reads data through its HTTP interface; so unfortunately, multiple nodes can’t be read in parallel.

    Greenplum (2003, acquired in 2010 by EMC Corporation): It allows files to be queried as relational tables, with syntax similar to that of PolyBase.

    Aster (2005, acquired in 2010 by Teradata): It is able to parallelize the work on each worker node, allowing each node to extract a different part of the data in parallel and insert it into a partitioned temporary table.

    Oracle (version 9i in 2003, 10g in 2006): It allows the information to be queried without having to be pre-loaded. Version 10g external tables are read-write; previously these were read-only.

    Vertica (2005, acquired in 2011 by Hewlett Packard): It is similar to Greenplum, Aster, and Oracle.

    Sqoop (2009, part of Apache since 2012): It is able to move tables in and out of Hadoop, and generate Java classes that allow MapReduce to interact with relational data.

    Hadapt (2010, purchased in 2014 by Teradata): It enables the execution of SQL-like queries across unstructured and structured data using split query processing that creates MapReduce jobs executing in parallel.

    Oracle entered this market with Exadata Database Machine, offering a hardware and software combination capable of running OLTP simultaneously with analytics, and providing extreme performance and scalability, the ability to perform up to 1 million input/output operations per second (IOPS), and running the most important database applications ten times faster or more. This is possible because all disks can operate in parallel and all processing is moved to storage (including decryption), thus reducing CPU and network consumption.

    SQL Server introduced Predicate Pushdown to the storage engine in SQL Server 2016, about eight years later. Also, in this version, you can find multiple improvements to support OLTP simultaneously with analytics.

    According to its technical specifications, SQL Server was capable of addressing hundreds of petabytes of information, but processing slowed down as the size increased. So, in order to maintain its competitive edge in the market, in July 2008 Microsoft announced the purchase of DATAllegro, a company specializing in data warehousing since 2003. Its main features were an architecture implemented on commodity hardware and an open-source software stack: the Ingres DBMS running on Linux. Microsoft had to work on merging this technology with SQL Server, which was not an easy task, so they started a project codenamed Madison, which was expected to be released in the second quarter of 2010.

    November 2010, the basis: Parallel Data Warehouse (PDW)

    It was not until the Professional Association for SQL Server (PASS) Community Summit on November 09, 2010, that Ted Kummert, the then senior vice-president of the Business Platform Division at Microsoft, announced the availability of Microsoft SQL Server 2008 R2 PDW, also known by Hewlett-Packard as Enterprise Data Warehouse (EDW), targeted at high-end businesses at a lower price than its competitors. It was sold for about $2 million without licenses and support, with the software starting at $841,610 and the hardware at $900,000.

    It consisted of multiple SQL Server 2008 R2 instances running on Windows Server 2008 R2 and on specific pre-configured Hewlett-Packard hardware, offering high scalability when needed. Its architecture was a control node, the brain that managed query execution and metadata for what is stored and what is processed on each node, and multiple compute nodes that performed the actual storage and computations in parallel, either by having tables replicated across nodes (execute bits of the same request simultaneously) or distributed across nodes (determine which node contains the data and therefore needs to do the actual processing). Part of the technology incorporated into it included a parallel database copy that enabled rapid data movement and consistency between PDW and data marts used by SQL Server Analysis Services (SSAS). The user interface was Nexus Chameleon, as SSMS had not been reworked to connect to the control node.
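
    To make the replicated-versus-distributed distinction concrete, here is a sketch in the PDW-family T-SQL dialect (the same table options later carried over to APS and Azure Synapse); the table names and distribution column are illustrative assumptions:

```sql
-- A small dimension table copied to every compute node, so joins against it stay local.
CREATE TABLE dbo.DimRegion
(
    RegionID   INT,
    RegionName NVARCHAR(50)
)
WITH (DISTRIBUTION = REPLICATE);

-- A large fact table spread across compute nodes by a hash of the distribution column,
-- so each node stores and processes only its own slice of the data.
CREATE TABLE dbo.FactSales
(
    SaleID   BIGINT,
    RegionID INT,
    Amount   DECIMAL(18, 2)
)
WITH (DISTRIBUTION = HASH(SaleID));
```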

    It offered 200 times faster queries and 10 times more scalability than traditional deployments, handling up to 600 TB of data. A working use case was Information Security Consolidated Event Management (ICE), which was migrated to PDW: query performance improved to an average of fifteen to twenty times faster, SQL Server Integration Services (SSIS) data loads reached throughput of up to 285 GB/hour with minimal query performance impact, and up to 12 TB/day in throughput was supported.

    It was sold with the premise that it required less DBA maintenance and monitoring, so DBAs could spend more time architecting and not babysitting the database, thus preventing blocks, locks, or waits, and not requiring indexes, archiving, deleting, query hints, IO tuning, query optimization, partitioning, managing filegroups, or shrinking databases. Because of all the new development to integrate SQL Server with DATAllegro, not all features of SQL Server were supported in this version. It didn't include PolyBase, but it included a component that would later be critical in enabling this technology: the Data Movement Service (DMS). This was the interface between the Massively Parallel Processing (MPP) engine and the actual data in the control, compute, and landing zone nodes. It was responsible for moving data around nodes as needed, and it enabled parallel operations among the compute nodes (queries, loads, and so on).

    DMS is still used in SQL Server 2019.

    Hadoop was the leader in Big Data and IoT storage, being a distributed and cost-effective solution, and because Microsoft was going to implement its own version of Hadoop (which occurred in 2013 with Azure HDInsight), they considered supporting querying HDFS data directly from the next version of PDW. This was not possible on another Microsoft product because big data requires big processing, and PDW was the only one capable of handling such huge amounts of data.

    November 2012: PolyBase official announcement at the SQL PASS session

    It was during the PASS Community Summit in November 2012 that Microsoft announced the next version of PDW containing PolyBase.

    The first session (November 07, 2012) was presented by Ted Kummert, who was the corporate vice-president of the data platform at Microsoft at that time. He talked about how business intelligence solutions help people be better at their jobs and businesses move forward, and how Hadoop and MapReduce had matured and could be applied to a broader set of problems, like machine learning and large-scale web applications, where you had to store vast amounts of unstructured data. For him, big data was about new insights, the latent value within your current data, and adding new sources of data (any data, anywhere, any size), and the main target was business acceleration through faster time-to-insight, making it easier for every end user not familiar with technology to gain insights and for the business to operationalize them. He mentioned how they went to David DeWitt and his team and asked how to gain value out of multiple different forms of storage, processing engines, and capabilities, and how David's team unified this into the T-SQL query processor as a base to support other types of data in the future.

    Then, with the help of Christian Kleinerman, the then general manager at Microsoft, they demoed loading a file containing forum comments about Microsoft products into Hadoop HDFS. Then, in PDW v2, he created an external table, giving it a name, a schema, and the location of the Hadoop cluster. He performed a T-SQL SELECT on that table to view ten records, and then he joined it to a relational table. He then mentioned that there are moments when you have a question you know is answered by the data out there, and when you don't have that data, you realize you need business intelligence. They ended the demo with a 1 PB data warehouse query finishing in less than two seconds.

    The second session (November 08, 2012) was presented by Quentin Clark, the then corporate vice-president at Microsoft; during it, he related that, since 2010, he had been demoing the new SQL Server 2012 functionalities for the next release.

    His first use case referred to the business of running an election, where your job was to look at multiple signals (large-scale data) from multiple data sources, with information you produced as well as that coming from outside, which turned this into a big data problem and changed how a business worked.

    His second use case referred to a large hotel chain that used RFID not only for the room doors, but throughout the hotels to see who goes where, what drives people to take an action … basically observe the behavior inside the hotel and merge it with customer information to get a better profile of their guests. But they wanted to join it to social media information in order to reach into your preferences and see your interests and activities to customize and tailor your experience in the hotel.

    His third use case referred to retail chains and how to do things in real time, like changing the music playing based on who’s in the store, the demographics, and purchase history. This required interacting with their music provider service based on analytics in real-time. Furthermore, they wanted to install digital displays where their providers could advertise or provide coupons based on who was standing next to it; in his own words, that’s a very different plumbing of the economy.

    His fourth use case referred to a package shipping company, which sold information to financial companies. Although this was not their core business, it was valuable for other companies, so it could become a revenue stream.

    It was a lot of data that needed to be shaped into something understandable, something to derive value from, and provided in a way and with tools that engaged people, so they could listen to it and hear what it was trying to tell them. Only then could they collaborate, share those stories, operationalize those insights for a business process to back that up and be able
