Google BigQuery Analytics
Ebook · 922 pages · 9 hours

About this ebook

How to effectively use BigQuery, avoid common mistakes, and execute sophisticated queries against large datasets

Google BigQuery Analytics is the perfect guide for business and data analysts who want the latest tips on running complex queries and writing code to communicate with the BigQuery API. The book uses real-world examples to demonstrate current best practices and techniques, and also explains and demonstrates streaming ingestion, transformation via Hadoop in Google Compute Engine, AppEngine datastore integration, and using GViz with Tableau to generate charts of query results. In addition to the mechanics of BigQuery, the book also covers the architecture of the underlying Dremel query engine, providing a thorough understanding that leads to better query results.

  • Features a companion website that includes all code and data sets from the book
  • Uses real-world examples to explain everything analysts need to know to effectively use BigQuery
  • Includes web application examples coded in Python
Language: English
Publisher: Wiley
Release date: May 21, 2014
ISBN: 9781118824795
Author

Jordan Tigani

Jordan is engineering director for the core BigQuery team. He was one of the founding engineers on BigQuery, and helped grow it to be one of the most successful products in Google’s Cloud Platform. He wrote the first book on BigQuery, and has also spoken widely on the subject. Jordan has twenty years of software development experience, ranging from Microsoft Research to Machine Learning startups.

    Book preview

    Google BigQuery Analytics - Jordan Tigani

    Part I

    BigQuery Fundamentals

    In This Part

    Chapter 1: The Story of Big Data at Google

    Chapter 2: BigQuery Fundamentals

    Chapter 3: Getting Started with BigQuery

    Chapter 4: Understanding the BigQuery Object Model

    Chapter 1

    The Story of Big Data at Google

    Since its founding in 1998, Google has grown by multiple orders of magnitude in several different dimensions—how many queries it handles, the size of the search index, the amount of user data it stores, the number of services it provides, and the number of users who rely on those services. From a hardware perspective, the Google Search engine has gone from a server sitting under a desk in a lab at Stanford to hundreds of thousands of servers located in dozens of datacenters around the world.

    The traditional approach to scaling (outside of Google) has been to scale the hardware up as the demands on it grow. Instead of running your database on a small blade server, run it on a Big Iron machine with 64 processors and a terabyte of RAM. Instead of relying on inexpensive disks, the traditional scaling path moves critical data to costly network-attached storage (NAS).

    There are some problems with the scale-up approach, however:

    Scaled-up machines are expensive. If you need one that has twice the processing power, it might cost you five times as much.

    Scaled-up machines are single points of failure. You might need to get more than one expensive server in case of a catastrophic problem, and each one usually ends up being built with so many backup and redundant pieces that you're paying for a lot more hardware than you actually need.

    Scale up has limits. At some point, you lose the ability to add more processors or RAM; you've bought the most expensive and fastest machine that is made (or that you can afford), and it still might not be fast enough.

    Scale up doesn't protect you against software failures. If you have a Big Iron server that has a kernel bug, that machine will crash just as easily (and as hard) as your Windows laptop.

    Google, from an early point in time, rejected scale-up architectures. It didn't, however, do this because it saw the limitations more clearly or because it was smarter than everyone else. It rejected scale-up because it was trying to save money. If the hardware vendor quotes you $1 million for the server you need, you could buy 200 $5,000 machines instead. Google engineers thought, "Surely there is a way we could put those 200 servers to work so that the next time we need to increase the size, we just need to buy a few more cheap machines, rather than upgrade to the $5 million server." Their solution was to scale out, rather than scale up.

    Big Data Stack 1.0

    Between 2000 and 2004, armed with a few principles, Google laid the foundation for its Big Data strategy:

    Anything can fail, at any time, so write your software expecting unreliable hardware. At most companies, when a database server crashes, it is a serious event. If a network switch dies, it will probably cause downtime. By running in an environment in which individual components fail often, you paradoxically end up with a much more stable system because your software is designed to handle those failures. You can quantify your risk beyond blindly quoting statistics, such as mean time between failures (MTBFs) or service-level agreements (SLAs).

    Use only commodity, off-the-shelf components. This has a number of advantages: You don't get locked into a particular vendor's feature set; you can always find replacements; and you don't experience big price discontinuities when you upgrade to the bigger version.

    The cost for twice the amount of capacity should not be considerably more than the cost for twice the amount of hardware. This means the software must be built to scale out, rather than up. However, this also imposes limits on the types of operations that you can do. For instance, if you scale out your database, it may be difficult to do a JOIN operation, since you'd need to join data together that lives on different machines.

    A foolish consistency is the hobgoblin of little minds. If you abandon the C (consistency) in ACID database operations, it becomes much easier to parallelize operations. This has a cost, however; loss of consistency means that programmers have to handle cases in which reading data they just wrote might return a stale (inconsistent) copy. This means you need smart programmers.

    These principles, along with a cost-saving necessity, inspired new computation architectures. Over a short period of time, Google produced three technologies that inspired the Big Data revolution:

    Google File System (GFS): A distributed, cluster-based filesystem. GFS assumes that any disk can fail, so data is stored in multiple locations, which means that data is still available even when a disk that it was stored on crashes.

    MapReduce: A computing paradigm that divides problems into easily parallelizable pieces and orchestrates running them across a cluster of machines.

    Bigtable: A forerunner of the NoSQL database, Bigtable enables structured storage to scale out to multiple servers. Bigtable is also replicated, so failure of any particular tablet server doesn't cause data loss.

    What's more, Google published papers on these technologies, which enabled others to emulate them outside of Google. Doug Cutting and other open source contributors integrated the concepts into a tool called Hadoop. Although Hadoop is considered to be primarily a MapReduce implementation, it also incorporates GFS and Bigtable clones, which are called HDFS and HBase, respectively.

    Armed with these three technologies, Google replaced nearly all the off-the-shelf software usually used to run a business. It didn't need (with a couple of exceptions) a traditional SQL database; it didn't need an e-mail server because its Gmail service was built on top of these technologies.

    Big Data Stack 2.0 (and Beyond)

    The three technologies—GFS, MapReduce, and Bigtable—made it possible for Google to scale out its infrastructure. However, they didn't make it easy. Over the next few years, a number of problems emerged:

    MapReduce is hard. It can be difficult to set up and difficult to decompose your problem into Map and Reduce phases. If you need multiple MapReduce rounds (which is common for many real-world problems), you face the issue of how to deal with state in between phases and how to deal with partial failures without having to restart the whole thing.

    MapReduce can be slow. If you want to ask questions of your data, you have to wait minutes or hours to get the answers. Moreover, you have to write custom C++ or Java code each time you want to change the question that you're asking.

    GFS, while improving the durability of data (since it is replicated multiple times), can suffer from reduced availability because the metadata server is a single point of failure.

    Bigtable has problems in a multidatacenter environment. Most services run in multiple locations; Bigtable replication between datacenters is only eventually consistent (meaning that data that gets written out will show up everywhere, but not immediately). Individual services spend a lot of redundant effort babysitting the replication process.

    Programmers (even Google programmers) have a really difficult time dealing with eventual consistency. This same problem occurred when Intel engineers tried improving CPU performance by relaxing the memory model to be eventually consistent; it caused lots of subtle bugs because the hardware stopped working the way people's mental model of it operated.

    Over the next several years, Google built a number of additional infrastructure components that refined the ideas from the 1.0 stack:

    Colossus: A distributed filesystem that works around many of the limitations in GFS. Unlike many of the other technologies used at Google, Colossus' architecture hasn't been publicly disclosed in research papers.

    Megastore: A geographically replicated, consistent NoSQL-type datastore. Megastore uses the Paxos algorithm to ensure consistent reads and writes. This means that if you write data in one datacenter, it is immediately available in all other datacenters.

    Spanner: A globally replicated datastore that can handle data locality constraints, like "This data is allowed to reside only in European datacenters." Spanner managed to solve the problem of global time ordering in a geographically distributed system by using atomic clocks to guarantee synchronization to within a known bound.

    FlumeJava: A system that allows you to write idiomatic Java code that runs over collections of Big Data. Flume operations get compiled and optimized to run as a series of MapReduce operations. This solves the ease of setup, ease of writing, and ease of handling multiple MapReduce problems previously mentioned.

    Dremel: A distributed SQL query engine that can perform complex queries over data stored on Colossus, GFS, or elsewhere.

    The version 2.0 stack, built piecemeal on top of the version 1.0 stack (Megastore is built on top of Bigtable, for instance), addresses many of the drawbacks of the previous version. For instance, Megastore allows services to write from any datacenter and know that other readers will read the most up-to-date version. Spanner, in many ways, is a successor to Megastore, which adds automatic planet-scale replication and data provenance protection.

    On the data processing side, batch processing and interactive analyses were separated into two tools based on usage models: Flume and Dremel. Flume enables users to easily chain together MapReduces and provides a simpler programming model to perform batch operations over Big Data. Dremel, on the other hand, makes it easy to ask questions about Big Data because you can now run a SQL query over terabytes of data and get results back in a few seconds. Dremel is the query engine that powers BigQuery; its architecture is discussed in detail in Chapter 9, Understanding Query Execution.

    An interesting consequence of the version 2.0 stack is that it explicitly rejects the notion that in order to use Big Data you need to solve your problems in fundamentally different ways than you're used to. While MapReduce required you to think about your computation in terms of Map and Reduce phases, FlumeJava allows you to write code that looks like you are operating over normal Java collections. Bigtable replication required abandoning consistent writes, but Megastore adds a consistent coordination layer on top. And while Bigtable had improved scalability by disallowing queries, Dremel retrofits a traditional SQL query interface onto Big Data structured storage.

    There are still rough edges around many of the Big Data 2.0 technologies: things that you expect to be able to do but can't, things that are slow but seem like they should be fast, and cases where they hold onto awkward abstractions. However, as time goes on, the trend seems to be towards smoothing those rough edges and making operation over Big Data as seamless as over smaller data.

    Open Source Stack

    Many of the technologies at Google have been publicly described in research papers, which were picked up by the Open Source community and re-implemented as open source versions. When the open source Big Data options were in their infancy, they more or less followed Google's lead. Hadoop was designed to be very similar to the architecture described in the MapReduce paper, and the Hadoop subprojects HDFS and HBase are close to GFS and Bigtable.

    However, as the value of scale-out systems began to increase (and as problems with traditional scale-up solutions became more apparent), the Open Source Big Data stack diverged significantly. A lot of effort has been put into making Hadoop faster; people use technologies such as Hive and Pig to query their data; and numerous NoSQL datastores have sprung up, such as CouchDB, MongoDB, Cassandra, and others.

    On the interactive query front, there are a number of options, several of them open source:

    Cloudera's Impala is an open source parallel execution engine similar to Dremel. It allows you to query data inside HDFS and Hive without extracting it.

    Amazon.com's Redshift is a fork of PostgreSQL which has been modified to scale out across multiple machines. Unlike Impala, Redshift is a hosted service, so it is managed in the cloud by Amazon.com.

    Drill is an Apache incubator project that aims to be for Dremel what Hadoop was for MapReduce; Drill fills in the gaps of the Dremel paper to provide a similar open source version.

    Facebook's Presto is a distributed SQL query engine that is similar to Impala.

    The days when Google held the clear advantage in innovation in the Big Data space are over. Now, we're in an exciting time of robust competition among different Big Data tools, technologies, and abstractions.

    Google Cloud Platform

    Google has released many of its internal infrastructure components to the public under the aegis of the Google Cloud Platform. Google's public cloud consists of a number of components, providing a complete Big Data ecosystem. It is likely that in the coming months and years there will be additional entries, so just because a tool or service isn't mentioned here doesn't mean that it doesn't exist. Chapter 2, BigQuery Fundamentals, goes into more detail about the individual components, but this is a quick survey of the offerings. You can divide the cloud offerings into three portions: processing, storage, and analytics.

    Cloud Processing

    The cloud processing components enable you to run arbitrary computations over your data:

    Google Compute Engine (GCE): The base of Google's Cloud Platform, GCE is infrastructure-as-a-service, plain and simple. If you have software you just want to run in the cloud on a Linux virtual machine, GCE enables you to do so. GCE can also do live migration of your service so that when the datacenter it is running in is turned down for maintenance, your service won't notice a hiccup.

    AppEngine: AppEngine is a higher-level service than GCE. You don't need to worry about OS images or networking configurations. You just write the code you actually want running in your service and deploy it; AppEngine handles the rest.

    Cloud Storage

    These cloud storage components enable you to store your own data in Google's cloud:

    Google Cloud Storage (GCS): GCS enables you to store arbitrary data in the cloud. It has two APIs: one that is compatible with Amazon.com's S3 and another REST API that is similar to other Google APIs.

    DataStore: A NoSQL key-value store. DataStore is usually used from AppEngine, but its REST API enables you to store and look up data from anywhere.

    BigQuery (Storage API): BigQuery enables you to store structured rows and columns of data. You can ingest data directly through the REST API, or you can import data from GCS.

    Cloud Analytics

    Google's cloud analytics services enable you to extract meaning from your data:

    Cloud SQL: A hosted MySQL instance in the cloud

    Prediction API: Enables you to train machine learning models and apply them to your data

    Cloud Hadoop: Packages Hadoop and makes it easy to run on Google Compute Engine

    BigQuery: Enables you to run SQL statements over your structured data

    If you find that something is missing from Google's Cloud Platform, you always have the option of running your favorite open source software stack on Google Compute Engine. For example, the Google Cloud Hadoop package is one way of running Hadoop, but if you want to run a different version of Hadoop than is supported, you can always run Hadoop directly; Google's Hadoop package uses only publicly available interfaces.

    Problem Statement

    Before we go on to talk about BigQuery, here's a bit of background information about the problems that BigQuery was developed to solve.

    What Is Big Data?

    There are a lot of different definitions from experts about what it means to have Big Data; many of these definitions conceal a boast like, "Only a petabyte? I've forgotten how to count that low!" This book uses the term Big Data to mean more data than you can process sequentially in the amount of time you're willing to spend waiting for it. Put another way, Big Data just means more data than you can easily handle using traditional tools such as relational databases without spending a lot of money on specialized hardware.

    This definition is deliberately fuzzy; to put some numbers behind it, we'll say a hundred million rows of structured data or a hundred gigabytes of unstructured data. You can fit data of that size on a commodity disk and even use MySQL on it. However, dealing with data that size isn't going to be pleasant. If you need to write a tool to clean the data, you're going to spend hours running it, and you need to be careful about memory usage, and so on. And as the data size gets bigger, the amount of pain you'll experience doing simple things such as backing it up or changing the schema will get exponentially worse.

    Why Big Data?

    Many people are surprised at how easy it is to acquire Big Data; they assume that you need to be a giant company like Wal-Mart or IBM for Big Data to be relevant. However, Big Data is easy to accumulate. Following are some of the ways to get Big Data without being a Fortune 500 company:

    Over time: If you produce a million records a day, that might not be Big Data. But in 3 years, you'll have a billion records; at some point you may find that you either need to throw out old data or figure out a new way to process the data that you have.

    Viral scaling: On the Internet, no one knows you're a small company. If your website becomes popular, you can get a million users overnight. If you track 10 actions from a million users a day, you're talking about a billion actions a quarter. Can you mine that data well enough to be able to improve your service and get to the 10 million user mark?

    Projected growth: Okay, maybe you have only small data now, but after you sign customer X, you'll instantly end up increasing by another 2 orders of magnitude. You need to plan for that growth now to make sure you can handle it.

    Architectural limitations: If you need to do intense computation over your data, the threshold for Big Data can get smaller. For example, if you need to run an unsupervised clustering algorithm over your data, you may find that even a few million data points become difficult to handle without sampling.

    Why Do You Need New Ways to Process Big Data?

    A typical hard disk can read on the order of 100 MB per second. If you want to ask questions of your data and your data is in the terabyte range, you either need thousands of disks or you are going to spend a lot of time waiting.

    As anyone who has spent time tuning a relational database can attest, there is a lot of black magic involved in getting queries to run quickly on your-favorite-database. You may need to add indexes, stripe data across disks, put the transaction log on its own spindle, and so on. However, as your data grows, at some point it gets harder and harder to make your queries perform well. In addition, the more work you do, the more you end up specializing the schema for the type of questions you typically ask of your data.

    What if you want to ask a question you've never asked before? If you are relying on a heavily tuned schema, or if you're running different queries than the database was tuned for, you may not get answers in a reasonable amount of time or without bogging down your production database. In these cases, your options are limited; you either need to run an extremely slow query (that may degrade performance for your entire database), or you could export the data and process it in an external system like Hadoop.

    Often, to get queries to run quickly, people sample their data—they keep only 10 percent of user impressions, for example. But what happens if you want to explore the data in a way that requires access to all the impressions? Maybe you want to compute the number of distinct users that visited your site—if you drop 90 percent of your data, you can't just multiply the remaining users by 10 to get the number of distinct users in the original dataset. This point is somewhat subtle, but if you drop 90 percent of your data, you might still have records representing 99 percent of your users, or you might have records representing only 5 percent of your users; you can't tell unless you use a more sophisticated way to filter your data.
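
    To make the distinct-count pitfall concrete, here is a minimal sketch in BigQuery's SQL dialect; the table and column names (logs.impressions_sample, views, user_id) are hypothetical, and the sample table is assumed to hold a random 10 percent of the impressions:

    -- Totals scale up from a 10 percent sample: multiply the sum by 10.
    SELECT SUM(views) * 10 AS estimated_total_views
    FROM [logs.impressions_sample]

    -- Distinct counts do not: a heavy user almost surely appears in the sample
    -- while a one-time visitor is usually dropped, so multiplying this result
    -- by 10 can badly over- or undercount the true number of distinct users.
    SELECT COUNT(DISTINCT user_id) AS sampled_distinct_users
    FROM [logs.impressions_sample]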

    How Can You Read a Terabyte in a Second?

    If you want to ask interactive questions of your Big Data, you must process all your data within a few seconds. That means you need to read hundreds of gigabytes per second—and ideally more.

    Following are three ways that you can achieve this type of data rate:

    Skip a lot of the data. This is a good option if you know in advance the types of questions you're going to ask. You can pre-aggregate the data or create indexes on the columns that you need to access. However, if you want to ask different questions, or ask them in a different way, you may not be able to avoid reading everything.

    Buy some really expensive hardware. For a few million dollars or so, you can get an onsite machine that comes with its own dedicated support person and lets you query over your terabytes of data.

    Run in parallel. Instead of reading from one disk, read from thousands of disks. Instead of one database server, read from hundreds.

    If you use custom hardware (solution #2) and you want it to go faster, you need to buy an even bigger data warehouse server (and hope you can sell the old one). And if you rely on skipping data (solution #1) to give you performance, the only way to go faster is to be smarter about what data you skip (which doesn't scale).

    BigQuery, and most Big Data tools, take approach #3. Although it may sound expensive to have thousands of disks and servers, the advantage is that you get exactly what you pay for; that is, if you need to run twice as fast, you can buy twice as many disks. If you use BigQuery, you don't need to buy your own disks; you get a chance to buy small slices of time on a massive amount of hardware.

    What about MapReduce?

    A large proportion of the Big Data hype has been directed toward MapReduce and Hadoop, its Open Source incarnation. Hadoop is a fantastic tool that enables you to break up your analysis problem into pieces that run in parallel. The Hadoop Distributed File System (HDFS) can enable you to read in parallel from a lot of disks, which allows you to perform operations over Big Data orders of magnitude more quickly than if you had to read that data sequentially.

    However, Hadoop specifically and MapReduce in general have some architectural drawbacks that make them unsuited for interactive-style analyses. That is, if you want to ask questions of your data using MapReduce, you're probably going to want to get a cup of coffee (or go out to lunch) while you wait. Interactive analyses should give you answers before you get bored or forget why you were asking in the first place. Newer systems, such as Cloudera's Impala, allow interactive queries over your Hadoop data, but they do so by abandoning the MapReduce paradigm. Chapter 9 discusses the architecture in more detail and shows why MapReduce is better suited to batch workloads than interactive ones.

    How Can You Ask Questions of Your Big Data and Quickly Get Answers?

    Google BigQuery is a tool that enables you to run SQL queries over your Big Data. It fans out query requests to thousands of servers, reads from tens or hundreds of thousands of disks at once, and can return answers to complex questions within seconds. This book describes how BigQuery can achieve such good performance and how you can use it to run queries on your own data.

    Summary

    This chapter briefly documented the history of Google's Big Data systems and provided a survey of scale-out technologies, both at Google and elsewhere. It set the stage for BigQuery by describing an unfulfilled Big Data analytics niche. This chapter deliberately didn't mention BigQuery very much, however; Chapter 2 should answer all your questions about what BigQuery is and what it can do.

    Chapter 2

    BigQuery Fundamentals

    This chapter introduces you to BigQuery, describing what it can do, when to use it, when not to use it, and even a bit about why it is so fast. Because you may be evaluating whether BigQuery is the right tool for you, this chapter spends a bit of time comparing it to other systems and other architectures. It discusses the performance and types of workloads best-suited to BigQuery, and also how BigQuery fits with other offerings in the Google Cloud Platform.

    This chapter concludes with an introduction to an AppEngine and Android App that is used as the basis of many of the examples throughout the book. This sample app demonstrates several ways that BigQuery can be integrated into an application—from log collection and analyses to dashboard development and correlation of multiple data streams.

    What Is BigQuery?

    BigQuery, like many tools, started with a problem. Google engineers were having a hard time keeping up with the growth of their data. The number of Gmail users is in the hundreds of millions; by 2012, there were more than 100 billion Google searches done every month. Trying to make sense of all this data was a time-consuming and frustrating experience.

    Google is a hugely data-driven company. Decisions ranging from café menus to interview strategies to marketing campaigns are made by analyzing data. If you have a great idea but you don't have data to back it up, you're going to have a hard time convincing anyone to implement your suggestion. However, if you have data on your side that says people click more ads with a particular shade of blue background, your shade of blue will likely become the new official standard.

    As Google grew exponentially, the amount of data available also grew exponentially. Despite spending a lot of money on hardware and software for relational databases, it was often difficult to ask simple questions of the data. Despite having invented MapReduce to help analyze large datasets, it was still difficult to get answers interactively, without waiting minutes or hours for a long batch job to complete.

    The data problem led to the development of an internal tool called Dremel, which enabled Google employees to run extremely fast SQL queries on large datasets. According to Armando Fox, a professor of computer science at the University of California at Berkeley, "If you told me beforehand what Dremel claims to do, I wouldn't have believed you could build it." Dremel has become extremely popular at Google; Google engineers use it millions of times a day for tasks ranging from building sales dashboards to datacenter temperature analyses to computing employees' percentile rank of how long they've worked at the company.

    In 2012, at Google I/O, Google publicly launched BigQuery, which allowed users outside of Google to take advantage of the power and performance of Dremel. Since then, BigQuery has expanded to become not just a query engine but a hosted, managed cloud-based structured storage provider. The following sections describe the main aspects of BigQuery.

    SQL Queries over Big Data

    The primary function of BigQuery is to enable interactive analytic queries over Big Data. Although Big Data is a fuzzy term, in practice it just means data that is big enough that you have to worry about how big it is. Sometimes the data might be small now, but you anticipate it growing by orders of magnitude later. Sometimes the data might be only a few megabytes, but your algorithms to process it don't scale well. Or sometimes you have a million hard drives full of customer data in a basement.

    BigQuery tries to tackle Big Data problems by attempting to be scale-invariant. That is, whether you have a hundred rows in your table or a hundred billion, the mechanism to work with them should be the same. Although some variance in execution time is expected between running a query over a megabyte and running the same query over a terabyte, the latter shouldn't be a million times slower than the former. If you start using BigQuery when you are receiving 1,000 customer records a day, you won't hit a brick wall when you scale up to 1 billion customer records a day.

    BigQuery SQL

    The lingua franca for data analyses is the SQL query language. Other systems, such as Hadoop, enable you to write code in your favorite language to perform analytics, but these languages make it difficult to interactively ask questions of your data. If you have to write a Java program to query your data, you'll end up spending a lot of time compiling, debugging, and uploading your program, rather than figuring out what data you need.

    Despite being somewhat intimidating at first, SQL is also easy to use for nonprogrammers. Many software engineers are surprised when someone from marketing comes up with a sophisticated query to figure out why sales are slumping. However, it is actually quite common for non- or semi-technical people to be SQL wizards.

    Oddly enough, the ones who often have the most difficulty with SQL are the programmers themselves. SQL is a declarative language; that is, you declare what results you want, and it is up to the software to figure out how to get those results. For programmers, this reverses the natural order; we're used to telling the computer exactly what we want it to do so that it gives us the results that we want. SQL leaves the method of execution up to the underlying query engine. This turns out to be advantageous for BigQuery because it allows the Dremel query engine to perform the analysis in a different way from traditional relational databases.

    It can be surprising that a model we often have trouble understanding would be accessible to people in other disciplines. But after seeing sales, marketing, and even pointy-haired managers wielding RIGHT OUTER JOINs, we grudgingly have to admit that people who aren't programmers are still quite intelligent.

    BigQuery uses an admittedly nonstandard dialect of SQL. Speaking for all the engineers who currently work on or ever have worked on BigQuery or Dremel, if we could go back and change one thing, it probably would be to stick to something closer to standard SQL. If there was one prediction we could make about a breaking change in the future, it would be that BigQuery would deprecate some of the nonstandard quirks, such as a comma for table union, in favor of more standard SQL. That said, if such a change was made, there would be lots of advance warning, and the old dialect would continue to work for a long time after the new dialect was released.
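
    As a hedged illustration of one such quirk (the table names here are hypothetical), the legacy dialect treats a comma in the FROM clause as a union of tables rather than a join:

    -- In BigQuery's legacy SQL the comma below means "combine the rows of both
    -- tables," which is the opposite of standard SQL, where a comma in the FROM
    -- clause implies a cross join.
    SELECT title, SUM(views) AS views
    FROM [logs.pageviews_2013], [logs.pageviews_2014]
    GROUP BY title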

    How Fast Is BigQuery?

    One of the main limitations of database query performance is the sequential nature of most query execution. Although most databases can make use of multiple processors, they often use their available parallelism to run multiple queries at once, rather than taking advantage of multiple processors for a single query. That said, even if they did parallelize single query execution, the database would still be limited by disk I/O speeds—if your data is stored on a single disk, reading the disk from multiple places in parallel may actually be slower than reading it sequentially.

    The SQL query language is highly parallelizable, however, as long as you have a way to take advantage of it. The Dremel query engine created a way to parallelize SQL execution across thousands of machines. Chapter 9, Understanding Query Execution, describes in detail how it works, but the central principle is that it is a scale-out solution. If you want your queries to run faster, you can throw more machines at the problem. This is a contrast to a traditional scale-up architecture, where when you want more performance, you buy fancier hardware.

    When run in the Google infrastructure, the Dremel architecture scales nearly linearly to tens of thousands of processor cores and hundreds of thousands of disks. The performance goal of the system was to process a terabyte of data in a second; although peak performance numbers have not been published, those goals have been met and exceeded.

    Of course, this doesn't mean that you'll automatically see performance in that range; the Dremel clusters used by BigQuery are tuned for serving multiple queries at once rather than single queries at peak speed. A rough estimate for performance you can expect is on the order of 50 GB per second for a simple query. More complex queries—JOINs, complex regular expressions, and so on—will be somewhat slower. That said, 95 percent of all queries in the public BigQuery clusters finish in less than 5 seconds. However, unless you reserve capacity, you may find that performance fluctuates significantly due to load on the system.

    BigQuery Reserved Capacity

    BigQuery offers the ability to reserve processing capacity in a dedicated virtual cluster in units of 5 GB processed per second. This might sound strange, since we just said that the rough goal is to process 50 GB per second for on-demand. Does that mean reserved capacity charges you more for less performance?

    There are a couple of things to note with respect to reservations:

    Reserved capacity gives you the ability to run queries that preempt other users, up to your capacity limit.

    Reserved capacity gives you the optional ability to "burst" over your capacity rate. This means your queries can use the pool of on-demand resources in addition to reserved resources.

    On-demand (non-reserved) capacity is best-effort only. Performance may vary significantly from day to day, even from query to query, based on load of the overall system. Reservations give you the ability to be first in line for resources and to expect more stable performance.

    Query performance should continue to scale sublinearly (that is, if you double the size, it will take less than double the time) up to at least 500 GB of data processed in the query. So if you have a 100 MB table that takes 3 seconds to query and you increase the size a thousand times to 100 GB, it might take only 5 seconds to query. Increasing the size of the table will allow BigQuery to use more hardware to run the query.

    There is a limit to the number of execution nodes that will be assigned to any one query, however. Based on current cluster sizing, that limit comes at approximately one-half a terabyte of data processed. If you start with a 1 TB table that you can query in 20 seconds and double it to 2 TB, your queries will now likely take 40 seconds. Note that the relevant size here just includes the fields that are touched. If you have 100 fields but just read one of them, the effective size is just the size of that single field.
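
    A minimal sketch of this effect, using a hypothetical wide table sales.transactions:

    -- Reads only the bytes of the single column it references, so the
    -- effective query size is the size of the amount column, not the table.
    SELECT SUM(amount) FROM [sales.transactions]

    -- References every column, so the effective size is the entire table;
    -- the LIMIT does not reduce the amount of data scanned.
    SELECT * FROM [sales.transactions] LIMIT 10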

    There isn't actually a hard maximum table size you can process in BigQuery, other than saying that after a certain point, querying the tables may take longer than you're willing to wait. Multiterabyte queries are fairly common; multipetabyte queries are not.

    Performance Benchmarks

    Google doesn't publish benchmarks against BigQuery because when a company publishes its own benchmarks, people have a tendency to not believe them. In addition, if Google published performance numbers, it could be construed as a promise that users will see similar numbers. Because there seemed to be a lack of available performance information, we decided to run a simple benchmark of our own. Figure 2.1 shows a graph of how query execution time varies with the number of rows processed in a table for two different queries.

    Figure 2.1 Query execution time versus table size

    The benchmark used a real dataset: daily page views for Wikipedia in 2011. Each run of the benchmark used a sampled set of rows from the underlying dataset and increased by roughly an order of magnitude in size. The smallest table was 1,192 rows; the largest was more than 10 billion rows. The tables are in a publicly available dataset: bigquery-samples:wikipedia_benchmark. You should be able to reproduce similar results on your own (although you should note that several of the tables are large, and it can be easy to run up a serious bill by querying them).

    The lower line on the chart corresponds to timings for a simple query that does a regular expression over the title field, groups by the language field, and sorts by the number of views. Here is the query that was used (where the benchmark table queried ranged in size from 1k to 10B rows):

    SELECT language, SUM(views) AS views
    FROM [bigquery-samples:wikipedia_benchmark.<table>]
    WHERE REGEXP_MATCH(title, 'G.*o.*o.*g')
    GROUP BY language
    ORDER BY views DESC

    We used a reasonably complex query because we didn't want it to be something that could be done with a simple filter or index, and we wanted to make sure the query engine wouldn't be able to use any tricks to skip data.

    From the chart, you can see that each order of magnitude increase in the table size roughly corresponds to an additional 1 second of query time (the x-axis is plotted in log scale). The final point on the right corresponds to a 10 billion-row table (10,677,046,566 rows, to be exact). Querying this table scans almost one-half a terabyte of data (446 GB).

    This query would be extremely slow on a relational database; there isn't a way to precompute the regular expression results, so a relational database would have to do a table scan. In BigQuery, the query over the largest table took less than 8 seconds, on average, to read half a terabyte and perform 10 billion regular expressions.

    We mentioned that there is a size after which you'll start seeing linear performance. To show this, we added another table, this one with more than 100 billion rows, and re-ran the same query. This query processed 4.4 TB and took 69 seconds on average, which is about ten times as long as the query with one tenth the data took. While we didn't go further than the 100 billion row table in our tests, there are a number of customers who routinely query over tens or hundreds of terabytes at a time.

    You should not assume that all queries will run this quickly, however. JOIN queries or queries that produce a lot of results may run much more slowly. Some queries will run out of memory; others may hit other limits in the system. For example, in order to test JOIN performance, we ran the following self-join against the same tables from the previous query:

    SELECT wiki1.year, wiki1.month, wiki1.day, SUM(wiki2.max_views)
    FROM [bigquery-samples:wikipedia_benchmark.<table>] AS wiki1
    JOIN EACH (
        SELECT title, MAX(views) AS max_views
        FROM [bigquery-samples:wikipedia_benchmark.<table>]
        GROUP EACH BY title
        ) AS wiki2
    ON wiki1.title = wiki2.title
    GROUP EACH BY wiki1.year, wiki1.month, wiki1.day

    The timing of these queries can be seen as the Self Join line in Figure 2.1. They take longer than the simple queries, but still increase slowly until about 100 million rows. Increasing from 100 million to a billion rows takes about double the time—still faster than linear but a significant slowdown. You can see, however, that the line ends at the 1 billion row point; this is because the query against the larger 10 billion row table failed with an Insufficient Resources error. Chapter 9 gives much more information about which queries will work well, which won't, and why. That chapter also provides some pointers for what to do when you hit errors like this one.

    Cloud Storage System

    In addition to being a way to run queries over your data, BigQuery is also a place to store your structured data in the cloud. Although this aspect of BigQuery grew out of necessity—if your data didn't live in Google's cloud then, you couldn't query it—it has grown into a significant and useful subsystem.

    Your data is replicated to multiple geographically distinct locations for improved availability and durability. If a Google datacenter in Atlanta gets shut down because of a hurricane, that shouldn't cause a hiccup in your ability to access your data. Data is also replicated within a cluster, so your data should be virtually immune to data loss due to hardware failure. Of course, the BigQuery service may not have perfect uptime, and if your data is important, you should make sure it is backed up. You can back up your tables by exporting them to Google Cloud Storage for safekeeping, or you can run a table copy job in BigQuery to save a snapshot.

    Data Ingestion

    Data used in BigQuery must be loaded into the system before it can be queried. The load process transforms your data into a format that is optimized for querying and stores it in locations in physical proximity to the Dremel compute clusters.

    There are three ways to get your data into BigQuery: streaming, direct upload, and through Google Cloud Storage. The most reliable and predictable is likely the last of these. If your data is already in Google Cloud Storage, the load step is merely a transfer between two systems already within Google's cloud, so ingestion is very fast.

    Direct upload can be an easier route if you don't want to go through Google Cloud Storage, because you can follow a standard resumable-upload HTTP protocol. Streaming is the easiest method; you can post individual rows, which will be available for query immediately. That said, for large load operations, or cases in which you want all your data to be available atomically, streaming may not be the best mechanism. For more information about how to get data into BigQuery, Chapter 6, Loading Data, describes the various options in detail.

    Structured Data Storage

    BigQuery is a system that stores and operates on structured data; that is, data that follows a rigid schema. A spreadsheet is an example of structured data, as is a database table. An HTML document, even though it may have predictable fields, is unstructured. If your data doesn't have a schema, or can't be coerced to a schema, there may be other tools that are better-suited for your use case.

    BigQuery schemas describe the columns, or fields, of the structured data. Each field has a name and a data type that indicates the kind of data that can be stored in the field. Those data types can be either primitive or record types. Primitive types are basic types that store a single value—a string, a floating-point number, an integer, or a boolean flag.

    A record type, however, is a collection of other fields. For the most part, a record is just a way of grouping your fields together. For example, if you store location as latitude and longitude, you could have a location record with two fields: lat and long. Fields can also be repeated, which means that they can store more than one value.

    These last two features—record types and repeated fields—distinguish BigQuery from most relational databases, which can store only flat rows. Records and repeated fields enable you to store the data in a more natural way than you might in a relational database. For example, if your table contains customer orders, you might want to store an entire order as a single record, even though there were multiple items in the order. This makes it easier to perform analysis of the orders without having to flatten the data or normalize it into multiple tables.
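
    As a hedged sketch (the orders table and its fields are hypothetical), the legacy BigQuery dialect can aggregate over a repeated record in place with scoped aggregation, so each order stays a single row:

    -- items is assumed to be a repeated record with sku and quantity leaves.
    -- WITHIN RECORD aggregates over the repeated values inside each order row.
    SELECT order_id, SUM(items.quantity) WITHIN RECORD AS total_items
    FROM [store.orders]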

    Collections of rows of data following a single schema are organized into tables. These tables are similar to tables in a typical relational database but have some restrictions. The only way to modify BigQuery tables is to append to them or rewrite them—there is no way to update individual rows. BigQuery also doesn't support table modification queries, such as ALTER TABLE, DROP TABLE, or UPDATE.

    Collections of tables with similar access restrictions are organized into datasets. Many relational database systems allow you to have multiple database catalogs. For instance, if you have a MySQL database with your financial data, you might want that to be a separate catalog from your user data.

    These catalogs map quite well to a BigQuery dataset. Datasets can be shared with other users and groups. Collections of datasets owned by a single user or organization are organized into projects. Projects are a Google Cloud Platform concept that indicates a single billing entity with a team of users in various roles. Chapter 4, Understanding the BigQuery Object Model, discusses these abstractions in much more detail.

    Distributed Cloud Computing

    Google has a lot of hardware in its datacenters. A number of people have tried to figure out just how many machines Google has by taking into account things like global PC sales and maximum power capacity of various known Google datacenters. It is a large number. Very few, if any, organizations can match the scale and, as importantly, the organization of Google's datacenters. Google's Cloud Platform allows people outside of Google to take advantage of this scale and manageability.

    Harnessing the Power of Google's Cloud

    When you run your queries via BigQuery, you put a giant cluster of machines to work for you. Although the BigQuery clusters represent only a small fraction of Google's global fleet, each query cluster is measured in the thousands of cores. When BigQuery needs to grow, there are plenty of resources that can be harnessed to meet the demand.

    If you want to, you could probably figure out the size of one of BigQuery's compute clusters by carefully controlling the size of data being scanned in your queries. The number of processor cores involved is in the thousands, the number of disks in the hundreds of thousands. Most organizations don't have the budget to build at that kind of scale just to run some queries over their data.

    The benefits of the Google cloud go beyond the amount of hardware that is used, however. A massive datacenter is useless unless you can keep it running. If you have a cluster of 100,000 disks, some reasonable number of those disks is going to fail every day. If you have thousands of servers, some of the power supplies are going to die every day. Even if you have highly reliable software running on those servers, some of them are going to crash every day.

    To keep a datacenter up and running requires a lot of expertise and know-how. How do you maximize the life of a disk? How do you know exactly which parts are failing? How do you know which crashes are due to hardware failures and which to software? Moreover, you need software that is written to handle failures at any time and in any combination. Running in Google's cloud means that Google worries about these things so that you don't have to.

    There is another key factor to the performance of Google's cloud that some of the early adopters of Google Compute Engine have started to notice: It has an extremely fast network. Parallel computation requires a lot of coordination and aggregation, and if you spend all your time moving the data around, it doesn't matter how fast your algorithms are or how much hardware you have. The details of how Google achieves these network speeds are shrouded in secrecy, but the super-fast machine-to-machine transfer rates are key to making BigQuery fast.

    Cloud Data Warehousing

    Most companies are accustomed to storing their data on-premise or in leased datacenters on hardware that they own or rent. Fault tolerance is usually handled by adding redundancy within a machine, such as extra power supplies, RAID disk controllers, and ECC memory. All these things add to the cost of the machine but don't actually distance you from the consequences of a hardware failure. If a disk goes bad, someone has to go to the datacenter, find the rack with the bad disk, and swap it out for a new one.

    Cloud data warehousing offers the promise of relieving you of the responsibility of caring about whether RAID-5 is good enough, whether your tape backups are running frequently enough, or whether a natural disaster might take you offline completely. Cloud data warehouses, whether Google's or a competitor's, offer fault-tolerance, geographic distribution, and automated backups.

    Ever since Google made the decision to go with exclusively scale-out architectures, it has focused on making its software accustomed to handling frequent hardware failures. There are stories about Google teams running mission-critical components who don't even bother to free memory—the number of bugs and performance problems associated with memory management is too high. Instead, they just let the process run out of memory and crash, at which time it will get automatically restarted. Because the software has been designed to not only handle but also expect that type of failure, a large class of errors is virtually eliminated.

    For the user of Google's cloud, this means that the underlying infrastructure pieces are extraordinarily failure-resistant and fault-tolerant. Your data is replicated to several disks within a datacenter and then replicated again to multiple datacenters. Failure of a disk, a switch, a load balancer, or a rack won't be noticeable to anyone except a datacenter technician. The only kind of hardware failure that would escalate to the BigQuery operations engineers would be if someone hit the big red off button in a datacenter or if somebody took out a fiber backbone with a backhoe. This type of failure still wouldn't take BigQuery down, however, since BigQuery runs in multiple geographically distributed datacenters and will fail over automatically.

    Of course, this is where we have to remind you that all software is fallible. Just because your data is replicated nine ways doesn't mean that it is completely immune to loss. A buggy software release could cause data to be inadvertently deleted from all nine of those disks. If you have critical data, make sure to back it up.

    Many organizations are understandably reluctant to move their data into the cloud. It can be difficult to have your data in a place where you don't control it. If there is data loss, or an outage, all you can do is take your business elsewhere—there is no one except support staff to yell at and little you can do to prevent the problem from happening in the future.

    That said, the specialized knowledge and operational overhead required to run your own hardware is large and gets only larger. The advantages of scale that Google or Amazon has only get bigger as they get better at managing their datacenters and improving their data warehousing techniques. It seems likely that the days when most companies run their own IT hardware are numbered.

    Multitenancy and Parallel Execution

    When you run a query on MySQL that takes one second, you
