The 2020 Google Outage (Detailed Analysis)

From The Backend Engineering Show with Hussein Nasser

Length: 52 minutes
Released: Dec 20, 2020
Format: Podcast episode

Description

0:00 Intro
1:00 Summary of the Outage
4:00 Detailed Analysis of the Incident Report
On Dec 14, 2020, Google suffered a global outage that lasted 45 minutes, during which nobody could access most Google services.
Google has released a detailed incident report discussing the outage, what caused it, technical details of their internal service architecture, and what they did to mitigate it and prevent it from happening again in the future.
In this video, I want to take a few minutes to summarize the report and then go into a detailed analysis. You can use the YouTube chapters to jump to the interesting parts of the video. Pick your favorite drink, sit back, relax, and enjoy. Let's get started.
Let's start with an overview of how the Google ID service works. The client connects to the Google authentication service to get authenticated or to retrieve account information.
The account information is stored in a distributed manner across the ID service nodes for redundancy.
When an update is made to an account on the leader node, the existing data in all nodes is marked as outdated. This is done for security reasons: say you updated your credit card info, made your profile private, or deleted a comment; it is extremely dangerous to serve that outdated information. This was the key to the outage.
The updated account is then replicated to the other nodes using the Paxos consensus protocol.
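To make that flow concrete, here is a minimal sketch of the write path as described above. This is not Google's actual implementation; the class names, the stale flag, and the stubbed-out replication step are assumptions for illustration only.

```python
# Illustrative sketch of the ID service write path described above.
# Class and method names are invented; the real service replicates
# account data with Paxos, which is stubbed out here as a copy loop.

class Replica:
    def __init__(self):
        self.accounts = {}
        self.stale = False   # outdated data must never be served

class IdService:
    def __init__(self, replica_count=5):
        self.leader = Replica()
        self.followers = [Replica() for _ in range(replica_count - 1)]

    def update_account(self, user_id, new_info):
        # 1. Mark the existing copies as outdated first, so stale data
        #    (an old credit card number, a profile that is no longer
        #    public) is never served while the update is in flight.
        for node in [self.leader] + self.followers:
            node.stale = True

        # 2. Apply the write on the leader. This step needs free storage
        #    quota; when the quota hit zero, this write failed and every
        #    node stayed marked as outdated.
        self.leader.accounts[user_id] = new_info
        self.leader.stale = False

        # 3. Replicate to the followers (stand-in for the Paxos round).
        for node in self.followers:
            node.accounts[user_id] = new_info
            node.stale = False

    def read_account(self, user_id):
        # Refuse to serve outdated data; fail the request instead.
        if self.leader.stale:
            raise RuntimeError("account data is outdated; refusing to serve")
        return self.leader.accounts.get(user_id)
```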
The user ID service has a storage quota controlled by an automated quota management system. As the storage usage of the service changes, the quota is adjusted accordingly, either reduced or increased based on demand.
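Conceptually, that quota manager behaves something like the sketch below. The function names, the headroom factor, and the grace-period check are hypothetical; they are only meant to mirror the behavior described in the incident report.

```python
import time

HEADROOM = 1.2  # assumed: keep ~20% slack above the reported usage

def adjust_quota(service, now=None):
    """Resize a registered service's storage quota to track the usage it
    reports, growing or shrinking with demand (hypothetical sketch)."""
    now = time.time() if now is None else now
    reported_usage = service.read_reported_usage()
    target_quota = int(reported_usage * HEADROOM)

    if target_quota >= service.quota:
        service.quota = target_quota        # demand grew: raise the quota
    elif now < service.enforcement_deadline:
        return                              # still inside the grace period
    else:
        service.quota = target_quota        # demand shrank: reclaim space
```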
So what exactly happened that caused the outage?
In October 2020, Google migrated their quota management to a new system and registered the ID service with the new system.
However, some parts of the old system remained hooked up, specifically the parts responsible for reading the service's usage. And because the service was now registered with the new system, the old system reported 0 usage for it, which from its point of view was correct. So when the new quota management system asked for the service's usage, it was incorrectly told 0.
Nothing happened for a while since there was a grace period, but that period expired in December.
That's when the new quota system kicked in, saw the ID service with 0 usage, and started reducing the quota for the ID service; you are not using it, so why waste it?
The quota kept shrinking until the service had no storage space left.
This caused updates to the leader node to fail, which caused the data in all nodes to go out of date, which in turn escalated globally into the outage we all saw.
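Putting the pieces together, the failure chain can be reconstructed roughly as in the sketch below. The numbers, the names, and the exact ordering of the stale-marking are simplifications on my part, not details taken from the report.

```python
# Simplified reconstruction of the chain of events on Dec 14, 2020: the
# old usage reader knows nothing about the migrated service, the quota
# manager shrinks its quota to zero, the next write to the leader fails,
# the data stays marked as outdated, and reads start erroring globally.

class IdServiceState:
    def __init__(self):
        self.quota_bytes = 10_000_000   # illustrative numbers
        self.used_bytes = 7_500_000
        self.stale = False

def old_system_reported_usage(service):
    # The bug: the service is registered with the NEW system, so the old
    # usage reader has no record of it and, correctly from its own point
    # of view, reports zero.
    return 0

def quota_manager_tick(service):
    # After the grace period, the quota follows the reported usage down.
    service.quota_bytes = old_system_reported_usage(service)

def write_update(service):
    service.stale = True                # existing copies marked outdated first
    if service.used_bytes >= service.quota_bytes:
        raise RuntimeError("quota exhausted: write to leader failed")
    service.stale = False               # cleared only if the write lands

def read_account(service):
    if service.stale:
        raise RuntimeError("data outdated: refusing to serve account info")
    return "account info"

service = IdServiceState()
quota_manager_tick(service)             # quota drops to 0
try:
    write_update(service)               # fails; all copies now look outdated
except RuntimeError as err:
    print(err)
try:
    read_account(service)               # every authenticated request now errors
except RuntimeError as err:
    print(err)
```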
Resource
https://status.cloud.google.com/incident/zall/20013

Welcome to the Backend Engineering Show podcast with your host Hussein Nasser. If you like software engineering, you've come to the right place. I discuss all sorts of software engineering technologies and news with a specific focus on the backend. All opinions are my own. Most of my content in the podcast is an audio version of videos I post on my YouTube channel here http://www.youtube.com/c/HusseinNasser-software-engineering Buy me a coffee https://www.buymeacoffee.com/hnasr Courses I Teach https://husseinnasser.com/courses