The 2021 Slack Outage (Detailed analysis)

From The Backend Engineering Show with Hussein Nasser


Length: 44 minutes
Released: Jan 15, 2021
Format: Podcast episode

Description

On Jan 4th, 2021, Slack experienced a global outage that prevented customers from using the service for nearly 5 hours.
Slack has released the root cause analysis incident report, which I’m going to summarize in the first part of this video. After that, I’ll provide a lengthy deep dive into the incident, so make sure to stick around for that.
If you are new here, I make backend engineering videos and also cover software news, so make sure to like, comment, and subscribe if you would like to see more. It really helps the channel. Let’s jump into it.
This is an approximation of Slack’s architecture based on what was described in the report. Clients connect to load balancers, load balancers distribute requests to backend servers, and backend servers finally make requests to database servers, which are powered by MySQL sharded through Vitess. All of those tiers are connected by routers that cross network boundaries.
Around 6 AM on Jan 4, the routers sitting on the network boundaries between the load balancers and the backends, and between the backends and the databases, started to drop packets.
This led to the load balancers slowly marking backends as unhealthy and removing them from the fleet, which compounded the request load on the remaining servers.
The number of failed requests eventually triggered the provisioning service to start spinning up an absurdly large number of backend servers.
However, the provisioning service couldn’t keep up with the huge demand. It soon started to time out for the same networking reasons and eventually hit its limit on open file handles.
Eventually, Slack’s cloud provider increased the networking capacity, and the backend servers went back to normal around 11 AM PST.
That was a summary of the Slack outage. Now sit back, grab your favorite beverage, and let’s go through the detailed incident report!
0:00 Outage Summary
2:00 Detailed Analysis Starts
5:20 The Root Cause
30:00 Corrective Actions
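
For readers who want a feel for why removing unhealthy backends compounds the problem, here is a toy sketch in Python. It is not Slack's actual logic: the function name `simulate_cascade`, the per-backend capacity, and the rule that one overloaded backend is removed per round are all assumptions made for illustration. It only shows how spreading the same traffic over fewer servers can tip a fleet from stable to empty.

```python
def simulate_cascade(total_rps, backends, capacity_rps, initial_failures):
    """Toy model of a load balancer shedding backends under fixed traffic.

    Returns a list of (healthy_backends, load_per_backend) tuples, one per
    round, until the fleet stabilizes or no healthy backend is left.
    All parameters are hypothetical and chosen for illustration only.
    """
    healthy = backends - initial_failures
    history = []
    while healthy > 0:
        load = total_rps / healthy
        history.append((healthy, load))
        if load <= capacity_rps:
            break  # the remaining backends can absorb the traffic
        # Assumption: an overloaded backend fails its next health check,
        # so the load balancer removes one more backend each round.
        healthy -= 1
    return history


if __name__ == "__main__":
    # With some headroom, losing one backend is absorbed by the rest.
    print(simulate_cascade(1000, 10, 120, 1))
    # With little headroom, the same single failure drains the whole fleet.
    for healthy, load in simulate_cascade(1000, 10, 105, 1):
        print(f"{healthy} healthy backends, {load:.1f} requests/s each")
```

In the first call the nine remaining backends stay under their assumed capacity and the fleet settles. In the second, each removal pushes more traffic onto the survivors, so every round ends with another removal.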

Welcome to the Backend Engineering Show podcast with your host Hussein Nasser. If you like software engineering, you’ve come to the right place. I discuss all sorts of software engineering technologies and news, with a specific focus on the backend. All opinions are my own. Most of the content in the podcast is an audio version of videos I post on my YouTube channel: http://www.youtube.com/c/HusseinNasser-software-engineering
Buy me a coffee: https://www.buymeacoffee.com/hnasr
Courses I teach: https://husseinnasser.com/courses