The mistake led to all of Facebook’s services being inaccessible, with one analogy likening it to a failure in the “air traffic control” services for network traffic …
We reported yesterday on the massive failure:
It’s not just you: Facebook, Instagram, and WhatsApp are all currently down for users around the world. We’re seeing error messages on all three services across iOS applications as well as on the web. Users are being greeted with error messages such as: “Sorry, something went wrong,” “5xx Server Error,” and more.
The outage is affecting every Facebook-owned platform, according to data on Downdetector and Twitter. This includes Instagram, Facebook, WhatsApp, and Facebook Messenger […] While some Facebook, Instagram, and WhatsApp outages only affect certain geographic regions, the services are down worldwide today.
It gradually appeared that the problem might relate to DNS – the domain name system, whose servers tell devices which IP addresses to use to access services – but it was unclear what exactly had happened, and whether this was an external hack, malicious action by an insider, or a catastrophic mistake.
Facebook has now admitted in a blog post that it was a mistake.
Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt.
It took a long time to resolve the problem because the inaccessible systems included the servers and tools engineers would normally use to solve the problem remotely. Reports suggest that lower-level employees had to gain physical access to the data centers, and then rely on step-by-step instructions from more senior engineers in order to undo the mistake. Complicating this, the networks being unavailable meant that Facebook’s door access systems were also offline, physically preventing access.
How to understand the Facebook outage
We’ll doubtless get the full story in time, but the consensus view emerging is that the problem was some mix of domain name system (DNS) and border gateway protocol (BGP) configuration.
The best analogy I’ve seen is to think of network traffic as being like planes. Your device wants to fly to facebook.com. Your plane first needs to know the GPS coordinates of the destination airport, that is, the IP address it should connect to. It gets that information by asking a DNS server, which tells it that facebook.com is located at (for example) 22.214.171.124.
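As a rough illustration of that first step, the sketch below asks whichever DNS resolver your operating system is configured to use for the addresses behind a hostname. It uses Python's standard socket module; the helper name lookup_addresses is made up for this example, and the addresses returned will vary by location and time.

```python
import socket


def lookup_addresses(hostname):
    """Return the IP addresses the system resolver reports for hostname.

    lookup_addresses is a hypothetical helper for this sketch. An empty
    list means the name could not be resolved.
    """
    try:
        results = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # The resolver could not answer, e.g. the name servers are unreachable.
        return []

    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in results:
        ip = sockaddr[0]
        if ip not in addresses:  # drop duplicates, keep resolver order
            addresses.append(ip)
    return addresses


if __name__ == "__main__":
    # Needs network access; the output depends on your resolver.
    print(lookup_addresses("facebook.com"))
```

When the name servers for a domain cannot be reached, a lookup like this would be expected to come back empty, which is the "no coordinates" situation in the plane analogy.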
But getting to the final destination – the actual server that can perform the task you want to do – relies on a kind of air traffic control system for network traffic, and that’s BGP. BGP tells the networks carrying your traffic which route to fly through the various routers en route to your final destination.
It appears that Facebook completely lost its BGP systems – so there was no way for Facebook to tell devices how to reach their destination. And that included Facebook’s own engineers reaching the systems they needed to undo the mistake.
Additionally, an informed source suggests that there was no problem with Facebook’s DNS per se; it was rather that the loss of BGP meant there was no way to reach the company’s domain name servers.
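To make that distinction concrete, here is a deliberately simplified toy model, not a description of Facebook's actual setup. A set of announced address ranges stands in for the routes BGP advertises, and a dictionary stands in for the DNS records. Every name and address in it (ToyInternet, resolve, social.example, and the 192.0.2.x addresses, which come from a range reserved for documentation) is an assumption made for illustration.

```python
import ipaddress


class ToyInternet:
    """Toy stand-in for the global routing table: a set of announced prefixes."""

    def __init__(self):
        self.announced = set()

    def announce(self, prefix):
        self.announced.add(ipaddress.ip_network(prefix))

    def withdraw(self, prefix):
        self.announced.discard(ipaddress.ip_network(prefix))

    def reachable(self, address):
        """An address is reachable only if some announced prefix covers it."""
        addr = ipaddress.ip_address(address)
        return any(addr in net for net in self.announced)


def resolve(internet, name, nameserver_ip, records):
    """Return the IP for name, or None if the name server cannot be reached."""
    if not internet.reachable(nameserver_ip):
        return None  # the records are intact, but nobody can ask for them
    return records.get(name)


# Hypothetical example data, using a documentation-only address range.
NAMESERVER_IP = "192.0.2.53"
records = {"social.example": "192.0.2.80"}

internet = ToyInternet()
internet.announce("192.0.2.0/24")
before = resolve(internet, "social.example", NAMESERVER_IP, records)

internet.withdraw("192.0.2.0/24")  # the routes disappear
after = resolve(internet, "social.example", NAMESERVER_IP, records)

print(before, after)
```

In this sketch the DNS records never change. The lookup fails after the withdrawal only because no announced route covers the name server any more, which is the situation the informed source describes.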
The outage has huge implications
If this were just people being unable to post cat videos for a few hours, that would be one thing (though, come on, what is life without cat videos?). But WhatsApp is effectively a critical piece of communications infrastructure in many countries, routinely used for communication between patients and doctors, for example, and used by many for payments.
The extended outage has drawn attention to how vulnerable the entire world is to failures of this nature.
For example, millions of people rely on Google DNS servers to reach every server on the planet. Imagine those servers going down for an extended period. That wouldn’t just affect consumers, it would disrupt commerce and critical infrastructure. Factory production, fleet transport, retail… the works.
The whole world is critically dependent on a relatively small number of servers, all of which could be taken offline by a mistake of the kind that happened here. A lot of thought needs to be put into how we prevent a far more significant internet outage in the future.