Facebook continues to share details about what exactly caused the six-hour outage that took down Facebook, Messenger, Instagram, and WhatsApp on Monday. In a new blog post, the company dives into the technical specifics, saying the outage was triggered by a mistake during one of its many “routine maintenance jobs.”
Facebook published its first recap of the outage late Monday evening, attributing it to a single mistake that had a “cascading effect” on communication between its data centers, ultimately “bringing our services to a halt.”
Facebook says that while it has systems in place to audit commands that could take down its entire network, “a bug in that audit tool didn’t properly stop” this command.
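Facebook hasn’t published how that audit tool works, but conceptually it acts as a pre-flight gate that inspects a command before it touches the backbone. The sketch below is purely illustrative: the `audit_command` check, the keyword list, and the scope rule are hypothetical stand-ins, not Facebook’s actual tooling.

```python
# Purely illustrative sketch of a pre-flight "audit" gate for network commands.
# Nothing here reflects Facebook's real tooling; names and rules are hypothetical.

DANGEROUS_KEYWORDS = ("withdraw", "shutdown", "drain")  # hypothetical blast-radius markers

def audit_command(command: str, scope: str) -> bool:
    """Return True if the command is safe to run, False if it should be blocked.

    The intent is to block any disruptive command whose scope is global;
    a bug in this check would let a backbone-wide command through.
    """
    is_disruptive = any(word in command for word in DANGEROUS_KEYWORDS)
    if is_disruptive and scope == "global":
        return False  # block: this could take down the whole backbone
    return True

def run(command: str, scope: str) -> None:
    if not audit_command(command, scope):
        print(f"BLOCKED: {command!r} (scope={scope})")
        return
    print(f"EXECUTING: {command!r} (scope={scope})")

# A capacity-assessment command limited to one region passes the audit;
# the same command issued globally should be stopped by the gate.
run("drain links for capacity assessment", scope="region:us-east")
run("drain links for capacity assessment", scope="global")
```

In Facebook’s case, the real equivalent of that blocking branch existed but had a bug, so the global command went through.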
The data traffic between all these computing facilities is managed by routers, which figure out where to send all the incoming and outgoing data. And in the extensive day-to-day work of maintaining this infrastructure, our engineers often need to take part of the backbone offline for maintenance — perhaps repairing a fiber line, adding more capacity, or updating the software on the router itself.
This was the source of yesterday’s outage. During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally. Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool didn’t properly stop the command.
This change caused a complete disconnection of our server connections between our data centers and the internet. And that total loss of connection caused a second issue that made things worse.
One of the jobs performed by our smaller facilities is to respond to DNS queries. DNS is the address book of the internet, enabling the simple web names we type into browsers to be translated into specific server IP addresses. Those translation queries are answered by our authoritative name servers that occupy well known IP addresses themselves, which in turn are advertised to the rest of the internet via another protocol called the border gateway protocol (BGP).
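As a concrete illustration of that translation step (not of Facebook’s internal systems), a resolver lookup from any machine shows names being mapped to IP addresses; this is the step that failed for the outside world during the outage. The hostnames and output below are examples only.

```python
import socket

# DNS turns a human-readable name into the IP addresses clients actually connect to.
# When Facebook's authoritative name servers became unreachable, lookups like these
# simply failed, even though the servers behind those addresses were still running.
for hostname in ("facebook.com", "www.instagram.com"):
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        addresses = sorted({info[4][0] for info in infos})
        print(f"{hostname} -> {addresses}")
    except socket.gaierror as err:
        # This is roughly what the rest of the internet saw during the outage.
        print(f"{hostname} -> resolution failed ({err})")
```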
To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves can not speak to our data centers, since this is an indication of an unhealthy network connection. In the recent outage the entire backbone was removed from operation, making these locations declare themselves unhealthy and withdraw those BGP advertisements. The end result was that our DNS servers became unreachable even though they were still operational. This made it impossible for the rest of the internet to find our servers.
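Facebook hasn’t shared the code behind that behavior, but the logic it describes amounts to a health check that withdraws the DNS servers’ BGP routes when the data centers stop responding. Below is a minimal sketch of that idea, with hypothetical `announce_route`/`withdraw_route` functions standing in for a real BGP daemon and placeholder endpoints and prefixes throughout.

```python
import socket

# Hypothetical sketch of the "withdraw BGP routes when the backbone looks unhealthy"
# behavior Facebook describes. The announce/withdraw functions are placeholders for
# whatever real BGP daemon an edge DNS node would actually drive.

DATA_CENTER_ENDPOINTS = [("dc1.internal.example", 53), ("dc2.internal.example", 53)]  # placeholders
DNS_SERVER_PREFIX = "192.0.2.0/24"  # documentation prefix used as a stand-in

def backbone_is_healthy(endpoints, timeout=2.0) -> bool:
    """Consider the backbone healthy if at least one data center answers a TCP connect."""
    for host, port in endpoints:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            continue
    return False

def announce_route(prefix: str) -> None:
    print(f"BGP: announcing {prefix}")   # placeholder for a real BGP session update

def withdraw_route(prefix: str) -> None:
    print(f"BGP: withdrawing {prefix}")  # placeholder for a real BGP session update

def reconcile() -> None:
    # If the edge DNS node cannot reach any data center, it stops advertising itself.
    if backbone_is_healthy(DATA_CENTER_ENDPOINTS):
        announce_route(DNS_SERVER_PREFIX)
    else:
        withdraw_route(DNS_SERVER_PREFIX)

reconcile()
```

The failure mode Facebook describes is essentially that withdraw branch firing everywhere at once: the logic worked as designed, but with the entire backbone down it removed every path to DNS servers that were otherwise still operational.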
Once all of Facebook’s platforms were down, the company’s ability to troubleshoot was hampered because the outage also knocked out its internal tools. As a result, Facebook sent engineers to its data centers to get physical access to the hardware. Even that took time, because “the hardware and routers are designed to be difficult to modify even when you have physical access to them.”
In this instance, Facebook says the work it has done to harden its systems’ security slowed its recovery from the outage, but it considers that a tradeoff worth making:
We’ve done extensive work hardening our systems to prevent unauthorized access, and it was interesting to see how that hardening slowed us down as we tried to recover from an outage caused not by malicious activity, but an error of our own making. I believe a tradeoff like this is worth it — greatly increased day-to-day security vs. a slower recovery from a hopefully rare event like this.
Facebook says that it has already started an “extensive review process to understand how we can make our systems more resilient.”