You must have heard about the Facebook outage; everyone has, as it affected most of us. In a nutshell, the social media platforms Facebook, WhatsApp and Instagram (all owned by Facebook) went down on Monday, 4th October 2021. The cause of this roughly 6-hour outage, according to security researchers, is summarized below:
- During routine maintenance, one of Facebook's engineers made a BGP (Border Gateway Protocol) configuration error.
- The change, which was intended to assess the availability of Facebook's global backbone capacity, instead took down all the connections in Facebook's core network, effectively disconnecting all of Facebook's data centres globally.
- The damage was extensive: with all Facebook data centres and servers unreachable, the remote access and tools needed to investigate and correct the problem were also unavailable, because they depended on the same network that had gone down.
- There were also (unconfirmed) reports that Facebook personnel initially could not access office buildings to evaluate the extent of the outage because their tags would not work on the access doors.
As expected, the repercussions of the outage were vast. Many people across the world could not reach their friends and loved ones during that time, while businesses that rely on Facebook lost money while they were offline. As for Facebook, the extended downtime resulted in an estimated 100 million USD in lost revenue and a 4.9% drop in company stock (Fortune).
Now, as much as the outage was a network issue, it also holds a big disaster recovery lesson for all organisations.
Disaster recovery, in simple terms, is an organisation's ability, and the methods it uses, to regain access to and restore the functionality of IT systems after an event that has taken them offline or otherwise affected business operations. Such an event could be a cyber-attack, a pandemic such as Covid-19, or a network failure like the Facebook outage. Organisations (Facebook included) typically earmark vast sums for their disaster recovery functions to ensure that IT outages are kept to a minimum.
A few disaster recovery lessons we can take away from the Facebook outage include:
- Establish alternate communication paths and resources in case of outages: Organisations must plan for effective communication during outages. For example, Facebook's recovery may have been delayed by internal communication issues, because email, Messenger and other internal tools were down. Considering how key communication is to disaster recovery, organisations must work this out ahead of time, before an outage happens.
- Routinely evaluate different threat scenarios in your DR tests: The importance of disaster scenario simulations cannot be overstated. If Facebook had run a disaster recovery test on this BGP misconfiguration use case, the recovery time might have been significantly shorter, as its teams would have been prepared to tackle the outage head-on. Organisations must be prepared to try out new scenarios (no matter how unlikely) in their DR tests.
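To make the two lessons above a little more concrete, here is a minimal Python sketch of a tabletop DR drill. It is only an illustration: the dependency map, the system names and the helper functions `affected_by` and `run_drill` are all made up for this post and do not describe Facebook's real infrastructure. The idea is simply to pick a failure scenario, work out everything that goes down with it, and check which of your recovery tools would still be usable.

```python
from typing import Dict, List, Set

# Hypothetical dependency map (illustrative only): each system lists
# the systems it directly needs in order to work.
DEPENDENCIES: Dict[str, List[str]] = {
    "backbone_network": [],
    "internal_dns": ["backbone_network"],
    "remote_access": ["internal_dns"],
    "internal_chat": ["internal_dns"],
    "door_badge_system": ["internal_dns"],
    "out_of_band_console": [],  # assumed to sit on a separate network
    "phone_bridge": [],         # assumed to be a third-party service
}

# Hypothetical list of tools the recovery team relies on.
RECOVERY_TOOLS: List[str] = [
    "remote_access",
    "internal_chat",
    "out_of_band_console",
    "phone_bridge",
]


def affected_by(failed: str, deps: Dict[str, List[str]]) -> Set[str]:
    """Return every system that goes down, directly or transitively,
    when `failed` goes down."""
    down = {failed}
    changed = True
    while changed:  # repeat until no new system is marked as down
        changed = False
        for system, needs in deps.items():
            if system not in down and any(n in down for n in needs):
                down.add(system)
                changed = True
    return down


def run_drill(failed: str, deps: Dict[str, List[str]],
              tools: List[str]) -> Dict[str, bool]:
    """Map each recovery tool to True if it survives the scenario."""
    down = affected_by(failed, deps)
    return {tool: tool not in down for tool in tools}


if __name__ == "__main__":
    # Scenario: the backbone network is lost.
    for tool, usable in run_drill("backbone_network", DEPENDENCIES,
                                  RECOVERY_TOOLS).items():
        print(f"{tool}: {'usable' if usable else 'UNAVAILABLE'}")
```

In this made-up map, the tools that share a dependency with the failed backbone are flagged as unavailable, while the out-of-band console and the phone bridge survive. Running the same check against different scenarios is a cheap way to spot recovery tools that would fail along with the thing they are meant to fix.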
Finally, in hindsight, it is easy to sit and point out what Facebook (with all its resources) may have missed in its disaster recovery efforts. What matters more is that we take the lessons from Facebook's experience and use them to make sure our own organisations are prepared for disasters.