The Damage of Downtime
There has recently been a prominent example of how damaging a serious IT outage can be. The hours-long interruption in service that Facebook (along with its other platforms, Instagram and WhatsApp) suffered made news around the world. It cut off social networks: friends, relatives, lovers and businesses. Only Twitter saw the funny side.
The root cause is still the subject of some speculation and we have no information on that, beyond what’s been published on the Internet. What was clear, however, was how disruptive and damaging an outage can be, however it was caused. Facebook became the news as its share price fell almost 6%, leaving Mark Zuckerberg an estimated $7 billion out of pocket. Now that’s a sizeable amount, but the price has already partly rebounded; so, he’s unlikely to starve!
The prevailing theory is that the outage was caused by a remote administrator updating the BGP routing configuration. The change withdrew the old routes before the new ones took effect, and because the update was being applied remotely, the administrators lost their own path into the network mid-change. As a result, Facebook’s application servers and DNS hosts became unreachable, and those working remotely couldn’t connect in to fix it. Reportedly, someone who knew what they were doing had to get to the site physically and reconfigure the routers to bring the environment back up.
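The lockout dynamic can be illustrated with a toy model (the names and structure here are hypothetical, not Facebook’s real infrastructure): if a naive “withdraw old routes, then announce new ones” update removes the very route the remote management session depends on, the second step never happens.

```python
class Router:
    """Toy router holding a set of advertised prefixes (illustrative only)."""

    def __init__(self, routes):
        self.routes = set(routes)

    def reachable(self, prefix):
        return prefix in self.routes

    def apply_config_remotely(self, new_routes):
        # Naive "withdraw old, then announce new" update. The withdrawal
        # removes the management route the remote session itself depends on,
        # so the new configuration is never applied.
        self.routes.clear()
        if not self.reachable("management"):
            raise ConnectionError("management session lost: no route back")
        self.routes = set(new_routes)


router = Router({"management", "dns", "app-servers"})

try:
    router.apply_config_remotely({"management", "dns", "app-servers"})
except ConnectionError as e:
    print(e)  # the remote operator is now locked out

# Everything behind the router stays unreachable until someone on site
# restores the configuration by hand.
print(router.reachable("dns"))         # False
print(router.reachable("management"))  # False
```

The safer pattern, of course, is an atomic or staged change with an automatic rollback if the management path is lost; the sketch simply shows why “remove then add” is dangerous when the change severs its own control channel.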
Setting aside the vulnerability of IT systems to human error, and the difficulties and fragilities of routing configurations and DNS, what can the rest of us learn from the disruption caused by the outage of such critical social infrastructure?
What can we learn?
A worst case scenario for many businesses, not just Facebook, is a complete loss of service. Facebook’s business model is totally reliant on online access and the Internet. Many other businesses don’t consider themselves to be as exposed to that kind of failure, but the reality is that in a digital world even a small outage can have a hugely disruptive effect.
This can be caused by misconfiguration or human error (as was perhaps the case for Facebook), an oversight, a physical failure or a deliberate act. The cause, as always, is much easier to pinpoint after the fact.
We have seen similar implications in non-IT businesses too – oil pipeline operators, food manufacturers and healthcare providers whose businesses have suffered major outages as a result of ransomware attacks. Their reliance on IT, even though they trade in the physical world, meant that services and their delivery were similarly affected. This shows that no company can afford an IT outage – no matter how it is caused. Network misconfiguration is just one cause of failure; ransomware is another, and in recent times it has become more common than the kind of calamity we saw in the social media world last week.
What the Facebook event shows is not how to avoid downtime, outages and blackouts. Instead, it shows how small episodes that can seem almost trivial can give rise to enormous consequences.
You can’t avoid all risks. Whether it’s a network administrator changing routes or a user opening a malicious email attachment, people make mistakes. If, as the meteorologist Edward Lorenz famously suggested, a butterfly flapping its wings can result in a tornado, it’s important that early signs of risk are acknowledged as part of your risk management process.
We can learn about the risks of changing BGP configurations from Facebook; or when it comes to ransomware, learn how to reduce the risk of becoming infected. In both instances, however, effective mitigation strategies that prevent a risk or contain its impact are key to lessening the potential effect across an entire enterprise.
Back-ups mean so much
A backup router configuration might have helped Facebook (had it been easily accessible). Although, to be fair, massive online businesses like Facebook typically have huge backup data centres available to provide resilience and mitigation against catastrophic events.
For many other failure scenarios, however, backups are an important part of a Plan B. Loss or corruption of data can render even a fully working, internet-connected server inoperative. In the event of hardware failures, ransomware, theft, deliberate misuse or vandalism, it’s often the presence or absence of backups that makes the biggest difference.
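A backup is only a Plan B if it can actually be restored. One common safeguard is verifying each backup copy against the original before you ever need it; a minimal sketch (the file names are hypothetical):

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path):
    """SHA-256 hex digest of a file, read in chunks to handle large files."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_backup(original, backup):
    """True only if the backup is a byte-for-byte copy of the original."""
    return sha256_of(original) == sha256_of(backup)

with tempfile.TemporaryDirectory() as tmp:
    live = Path(tmp, "data.db")      # hypothetical live file
    bak = Path(tmp, "data.db.bak")   # hypothetical backup copy

    live.write_bytes(b"critical records")
    bak.write_bytes(b"critical records")
    print(verify_backup(live, bak))  # True: backup matches

    bak.write_bytes(b"critical recorde")  # silent corruption
    print(verify_backup(live, bak))  # False: caught before it's needed
```

Real backup tooling adds versioning, off-site copies and restore drills, but even this simple checksum habit catches the silently corrupted backup before the day it is relied upon.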
In some ransomware attacks, where the decryption process has been absent, unworkable or too slow, backups have provided the road to recovery. Colonial Pipeline found that, and so did Maersk when it was hit by NotPetya. Maersk only managed to get its systems back because of a single domain controller, located in a remote Ghanaian office and unaffected by the broader network outage. Incredibly, it was this sole surviving copy of the user and system Active Directory (which was ultimately flown back to head office) that enabled the recreation of Maersk’s Windows domain.
We’ve seen lots of significant systems outages in the past, resulting from numerous causes, and Facebook is just the most recent high-profile “victim”. We also know that such disruptive events can stem from something as small as a butterfly flapping its wings.
Effective risk management means dealing with these risks and, where they can be foreseen, having controls in place. Every company can learn something about network support and administration from the Facebook experience, and in the same way every company can learn something about ransomware from Colonial Pipeline and about the importance of backups from Maersk.
You do have to sweat the small stuff!