The Day the Internet Died (Again)

Nov 20

This week the Internet broke. Again. This time Cloudflare took a break from whatever it is it does, and instead took down vast swathes of the Internet.

It was one of those days when even Downdetector was broken.

If you didn’t know the cause of the outage you couldn’t ask ChatGPT for any advice, because it was down too. Prompt engineers ground to a halt, making it impossible to explain switches to confused engineering managers.

Death by Config File

And what triggered this cataclysmic meltdown? A bad config file. Like, are you serious? It wasn’t North Korea. It wasn’t an AWS data centre suffering a nuclear strike. The cause was a rather pedestrian file.

Cloudflare’s apology was exactly as expected.

“We are sorry for the impact to our customers and to the Internet in general.”

Nice. Now do that every time you change the documentation and don’t update the actual config syntax for something that is meant to “handle threat traffic”.

Centralized for Failure

We’ve created a centralized internet where one company accidentally sneezes and the digital universe collapses like a soufflé in a wind tunnel. Cloudflare is great, but when a significant proportion of websites rely on it, it becomes a single point of failure.

Nobody worried that the Titanic had too few lifeboats for the number of people on board, because the ship was unsinkable, right? Likewise, nobody worries that Cloudflare might go down and leave us unable to resolve half the Internet 🛜. And even if you don’t use Cloudflare, your dependencies do. There is no escape. There is no backup plan. There might well be your VP asking why.

Management Confusion

The problem isn’t Cloudflare itself. This is about the way we think about software engineering and risk, and why we choose the solutions we do.

There’s an apocryphal story where a manager is presented with a number of computing solutions for a problem. They could choose a low-cost solution or an innovative one. Or they could choose a tried and tested “name”, and so the following quotation was born: “Nobody gets fired for choosing IBM”.

This lives on today. Every manager I’ve ever worked with has wanted “less complexity”. That might mean handing off the work to a third party because we don’t want to think about it. It might mean choosing a suboptimal solution because it’s “battle-hardened”.

In one company I worked for, that meant putting all changes behind a feature flag. Even a change to a single String presented to the user needed its own feature flag. There were so many feature flags that when you turned one off, the overall configuration wasn’t guaranteed to work. Adherence to the rule didn’t guarantee working software, but it made the managers happy.
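To put a rough number on why nobody could vouch for the overall configuration, here is a minimal sketch in Python. It assumes every flag is a simple on/off switch that can be toggled independently; the flag names are made up for illustration and are not from any real codebase.

```python
from itertools import product


def count_combinations(n_flags: int) -> int:
    """Number of distinct on/off configurations for n independent boolean flags."""
    return 2 ** n_flags


def flag_combinations(flags: list[str]):
    """Yield every possible on/off assignment for the given flag names."""
    for values in product([False, True], repeat=len(flags)):
        yield dict(zip(flags, values))


if __name__ == "__main__":
    # Hypothetical flag names, purely for illustration.
    flags = ["new_login_copy", "blue_button", "renamed_menu_item"]
    for config in flag_combinations(flags):
        print(config)  # 8 configurations for 3 flags

    # 20 flags already means more than a million configurations to test.
    print(count_combinations(20))  # 1048576
```

Nobody tests a million configurations. You test the one in production and hope the rest behave.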

In other areas of software development we are happy to give everything to one vendor and just hope that their interns don’t misconfigure the YAML.

The sad irony is that we use Cloudflare to stop our sites from going down due to DDoS attacks. But now we need protection from the protection.

Conclusion

This isn’t a new phenomenon. People just aren’t good at making sure the processes we have actually support a maintainable, high-quality system.

Maybe that’s just my experience though? Let me know what you think. My engagement has been low recently, but maybe that’s because Cloudflare has been down?

About The Author

Professional Software Developer “The Secret Developer” can be found on Twitter @TheSDeveloper.

The Secret Developer “went down”, sleeping 30 hours over a weekend. Nobody cared, as long as they were still available for on call.📞

The Secret Developer