On January 15, 1990, a routine software update was deployed across AT&T's 4ESS switching network. A single misplaced break statement in a C recovery routine caused a cascading failure: each switch that restarted sent a message that crashed its neighbours, rippling across all 114 nodes and knocking out 75 million calls in nine hours.1
That was 35 years ago, but the underlying problem has only gotten worse: telecom networks remain vast, interconnected systems where a small change can have outsized consequences. Decades of mergers have left operators with overlapping BSS/OSS platforms, billing engines built on different Java versions, and customer portals that nobody wants to touch.
1 Communications of the ACM, 1993