Anyone taking DevOps to heart should read about Normal Accidents. The book is about the failure modes of nuclear power plants; highly automated and extensively instrumented as those plants are, they still manage to fail in spite of everything we do. The lessons carry well into the highly automated environments we try to build in our distributed systems.
There are several key lessons to take from this book and its theory:
- Root cause can be something seemingly completely unrelated to the actual problem.
- Contributing causes can sneak in and turn what would have been a well-handled event into something that gets you bad press.
- Monitoring instrumentation failures can be sneaky contributing causes.
- Single-failure events are easily handled, and may be invisible.
- Multiple-failure events are much harder to handle.
- Multiple-failure events can take months to show up if the individual failures happened over the course of months and were invisible.
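The multiple-failure point is partly a numbers game. As a back-of-the-envelope sketch (the component count here is arbitrary, picked only for illustration), the number of distinct failure *pairs* grows quadratically with component count, and triples grow cubically, so exhaustively testing combinations stops being feasible almost immediately:

```python
# How many failure combinations exist for a system of N components?
# Single failures are testable; pairs and triples quickly are not.
from math import comb

components = 200
singles = comb(components, 1)   # 200: each component failing alone
pairs = comb(components, 2)     # 19900: two latent failures overlapping
triples = comb(components, 3)   # 1313400: beyond any realistic test plan
```

Which is why single-failure testing passes while the pair that eventually bites you was never exercised.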
The book had a failure mode much like this one:
After analysis, it was known that the flow direction of a specific coolant pipe was a critical item. If backflow occurred, hot fluid could enter areas not designed to handle it. As a result, a system was put in place to monitor flow direction, with automation to close a valve on the pipe if backflow was detected.
Analysis of the entire system after a major event revealed that the flow sensor had correctly identified backflow and had triggered the valve-close automation. However, the valve had frozen open due to corrosion several months before the event, and the actuator had broken when the solenoid tried to close the valve. As a result, the valve was reported closed, and showed as closed on the operator panel, when in fact it was open.
- The valve had been subjected to manual examination 9 months before the event, and was due to be checked again in 3 more months. However, it had failed between checks.
- The actuator system was checked monthly and had passed every check. The actuator breakage happened during one of these monthly checks.
- The sensor on the actuator monitored the actuator's power draw. If the valve was frozen, the actuator should have shown an above-normal current draw. However, because the actuator arm had broken off the valve, it drew below-normal current, which was not treated as an alarm condition.
- The breaking of the actuator arm was noted in the maintenance report during the monthly check as a "brief flicker of the lamp" and written off as a 'blip'; the arm broke before the current meter could register an over-current event. As the system passed later tests, the event was disregarded.
- The backflow sensor actually installed was not directional. It alarmed on zero-flow, not negative-flow.
The remediation list that came out of the post-event analysis:
- Instrument the valve itself for open/close state.
- Introduce new logic so that if the backflow sensor continues to detect backflow, raise alarms.
- Replace the backflow sensor with a directional one as originally called for.
- Add a new flow sensor behind the valve.
- Change the alerting on the actuator sensor to also alarm on below-normal current draw.
- Increase the frequency of visual inspection of the physical plant.
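As a sketch of how several of those remediations compose into alarm logic — every sensor name, unit, and threshold here is invented, since the book describes the plant, not its software:

```python
# Hypothetical post-remediation alarm logic: a directional flow sensor,
# escalation on persistent backflow, a cross-check of reported vs. directly
# instrumented valve state, and current-draw limits in both directions.
def check_coolant_loop(flow_lpm, valve_reported_closed, valve_actually_closed,
                       actuator_amps, backflow_seconds):
    alarms = []
    # Directional sensor: negative flow is backflow, distinct from zero flow.
    if flow_lpm < 0:
        alarms.append("BACKFLOW")
        # Escalate if backflow persists after the valve should have closed.
        if backflow_seconds > 30:
            alarms.append("BACKFLOW_PERSISTS_AFTER_CLOSE")
    # Cross-check: reported state vs. the valve's own instrumentation.
    if valve_reported_closed != valve_actually_closed:
        alarms.append("VALVE_STATE_MISMATCH")
    # A disconnected arm draws *less* current than normal, so alarm on
    # too-low draw as well as too-high (bounds are invented).
    if not 2.0 <= actuator_amps <= 6.0:
        alarms.append("ACTUATOR_CURRENT_OUT_OF_RANGE")
    return alarms
```

The point of the cross-checks is that each one catches a failure mode the original single-sensor design made invisible.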
That valve being open caused Fun Times To Be Had. If that valve system had been operating correctly, the fault that caused the backflow would have been isolated as the system designers intended and the overall damage contained. However, this contributing cause, one that happened months before the triggering event, turned a minor problem into a major one.
So, why did that reactor release radioactive materials into the environment? Well, it's complicated...
And yet, after reading the post-mortem report you look at what actually failed and think, 'and these are the jokers running our nuclear power plants? We're lucky we're not all glowing in the dark!'
We get the same kind of fault-trees in massively automated distributed systems. Take this entirely fictional, but oh-so-plausible failure cascade:
ExampleCorp was notified by their datacenter provider of the need for emergency power maintenance in their primary datacenter. ExampleCorp (EC) operated a backup datacenter and had implemented a hot failover method, tested twice a year, for moving production to the backup facility. EC elected to perform a hot failover to the backup facility prior to the power work in their primary facility.
Shortly after the failover completed, the backup facility crashed hard. Automation attempted to fail back to the primary facility, but technicians there had already begun, though not yet completed, safe-shutdown procedures. As a result, the fail-back was interrupted partway through, and production stopped cold.
Service recovery happened at the primary site after the power maintenance completed. However, the cold-start script was more than a year out of date, so restoration was hampered by differences that surfaced during the startup process.
Analysis after the fact isolated several causes of the extensive downtime:
- In the time since the last hot-failover test, EC had deployed a new three-node management cluster for its network switch configuration and software management system, one cluster per site.
- The EC-built DNS synchronization script used to keep the backup and primary sites in sync was transaction-oriented. A network fault five weeks before the event meant the transactions for the cluster deployment's DNS update were dropped, and nobody noticed.
- The old three-node clusters were kept online "just in case".
- The difference in cluster software versions between the two sites was displayed in EC's monitoring panel, but was not alarmed on, and was disregarded as a 'glitch' by Operations. Interviews showed that Ops staff were aware the monitoring system will sometimes hold onto stale data if it isn't part of an alarm.
- At the time of the cluster migration Operations was testing a new switch firmware image. The image on the old cluster was determined to have a critical loading bug, which required attention from the switch vendor.
- Two weeks prior to the event, EC performed an update of switch firmware using new code that passed validation. The new firmware was replicated to all cluster members in both sites using automation that resolved the cluster members' IP addresses from DNS. The old cluster members were not updated.
- The automation driving the switch firmware update relied on the non-synchronized DNS entries, and reported no problems applying updates. The primary site got the known-good firmware, the backup site got the known-bad firmware.
- The hot-failover network load triggered the fault in the backup site's switch firmware, causing switches to reboot every 5 minutes.
- Recovery logic in the application attempted to work around the massive network faults and ended up duplicating some database transactions, and losing others. Some corrupted data was transferred to the primary site before it was fully shut down.
- Lack of technical personnel physically at the backup site hampered recovery from the backup site and extended the outage.
- Out-of-date documentation hampered efforts to restart services from a cold stop.
- The inconsistent state of the databases further delayed recovery.
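The DNS-sync cause is the classic argument for reconciling full state rather than only replaying transactions: if a transaction is silently lost, the divergence still shows up as detectable drift the next time the record sets are diffed. A minimal sketch, with invented record data (this is not EC's actual script, which the story describes only in outline):

```python
# Hypothetical drift detector: periodically diff the full DNS record sets
# at both sites, so a dropped sync transaction surfaces as drift regardless
# of which update went missing.
def dns_drift(primary_records, backup_records):
    """Return records that differ between sites, keyed by record name."""
    drift = {}
    for name in set(primary_records) | set(backup_records):
        if primary_records.get(name) != backup_records.get(name):
            drift[name] = (primary_records.get(name), backup_records.get(name))
    return drift

primary = {"cluster-mgmt": "10.0.1.10", "db": "10.0.2.5"}
backup = {"cluster-mgmt": "10.0.1.99", "db": "10.0.2.5"}  # lost update
assert dns_drift(primary, backup) == {"cluster-mgmt": ("10.0.1.10", "10.0.1.99")}
```

A check like this, alarmed on rather than merely displayed, would have caught the stale entries weeks before the failover.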
That is a terrible-horrible-no-good-very-bad day, yes indeed. But it shows what I'm talking about here: several small errors crept in and turned what was supposed to be a perfectly handleable fault into many hours of downtime. The fault would have been discovered during the next routine failover test, but that test hadn't happened yet.
Just like the nuke-plant failure, reading this list makes you go "what kind of cowboy outfit allows this kind of thing to happen?"
Or maybe, if it has happened to you, "Oh crimeny, I've so been there. Here's hoping I retire before it happens again."
It happens to us all. Netflix reduces the risk with Chaos Monkey, using it to visibly trigger these small failures before they can cascade into big ones. And yet even they fall over when a really big failure happens naturally.
What can you do?
- Accept that the multiple-failure combinatorics problem is infinite and you won't be able to capture every fail case.
- Build your system to be as disaster resilient as possible.
- Test your remediations, and do so regularly.
- Validate your instrumentation is returning good results, and do so regularly.
- Cross-check where possible.
- Investigate glitches, and keep doing it after it gets tediously boring.
- Cause small failures and force your system to respond to them.
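That last item is the Chaos Monkey idea in miniature. A toy sketch of failure injection — the `Service` class here is a stand-in for real infrastructure, not any real API:

```python
# Minimal failure-injection harness: kill one instance at random, then
# verify the service still answers. In real life the health check would
# probe a load balancer; here it is "any instance still up".
import random

class Service:
    def __init__(self, instances):
        self.instances = set(instances)

    def kill(self, instance):
        self.instances.discard(instance)

    def healthy(self):
        return len(self.instances) > 0

def inject_failure(service, rng=random):
    victim = rng.choice(sorted(service.instances))
    service.kill(victim)
    return victim, service.healthy()

svc = Service(["web-1", "web-2", "web-3"])
victim, still_up = inject_failure(svc)
assert still_up  # two instances remain, so the service survives
```

The value isn't the kill itself; it's forcing the recovery path to run often enough that a frozen valve, so to speak, can't sit undetected for months.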
These are all known best-practices, and yet people are lazy, or can't get sufficient management buy-in to do it (a 'minimum viable product' is likely excessively vulnerable to this kind of thing). We do what we can, snark at those who visibly can't, and hope our turn doesn't come up.