October 2013 Archives

Anyone taking DevOps to heart should read about Normal Accidents. The book is about failure modes of nuclear power plants: highly automated, extremely instrumented systems that still manage to fail in spite of everything we do. The lessons carry well into the highly automated environments we try to build in our distributed systems.

There are a few key lessons to take from this book and its theory:

  • Root cause can be something seemingly completely unrelated to the actual problem.
  • Contributing causes can sneak in and turn what would have been a well-handled event into something that gets you bad press.
  • Monitoring instrumentation failures can be sneaky contributing causes.
  • Single-failure events are easily handled, and may be invisible.
  • Multiple-failure events are much harder to handle.
  • Multiple-failure events can take months to show up if the individual failures happened over the course of months and were invisible.

The book had a failure mode much like this one:

After analysis, the flow direction of a specific coolant pipe was identified as a critical item. If backflow occurred, hot fluid could enter areas not designed to handle it. As a result, a sensor was put in place to monitor flow direction, and automation was added to close a valve on the pipe if backflow was detected.

When the entire system was analyzed after a major event, it was discovered that the flow sensor had correctly identified backflow and had activated the valve-close automation. However, it was also discovered that the valve had frozen open due to corrosion several months prior to the event, and that the actuator arm had broken when the solenoid moved to close the valve. As a result, the valve was reported closed, and showed as such on the operator panel, when in fact it was open.

  • The valve had been subjected to manual examination 9 months before the event, and was due to be checked again in 3 more months. However, it had failed between checks.
  • The actuator system was checked monthly and had passed every check. The actuator breakage happened during one of these monthly checks.
  • The sensor on the actuator monitored the actuator's power draw. If the valve were frozen, the actuator should have shown an above-normal current draw. However, since the actuator arm had broken away from the valve, it drew below-normal current, which was not an alarm condition.
  • The breaking of the actuator arm was noted in the maintenance report for that monthly check as a "brief flicker of the lamp" and put down as a 'blip'; the arm failed before the current meter could trigger its over-draw event. As the system passed later tests, the event was disregarded.
  • The backflow sensor actually installed was not directional. It alarmed on zero-flow, not negative-flow.

Remediations:

  • Instrument the valve itself for open/close state.
  • Introduce new logic so that if the backflow sensor continues to detect backflow after the valve has been commanded closed, alarms are raised (a sketch of this cross-check follows the list).
  • Replace the backflow sensor with a directional one as originally called for.
  • Add a new flow sensor behind the valve.
  • Change the alerting on the actuator sensor to also alarm on too-low current draw.
  • Increase the frequency of visual inspection of the physical plant.
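
That second remediation is really a cross-check: a commanded state (valve told to close) verified against independent sensors (valve position, flow direction). In our world that's a few lines of monitoring logic. Here is a minimal Python sketch of the idea; the metric names, the fake telemetry, and the timings are invented for illustration and aren't from the book:

```python
# Hypothetical cross-check: alarm when a commanded state and the independent
# sensors disagree for longer than a grace period. Everything here is a
# stand-in for whatever telemetry and alerting you actually run.
import time

GRACE_PERIOD = 30  # seconds to allow the valve to actually finish closing

# Simulated telemetry; in real life this comes from your monitoring system.
TELEMETRY = {
    "valve.close_commanded_at": time.time() - 120,  # commanded two minutes ago
    "valve.position_sensor": "open",                # remediation #1: instrument the valve itself
    "coolant.flow_rate": -3.2,                      # signed; negative means backflow
}

def read_metric(name):
    return TELEMETRY.get(name)

def raise_alarm(message):
    print(f"ALARM: {message}")  # stand-in for your real alerting path

def check_valve_close_loop():
    commanded_at = read_metric("valve.close_commanded_at")
    if commanded_at is None:
        return  # nothing was commanded, nothing to cross-check

    if time.time() - commanded_at < GRACE_PERIOD:
        return  # the valve is still within its allowed travel time

    # Cross-check 1: the valve's own position sensor agrees with the command.
    if read_metric("valve.position_sensor") != "closed":
        raise_alarm("valve commanded closed, but position sensor disagrees")

    # Cross-check 2: the condition the valve was supposed to stop has stopped.
    if read_metric("coolant.flow_rate") < 0:
        raise_alarm("backflow still detected after the valve-close command")

check_valve_close_loop()
```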

That valve being open caused Fun Times To Be Had. If that valve system had been operating correctly, the fault that caused the backflow would have been isolated as the system designers intended and the overall damage contained. However, this contributing cause, one that happened months before the triggering event, turned a minor problem into a major one.

So, why did that reactor release radioactive materials into the environment? Well, it's complicated...

And yet, after reading the post-mortem report, you look at what actually failed and think, 'and these are the jokers running our nuclear power plants? We're lucky we're not all glowing in the dark!'

We get the same kind of fault-trees in massively automated distributed systems. Take this entirely fictional, but oh-so-plausible failure cascade:

ExampleCorp was notified by their datacenter provider of the need for emergency power maintenance in their primary datacenter. ExampleCorp (EC) operated a backup datacenter and had implemented a hot failover method, tested twice a year, for moving production to the backup facility. EC elected to perform a hot failover to the backup facility prior to the power work in their primary facility.

Shortly after the failover completed, the backup facility crashed hard. Automation attempted to fail back to the primary facility, but technicians at the primary facility had already begun, though not yet completed, safe-shutdown procedures. As a result, the fail-back was interrupted partway through, and production stopped cold.

Service recovery happened at the primary site after the power maintenance completed. However, the cold-start script was over a year out of date, so restoration was hampered by differences that surfaced during the startup process.

Analysis after the fact isolated several causes of the extensive downtime:

  • In the time since the last hot-failover test, EC had deployed new three-node management clusters for their network-switch configuration and software-management system, one cluster per site.
  • The EC-built DNS synchronization script used to keep the backup and primary sites in sync was transaction-oriented. A network fault five weeks before the event meant the transactions carrying the DNS updates for the cluster deployment were dropped, and nobody noticed (see the reconciliation sketch after this list).
  • The old three-node clusters were kept online "just in case".
  • The differences in cluster software versions between the two sites were displayed in EC's monitoring panel, but were not alarmed on and were disregarded as a 'glitch' by Operations. Interviews show that Ops staff are aware that the monitoring system will sometimes hold onto stale data if it isn't part of an alarm.
  • At the time of the cluster migration, Operations was testing a new switch firmware image. The image on the old cluster was determined to have a critical loading bug, which required attention from the switch vendor.
  • Two weeks prior to the event, EC performed an update of switch firmware using new code that passed validation. The new firmware was replicated to all cluster members in both sites by automation keyed to the IP addresses of the cluster members. The old cluster members were not updated.
  • The automation driving the switch firmware update relied on the non-synchronized DNS entries and reported no problems applying updates. The primary site got the known-good firmware; the backup site got the known-bad firmware.
  • The hot-failover network load triggered the fault in the backup site's switch firmware, causing switches to reboot every 5 minutes.
  • Recovery logic in the application attempted to work around the massive network faults and ended up duplicating some database transactions, and losing others. Some corrupted data was transferred to the primary site before it was fully shut down.
  • Lack of technical personnel physically at the backup site hampered recovery from the backup site and extended the outage.
  • Out-of-date documentation hampered efforts to restart services from a cold stop.
  • The inconsistent state of the databases further delayed recovery.
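
That dropped-transaction failure mode is exactly what a periodic full-state cross-check is for: rather than trusting that every change arrived, diff the complete record sets served by the two sites and alarm on drift. A minimal sketch of such a reconciliation pass, with a stand-in fetch_zone_records() and made-up hostnames rather than anything EC actually ran:

```python
# Hypothetical reconciliation check, not EC's actual script: periodically
# compare the full record sets of the two sites and alarm on any drift.
# The stub below fakes the data; in real life you'd pull it via zone
# transfer, your DNS provider's API, or similar.

def fetch_zone_records(server, zone):
    """Stand-in: return {record_name: record_value} as served by `server`."""
    fake_data = {
        "ns1.primary.example.com": {"switchmgmt": "10.0.1.10", "app01": "10.0.2.20"},
        "ns1.backup.example.com":  {"app01": "10.0.2.20"},  # missing the new cluster record
    }
    return fake_data[server]

def reconcile(zone, primary="ns1.primary.example.com", backup="ns1.backup.example.com"):
    primary_records = fetch_zone_records(primary, zone)
    backup_records = fetch_zone_records(backup, zone)

    missing = set(primary_records) - set(backup_records)
    extra = set(backup_records) - set(primary_records)
    differing = {name for name in set(primary_records) & set(backup_records)
                 if primary_records[name] != backup_records[name]}

    if missing or extra or differing:
        # Drift means a sync transaction was dropped somewhere; better to
        # alarm now than to find out five weeks later during a failover.
        print(f"ALARM: DNS drift in {zone}: missing={sorted(missing)} "
              f"extra={sorted(extra)} differing={sorted(differing)}")
        return False
    return True

reconcile("example.com")
```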

That is a terrible-horrible-no-good-very-bad day, yes indeed. However, it shows what I'm talking about here. Several small errors crept in and turned what was supposed to be a perfectly handleable fault into something that caused many hours of downtime. This fault would have been discovered during the next routine failover test, but that hadn't happened yet.

Just as with the nuke-plant failure, reading this list makes you go "what kind of cowboy outfit allows this kind of thing to happen?"

Or maybe, if it has happened to you, "Oh crimeny, I've so been there. Here's hoping I retire before it happens again."

It happens to us all. Netflix reduces this through the Chaos Monkey, using it to visibly trigger these small failures before they can cascade into big ones. And yet even they fall over when a really big failure happens naturally.

What can you do?

  • Accept that the multiple-failure combinatorics problem is effectively infinite and you won't be able to capture every failure case.
  • Build your system to be as disaster resilient as possible.
  • Test your remediations, and do so regularly.
  • Validate that your instrumentation is returning good results, and do so regularly.
  • Cross-check where possible.
  • Investigate glitches, and keep doing it after it gets tediously boring.
  • Cause small failures and force your system to respond to them (a sketch of one way to do this follows the list).
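
To make the last two bullets concrete: one way to validate instrumentation is to deliberately break something small and verify that the alarm you expect actually fires within a deadline. A rough Python sketch, with simulated stand-ins for the fault injection and the alerting query (none of this is anyone's real tooling):

```python
# "Does the smoke detector still work" check: inject a known, harmless fault
# and verify that monitoring notices it within a deadline. The stand-ins
# below simulate an environment where monitoring works instantly.
import time

FAULT_ACTIVE = False

def inject_fault():
    """Stand-in: e.g. stop a canary service instance or block a test port."""
    global FAULT_ACTIVE
    FAULT_ACTIVE = True

def clear_fault():
    """Stand-in: undo whatever inject_fault() did."""
    global FAULT_ACTIVE
    FAULT_ACTIVE = False

def alarm_is_firing(alarm_name):
    """Stand-in: query your alerting system for an active alarm."""
    return FAULT_ACTIVE  # simulated: monitoring notices immediately

def verify_monitoring(alarm_name="canary-service-down", deadline=300, poll=15):
    inject_fault()
    try:
        waited = 0
        while waited <= deadline:
            if alarm_is_firing(alarm_name):
                return True           # the instrumentation noticed the failure
            time.sleep(poll)
            waited += poll
        # The failure happened and nobody noticed; that is itself an incident.
        print(f"ALARM: monitoring never detected the injected fault '{alarm_name}'")
        return False
    finally:
        clear_fault()                 # always put the system back

print("monitoring verified:", verify_monitoring())
```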

These are all known best practices, and yet people are lazy, or can't get sufficient management buy-in to do them (a 'minimum viable product' is likely excessively vulnerable to this kind of thing). We do what we can, snark at those who visibly can't, and hope our turn doesn't come up.

On how to answer surveys

[As it is National Coming Out Day, I point out I did that here]

If you're a systems administrator, you've taken surveys. It seems any vendor with a support contract will send you one after you interact with support in any way, and most commercial websites have 'how are we doing' popups to capture opinion. Some of these are quick three-question wonders just to test the waters of opinion, but once in a while a real survey ends up in front of us.

Real surveys have multiple pages, ask the same kind of question multiple ways to cross-check opinion, and often have an incentive attached to actually take them. The real surveys are prepared by outside parties, presumably to be scientific in the opinion they're trying to assess (there is in fact a difference between 'it looks right' and 'there is science here').

There is an art to answering surveys, and it depends on what your goals are as an answerer and what the survey is attempting to gather.

The vast majority of surveys I get presented come in three flavors:

  1. Hand-wringy "how are we doing???" quickies.
  2. Standardized "how did we do?" longer form ones, usually after interacting with support.
  3. Formalized surveys assessing satisfaction with or overall opinion of a company or product (almost always from the company itself).

The first two kinds are the type of survey that will elicit a response from the company if you answer it right. If that's what you're looking for (perhaps you did have a bad experience with support, or maybe there is an ongoing frustration you want to talk to someone about), here is how to get it.

5-point scales (5 is Very Satisfied, 1 is Very Dissatisfied)

4-5: Good! No contact needed.
3: Some contact likelihood.
2: Guaranteed contact. This is dissatisfied, but hasn't written the company off yet. Can be salvaged!
1: Has written the company off.

A 1-point response is someone pissed off and blowing off steam. A 2-point response is someone with specific problems that might be redeemable.

5-point scales are hard, which is why most surveys seem to be going with 10-pointers these days.

10-point scales (10 is Very Satisfied, 1 is Very Dissatisfied)

8-10: Good!
7: Neutral. Good option to pick to avoid getting called.
3-6: Bad, but redeemable. Nigh guaranteed to get a call.
2: Really bad, probable call, but they'll handle you with kid gloves.
1: Written off.

This scale is handy since you get a little spectrum for how pissed off someone is. If you want someone to call you, anything from 3-6 is a good choice. 7 is good for "I don't like it but I don't want to talk about it."


Rule of thumb: 1-point responses are vindictive, and are probably people who can't be reasonable on this. Safely ignorable.


Formalized surveys need as many respondents as possible, good and bad, to be scientific.

This is why they usually come with incentives, to lure in the people who don't give a damn. If you don't want to answer it because it comes from a company that sells crap, this is your chance to let them know. Save the loaded language for Twitter, though; subtlety carries far.


Quickie 3-question wonder surveys are mostly to smoke out dissatisfied customers, and maybe take opinion temperature.

I've already described how it's used as a customer-service opportunity to win back the dissatisfied. As for opinion temperature, the internal dialog goes kind of like this:

Our TPS score dropped from 8.2 to 7.9 this month, and our 1-point reviews jumped from 5% to 8% of respondents. Clearly our customers hate the new interface.

The people who take the time to answer those are people who have something to say (they love or hate the product), or those struck by a whim. It won't catch the mushy middle who just don't care. They're about as useful for gauging opinion as a finger in the wind is for determining wind speed.


Companies do strange things with unscientific survey data

Strange things like coupling Support Engineer pay to the average score on the 3-question-wonder post-call surveys. If you happen to know what a certain company does with this data, you can further game it to your advantage.


Way back in the pre-internet era my family used to get phone survey calls periodically. One of the first questions they asked was this doozy:

Does anyone in your family work in market research or a related field?

Because people who know how surveys work can game them and ruin their scientific validity. It just so happens we did have such a person in our household. And it just so happens they're right.