December 2021 Archives

Allowing 'root cause analysis'

"Root cause analysis" as a term provokes a strong response from the (software) reliability industry. The most common complaint:

There's no such thing as a "root" cause. They're always complex failures. There's this nice book about complex failures and the Three Mile Island incident I recommend you read.

This is the correct take. In any software system, even software+hardware ones, what triggered the incident is almost never the sole and complete cause of it. An effective technique for figuring out the causal chain is the "five whys" method. To take a fictional example:

  • Why did Icarus die?
    • Because he flew too close to the sun.
  • Why did he fly too close to the sun?
    • Because of the hubris of man.
  • Why was Icarus full of hubris?
    • Because his low altitude tests passed, and he made incorrect assumptions.
  • Why did the low altitude tests pass, but not the high altitude ones?
    • Because the material used to attach the feathers became more pliable the closer to the sun he was.
  • What happened after the material became more pliable?
    • It lost feathers, which compromised the flight profile, leading to a fatal incident.

    From this exercise we can determine the causal chain:

    Structural integrity monitoring was not included in the testing regime, which led to a failure to detect reduced binding efficiency at higher ambient temperatures. The decision to move to high-elevation testing was made in the absence of a data-driven testing framework. The combination of the lack of data-driven testing and the lack of comprehensive materials surveillance allowed the lead investigator to make a fatal decision.

    This is a bit more comprehensive than the parable's typical, 'hubris of man,' moral. There are whole books written about building a post-incident review process, with the goal of maximizing the learning earned from the incident, in software systems. There are no root causes, only complex failures; and you reason about complex failures differently than attempting to find a lone root cause.

    Except.

    Except.

    The phrase 'root cause analysis' is freaking everywhere, and this is in spite of a decade of SREs pushing against the term. There are a few reasons for this, but to start the explanation, here is another example from my history. My current manager knows better than to call incident reviews a "root cause analysis." Yet, when we have a vendor of ours shit the bed fantastically enough that we get in trouble with our own customers, they are the first to press our account managers for an RCA Report. Why?

    Because an RCA Report is also a customer relations tool. My manager is code-switching between our internal engineer-driven incident review processes, which don't use the term, and the customer relations concept, which manifestly does use it. Not at all coincidentally, other SREs grind their teeth any time a customer asks for an RCA Report, because what we do isn't Root Cause Analysis.

    Aside: For all that we as an SRE community focus on availability and build customer-centered metrics to base our SLOs on, SRE as a job function is often highly disconnected from the actual people-to-people interface with customers. Some companies will allow a senior SRE onto a customer call to better explain a failure chain, but my understanding is this practice is rare; most companies are more concerned that the senior SRE will over-share in some way that will compromise the company's liability stances.

    At the end of the day, customers want answers to three questions about the incident they're concerned over, all so they can reassess the risk of continuing to do business with us:

    1. What happened?
    2. What did you do to fix it?
    3. What are you doing to prevent this happening again?

    "What happened?" isn't supposed to be a single causative action; customers want to know if we understand the causal chain. 'Root cause' in this context is less a technical term meaning 'single', and more a term of art meaning 'failure'.

    The other reason that 'RCA' shows up as often as it does is that the term itself shows up in general safety engineering literature. DayJob has had a few availability incidents lately, and after one of them a customer asked for a type of report I'd never heard of before: a CAPA report. I had to google that one. CAPA means corrective and preventive actions. Also known as questions 2 and 3 above. My industry has been building blameless post-mortem processes for a decade plus now, and never used the term CAPA. The concept was instantly familiar, even if I hadn't heard the acronym before.

    I found a blog post from a firm specializing in safety inside the beverage industry that describes how an RCA interacts with a CAPA. The beverage industry operates machine plants with bottle fillers and everything else involved in food handling. The software industry, um, doesn't (usually). Because beverage manufacturing and software manufacturing are both industrial processes, the same concepts apply to both. If you read into what an RCA is for them, it reads a lot like a complex failure report.

    This led me to a realization: "Root cause analysis" is a term of art, not a technical term.

    Engineers look at that phrase and cringe, because what it says is not what it means and we find that kind of ambiguity to be a bug. This is probably why we're not allowed near customers unless we have close supervision or experience in customer-facing technical writing.

    Nowadays I'm hearing internal folk decry "root cause analysis" as the wrong way to think about problems, and I nod and tell them they're right. While also telling them that we'll continue to use that term with customers because that's what customers are asking for, and we'll write those RCA reports like the complex failure analyses they are. We'll even give them a CAPA report and not call it a CAPA (unless they ask for a CAPA by name).