July 2024 Archives

SysAdmins have no trouble making big lists of what can go wrong and what we're doing to stave it off a little longer. The tricky problem is pushing large organizations to take a harder look at systemic risks and take them seriously. I mean, the big companies have to have disaster recovery (DR) plans for compliance reasons; but there is a big difference between a box-ticking DR plan and a comprehensive one.

Any company big enough to have gotten past the "running out of money is the biggest disaster" phase has probably spent some time thinking about what to do if things go wrong. But how do you, the engineer in the room, get the deciders to think about disasters in productive ways?

The really big disasters are obvious:

  • The datacenter catches fire after a hurricane
  • The Region goes dark due to a major earthquake
  • Pandemic flu means 60% of the office is offline at the same time
  • An engineer or automation accidentally:
    • Drops all the tables in the database
    • Deletes all the objects out of the object store
    • Destroys all the clusters/servlets/pods
    • Deconfigures the VPN
  • The above happens and you find your backups haven't worked in months

All obvious stuff, and building to deal with them will let you tick the box for compliance DR. Cool.

But there are other disasters, the sneaky ones that make you think and take a hard look at processes and procedures in a way that the "oops we lost everything of [x] type" disasters generally don't.

  • An attacker subverts your laptop management software (JAMF, InTune, etc) and pushes a cryptolocker to all employee laptops
  • 30% of your application secrets got exposed through a server side request forgery (SSRF) attack
  • Nefarious personages get access to your continuous integration environment and inject trojans into your dependency chains
  • A key third party, such as your payment processor, gets ransomwared and goes offline for three weeks
  • A Slack/Teams bot got subverted and has been feeding internal data to unauthorized third parties for months

The above are all kinda "security" disasters, and that's my point. SysAdmins sometimes think of these, but even we are guilty of not having the right mental models to rattle them off the top of our heads when asked. Asking about disasters like the ones on this list should start conversations that generally don't happen otherwise. Or you get the bad case: people shrug and say "that's Security's problem, not ours," which is a sign you have a toxic reliability culture.

Security-type disasters have a phase that merely technical disasters lack: how do we restore trust in production systems? In technical disasters, you can start recovery as soon as you've detected the disaster. For security disasters, recovery has to wait until the attacker has been evicted, which can take a while. This eviction delay means key recovery concepts like Recovery Time and Recovery Point Objectives (RTO/RPO) will be subtly different.
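
To make that concrete, here's a minimal sketch of the timeline math. The durations and the 8-hour RTO are invented for illustration, not taken from any real incident; the point is just how eviction time lands in front of restoration and quietly blows an RTO that a purely technical failure would have met:

```python
from dataclasses import dataclass

# Hypothetical recovery timeline, in hours. All numbers are illustrative.
@dataclass
class Incident:
    detection: float    # time to notice something is wrong
    eviction: float     # time to evict the attacker and restore trust (0 for purely technical failures)
    restoration: float  # time to rebuild and restore from backups

    @property
    def time_to_recover(self) -> float:
        # Restoration can't start until the environment is trusted again,
        # so eviction time sits in front of it.
        return self.detection + self.eviction + self.restoration

RTO_HOURS = 8  # example Recovery Time Objective

technical = Incident(detection=0.5, eviction=0.0, restoration=6.0)
security = Incident(detection=0.5, eviction=72.0, restoration=6.0)

for name, incident in (("technical", technical), ("security", security)):
    verdict = "meets" if incident.time_to_recover <= RTO_HOURS else "blows past"
    print(f"{name}: {incident.time_to_recover:.1f}h to recover, {verdict} the {RTO_HOURS}h RTO")
```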

If you're trying to knock loose some ossified DR thinking, these security-type disasters can crack open new opportunities to make your job safer.

I've now spent over a decade teaching how alarms are supposed to work (specific, actionable, with the appropriate urgency) and even wrote a book on how to manage metrics systems. One topic I was repeatedly asked to cover in the book, but declined because the topic is big enough for its own book, is how to do metrics right. The desire for an expert to lay down how to do metrics right comes from a number of directions:

  • No one ever looked at ours in a systematic way and our alerts are terrible [This is asking about alerts, not metrics; but they still were indirectly asking about metrics]
  • We keep having incidents and our metrics aren't helping, how do we make them help?
  • Our teams have so many alarms that important ones are getting missed [Again, asking about alerts]
  • We've half-assed it, and now we're hitting a growth spurt. How do we know what we should be looking for?

People really do conflate alarms/alerts with metrics, so any discussion about "how do we do metrics better" is often a "how do we do alarms better" question in disguise. As for the other two points, where people have been picking metrics on vibes and that's no longer scaling, we actually do have a lot of advice: there's a whole menu of "golden signals" to pick from, depending on how your application is shaped.
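
If you're staring at that menu for the first time, here is a minimal sketch of what the four classic golden signals (latency, traffic, errors, saturation) look like as instrumentation, using the prometheus_client library. The metric names, labels, and port are assumptions for illustration, not a standard:

```python
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Latency: how long requests take.
REQUEST_LATENCY = Histogram("http_request_duration_seconds",
                            "Time spent handling HTTP requests")
# Traffic and errors: total requests, labeled by status code.
REQUESTS_TOTAL = Counter("http_requests_total",
                         "HTTP requests handled", ["status"])
# Saturation: how full the system is; here, a work-queue depth.
QUEUE_DEPTH = Gauge("worker_queue_depth",
                    "Jobs waiting in the work queue")

def handle_request() -> None:
    with REQUEST_LATENCY.time():               # records latency
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
        status = "500" if random.random() < 0.02 else "200"
    REQUESTS_TOTAL.labels(status=status).inc() # records traffic and errors

if __name__ == "__main__":
    start_http_server(9100)  # exposes /metrics for scraping
    while True:
        QUEUE_DEPTH.set(random.randint(0, 50))  # fake saturation reading
        handle_request()
```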

That's only sort of why I'm writing this.

In the mathematical construct of Site Reliability Engineering, where everything is statistics and numerical analysis, metrics are easy. Track the things that affect availability, regularly triage your metrics to ensure continued relevance, and put human processes into place to make sure you're not burning out your worker-units. But the antiseptic concept of SRE only exists in a few places; the rest of us have to pollute the purity of math with human emotions. Let me explain.

Consider your Incident Management process. There are certain questions that commonly arise when people are doing post-incident reviews:

  • Could we have caught this before release? If so, what sort of pre-release checks should we add to catch this earlier?
  • Did we learn about this from metrics or customers? If customers, what metrics do we need to add to catch this earlier? If metrics, what processes or alarms should we tune to catch this earlier?
  • Could we have caught this before the feature flag rolled out to the Emerald users? Do we need to tune the alarm thresholds to catch issues like this in groups with less feature usage, before it reaches the high-value customers on Emerald plans? (See the sketch after this list.)
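
That last question is the kind of thing that ends up encoded in alert thresholds. Here's a minimal sketch of per-cohort error-rate alerting; the plan names and thresholds are invented for illustration, and the idea is simply that low-usage cohorts surface a bad flag rollout before it touches the Emerald plan:

```python
# Hypothetical per-cohort error-rate thresholds. Plan names and numbers are
# invented; tighter thresholds guard the higher-value cohorts.
ERROR_RATE_THRESHOLDS = {
    "free":    0.05,
    "team":    0.02,
    "emerald": 0.005,
}

def should_alert(plan: str, errors: int, requests: int) -> bool:
    """True if this cohort's error rate crosses its own threshold."""
    if requests == 0:
        return False
    return errors / requests > ERROR_RATE_THRESHOLDS[plan]

# A flag at 10% rollout surfaces the regression in the free cohort first,
# so its alert fires before Emerald customers ever see the bug.
print(should_alert("free", errors=12, requests=150))    # True  (8% > 5%)
print(should_alert("emerald", errors=1, requests=400))  # False (0.25% < 0.5%)
```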

And so on. Note that each question asks about refining or adding metrics. Emotionally, metrics represent anxieties. Metrics are added to catch issues before they hurt us again. Metrics are retained because they're tracking something that used to hurt us and might hurt again. This makes removing metrics hard; the people involved remember why certain metrics are present, intuitively know they need tracking, and so emotion says to keep them.

Metrics are scar tissue, and removing scar tissue is hard, bloody work. How do you reduce the number of metrics while not compromising your availability goals? You need the hard math of SRE to work down those emotions, but all it takes is one Engineering Manager saying "this prevented a SEV, keep it" to blow that effort up. This also means you'll have much better luck with a metric-reformation effort if teams are already feeling the pinch of alert fatigue, or if your SaaS metrics-provider bills are getting big enough that the top of the company is looking at metric usage to reduce costs.
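
For reference, the "hard math" mostly comes down to error-budget arithmetic. A minimal sketch with assumed numbers (a 99.9% availability target over a 30-day window):

```python
# Error-budget arithmetic with assumed numbers: a 99.9% availability SLO
# over a 30-day window.
SLO = 0.999
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in the window

budget_minutes = (1 - SLO) * WINDOW_MINUTES
print(f"Allowed downtime this window: {budget_minutes:.1f} minutes")  # 43.2

# If incidents have already burned 30 of those minutes, what's left (and how
# fast it's burning) is the unemotional argument for which metrics and alarms
# actually earn their keep.
spent_minutes = 30
print(f"Budget remaining: {budget_minutes - spent_minutes:.1f} minutes")  # 13.2
```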

Sometimes, metrics feed into Business Intelligence. That's less about scar tissue and more about optimizing your company's revenue operations. Such metrics are less likely to lead to rapid-response on-call rotations, but can still lead to months-long investigations into revenue declines. That's a different but related problem.

I could write a book about making your metrics suck less, but that book by necessity has to cover a lot of human-factors issues and has to account for the role of Incident Management in metrics sprawl. Metrics are scar tissue, keep that in mind.