Recently in disasters Category

Applied risk management

I've been in the tech industry for an uncomfortable number of years, and I've been doing resilience planning the whole time. You know, when and how often to take backups, segueing into worrying about power diversity, things like that. My last two years at Dropbox gave me exposure to how that works when you have multiple datacenters. It gets complex, and there are enough moving parts that you can actually build models of expected failure rates in a given year to help you prioritize remediation and prep work.
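To give a flavor of what that modeling looks like, here is a minimal sketch in Python; every failure mode, rate, and duration below is a made-up number for illustration, not data from any real environment. The idea is just: expected annual downtime = incidents per year times hours lost per incident, sorted worst-first.

    # Hypothetical failure modes: (expected incidents per year, hours lost per incident)
    failure_modes = {
        "single rack loses power": (2.0, 0.5),
        "datacenter loses utility power": (0.3, 4.0),
        "fiber cut isolates a datacenter": (0.1, 8.0),
        "bad deploy takes down the API tier": (4.0, 0.25),
    }

    # Expected annual downtime = rate * duration. Sort worst-first to see
    # which remediation or prep work buys back the most hours.
    ranked = sorted(failure_modes.items(), key=lambda kv: kv[1][0] * kv[1][1], reverse=True)
    for name, (rate, hours) in ranked:
        print(f"{name}: ~{rate * hours:.2f} expected hours of downtime per year")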

Meanwhile, everyone in the tech-disaster industry peeks over the shoulders of environmental disaster recoveries like hurricanes and earthquakes. You can learn a lot by watching the pros. I've talked about some of what we've learned; mostly it has been procedural in nature.

Since then, the United States elected a guy who wants to be dictator, and a Congress that seems willing to let it happen. For those of us in the disliked minority of the moment, we're facing concerted efforts to roll back our ability to exist in public. That's risk. Below the fold I talk about how I apply the techniques I learned from IT risk management to assess my own risks. It turns out a risk model for "dictatorship in America" can't lean on prior art the way a model for "datacenter going offline" can, where there is abundant prior art and even incident rates to factor in.

Back in November I posted about how to categorize your incidents using the pick-one list common across incident automation platforms. In that post I said:

A few organizations go so far as to have a fully separate process for the 'High' and 'Critical' urgencies of events, maybe calling them Disaster Recovery events instead of Incidents. DR events need to be rare, which means that process isn't as well exercised as Incident response. However, a separate process makes it abundantly clear that certain urgencies and scopes require different process overall. More on this in a later blog-post.

This is the later blog post.

The SaaS industry as a whole has been looking to the Incident Command System (which grew out of California wildfire command) for inspiration on handling technical incidents. The basic structure is familiar to any SaaS engineer:

  • There is an Incident Commander who is responsible for running the whole thing, including post-incident processes
  • There is a Technical Lead who is responsible for the technical response

There may be additional roles available depending on organizational specifics:

  • A Business Manager who is responsible for the customer-facing response
  • A Legal Manager who is responsible for anything to do with legal
  • A Security Lead who is responsible for heading security investigations

Again, familiar. But truly large incidents put stress on this model. In a given year, the vast majority of incidents experienced by an engineering organization will be the grass-fire variety that can be handled by a team of four people in under 30 minutes. What happens when a major event hits?

The example I'm using here is a private information disclosure by a hostile party using a compromised credential. Someone not employed by the company dumped a database they shouldn't have had access to, and that database contained data that requires disclosure in the case of compromise. Given this, we already know some of the workstreams that incident response will be running once this activity is discovered:

  • Investigatory work to determine what else the attacker got access to and fully define the scope of what leaked
  • Locking down the infrastructure to close the holes used by the attacker for the identified access
  • Cycling/retiring credentials possibly exposed to the attacker
  • Regulated notification generation and execution
  • Technical remediation work to lock down any exploited code vulnerabilities

An antiseptic list, but a scary one. The moment the company officially notices a breach of private information, legislation worldwide starts timers on when privacy regulators or the public need to be informed. For a profit-driven company, this means admitting fault in public, which none of them do lightly given the lawsuits that will follow. For publicly traded companies, stockholder notification will also need to be generated. Incidents like this look very little like the availability SLA-breach SEVs that happen 2-3 times a month in different systems.

Based on the rubric I showed back in November, an incident of this type is of Critical urgency due to regulated timelines, and will require either Cross-Org or C-level response depending on the size of the company. What's more, the need to figure out where the attacker went blocks later stages of response, so this response process will actually be a 24-hour operation and likely run several days. No one person can safely stay awake for 4+ days straight.

The Incident Command Process defines three types of command structure:

  • Solitary command - where one person is running the whole show
  • Unified command - where multiple jurisdictions are involved and need to coordinate; it also provides shift changes by rotating who serves as Operations Chief (what SaaS folk call the Technical Lead)
  • Area command - where multiple incidents are part of a larger complex, and an Area Commander supports each incident's Incident Command

Incidents of the scale of our private information breach lean into the Area Command style for a few reasons. First and foremost, there are discrete workstreams that need to be executed by different groups, such as the security review to isolate scope, building regulated notifications, and cycling credentials. All those workstreams need people to run them, and those workstream leads need to report to incident command. That looks a lot like Area Command to me.

If your daily incident experience is 4-7 person team responses, how ready are you to be involved in an Area Command style response? Not at all.

If you've been there for years and have seen a few multi-org responses in your time, how ready are you to handle Area Command style response? Better, you might be a good workstream lead.

One thing the Incident Command Process makes clear is that Area Commanders do not have an operational role, meaning they're not involved in the technical remediation. Their job is coordination, logistics, and high-level decision making across response areas. For our pretend SaaS company, a good Area Commander will be:

  • Someone who has experience with incidents involving legal response
  • Someone who has experience with large security response, because the most likely incidents of this size are security related
  • Someone who has experience with incidents involving multiple workstreams requiring workstream leaders
  • Someone who has experience communicating with C-Levels and has their respect
  • Two to four of these people, in order to safely staff a 24-hour response for multiple days

Is your company equipped to handle this scale of response?

In many cases, probably not. Companies handle incidents of this type a few different ways. As I mentioned in the earlier post, some categorize problems like this as a disaster instead of an incident and invoke a different process. This has the advantage of making it clear the response for these is different, at the cost of having far fewer people familiar with the response methods. You make up for the lack of in situ, learn-by-doing training by regularly re-certifying key leaders on the process.

Other companies extend the existing incident response process on the fly rather than risk having a separate process that will get stale. This works so long as you have some people around who kind of know what they're doing and can herd others into the right shape. Though, after the second disaster of this scale, people will start talking about how to formalize procedures.

Whichever way your company goes, start thinking about this. Unless you're working for the hyperscalers, incidents of this response scope are going to be rare. This means you need to schedule quarterly time to train, practice, and certify your Area Commanders and workstream leads. This will speed up response time overall, because less time will be spent arguing over command and feedback structures.

Incident response programs

Honeycomb had a nice post where they describe dropping a priority list of incident severities in favor of an attribute list. Their list is still a pick-one list, but instead of using a 1-4 SEV scale, they're using a list of types like "ambiguous," "security," and "internal." The post goes into some detail about the problems with a unified list across a large organization, and the different response-level needs of different types of incidents. All very true.

A good incident response program needs to be approachable by anyone in the company, meaning anyone looking to open one should have reasonable success in picking incident attributes correctly. The incident automation industry, tools such as PagerDuty's Jeli and the Rootly platform, has settled on a pick-one list for severity, sometimes with support for additional fields. Unless a company is looking to home-build its own incident automation for creating Slack channels, managing the post-incident review process, and tracking remediation action items, these de facto conventions constrain the options available to an incident response program.

As Honeycomb pointed out, there are two axes that need to be captured by "severity": urgency and level of response. I propose the following pair of attributes:

Urgency

  1. Planning: the problem can be addressed through normal sprint or quarterly planning processes.
  2. Low: the problem has long lead times to either develop or validate the solution, where higher urgency would result in a lot of human resources stuck in wait loops.
  3. Medium: the problem can be addressed in regular business-hours operations; waiting overnight or over a weekend won't make things worse. Can preempt sprint-level deliverable targets without question.
  4. High: the problem needs around-the-clock response and can preempt quarterly deliverable targets without question.
  5. Critical: the problem requires investor notification or other regulated public disclosure, and likely affects annual planning. Rare by definition.

Level of response

  1. Individual: The person who broke it can revert/fix it without much effort, and impact blast-radius is limited to one team. Post-incident review may not be needed beyond the team level.
  2. Team: A single team can manage the full response, such as an issue with a single service. Impact blast radius is likely one team. Post-incident review at the peer-team level.
  3. Peer team: A group of teams in the same department are involved in response due to interdependencies or the nature of the event. Impact blast-radius is clearly multi-team. Post-incident review at the peer-team level, and higher up the org-chart if the management chain is deep enough for it.
  4. Cross-org: Major incident territory, where the issue cuts across more than one functional group. These are rare. Impact blast-radius may be whole-company, but likely whole-product. Post-incident review will be global.
  5. C-level: A high executive needs to run it because the response is whole-company in scope. Will involve multiple post-incident reviews.

Is Private? Yes/No - If yes, only the people involved in the response are notified of the incident and updates. Useful for Security and Compliance type incidents, where discoverability is actually bad. Some incidents qualify as Material Non-Public Information, which matters to publicly traded companies.
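If you were to model this pair of attributes plus the privacy flag in your own tooling, it could look something like the sketch below. The field names and the cut-off for invoking a separate DR process are my own assumptions, not any particular platform's schema.

    from dataclasses import dataclass
    from enum import IntEnum

    class Urgency(IntEnum):
        PLANNING = 1
        LOW = 2
        MEDIUM = 3
        HIGH = 4
        CRITICAL = 5

    class ResponseLevel(IntEnum):
        INDIVIDUAL = 1
        TEAM = 2
        PEER_TEAM = 3
        CROSS_ORG = 4
        C_LEVEL = 5

    @dataclass
    class IncidentClassification:
        urgency: Urgency
        response: ResponseLevel
        is_private: bool = False  # Security/Compliance incidents, possibly MNPI

        def invokes_dr_process(self) -> bool:
            # Assumption: High or Critical urgency at Cross-org scope or above is
            # where a separate Disaster Recovery process would kick in.
            return self.urgency >= Urgency.HIGH and self.response >= ResponseLevel.CROSS_ORG

    breach = IncidentClassification(Urgency.CRITICAL, ResponseLevel.CROSS_ORG, is_private=True)
    print(breach.invokes_dr_process())        # True
    print(len(Urgency) * len(ResponseLevel))  # 25 pairs, 50 once Is Private is included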

The combinatorics give us 5×5 = 25 pairs, 50 if you include Is Private, which makes for an unwieldy pick-one list. However, like stellar types there is a kind of main sequence of pairs that are more common, with problematic outliers that make simple solutions a troublesome fit. Let's look at a few pairs that are on the main sequence of event types:

  • Planning + Individual: Probably a feature-flag had to be rolled back real quick. Spend some time digging into the case. Incidents like this sometimes get classified "bug" instead of "incident."
  • Low + Team: Such as a Business Intelligence failure, where revenue attribution was discovered to be incorrect for a new feature, and time is needed to back-correct issues and validate against expectations. May also be classified as "bug" instead of "incident."
  • Medium + Team: Probably the most common incident type that doesn't get classified as a "bug," these are the highway-verge grass fires of the incident world: small in scope, over quickly, and one team can deal with it.
  • Medium + Peer Team: Much like the previous but involving more systems in scope. Likely requires coordinated response between multiple teams to reach a solution. These teams work together a lot, by definition, so it would be a professional and quick response.
  • High + Cross-org: A platform system had a failure that affected how application code responds to platform outages, leading to a complex, multi-org response. Response would include possibly renegotiating SLAs between platform and customer-facing systems. Remediating the Log4J vulnerability, which requires touching every usage of Java in the company, inclusive of vendored usage, also counts as this kind of incident.
  • Critical + Cross-org: An event like the Log4J vulnerability, but where the Security org has evidence that probes found something. The same remediation response as the previous, with added "reestablish trust in the system" work on top of it, plus the regulated customer notices.

Six of 25 combinations. But some of the others are still viable, even if they don't look plausible on the surface. Let's look at a few:

  • Critical + Team: A bug is found in SOX reporting that suggests incorrect data was reported to stockholders. While the C-levels are interested, they're not in the response loop beyond the 'stakeholder' role and being the signature that stockholder communications will be issued under.
  • Low + Cross-org: Rapid retirement of a deprecated platform system, forcing the teams still using the old system to crash-migrate to the new one.
  • Planning + Cross-org: The decision to retire a platform system is made as part of an incident, and migrations are inserted into regular planning.

How is an organization supposed to build a usable pick-one list from this mess? This is hard work!

Some organizations solve this by bucketing incidents using another field, and allowing the pick-one list to mean different things based on what that other field says. A Security SEV1 gets a different scale of response than a Revenue SEV1, which in turn gets a different type of response than an Availability SEV1. Systems like this have problems with incidents that cross buckets, such as a Security issue that also affects Availability. It's for this reason that Honeycomb has an 'ambiguous' bucket.
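A minimal sketch of that bucketing approach, with made-up bucket names and response descriptions; the point is that the meaning of "SEV1" is looked up per bucket, and cross-bucket incidents fall through to a human.

    # Hypothetical: what a SEV1 means depends on which bucket it lands in.
    RESPONSE_BY_BUCKET = {
        ("security", 1): "C-level response; legal and disclosure workstreams spin up",
        ("revenue", 1): "Cross-org response; billing and finance are in the loop",
        ("availability", 1): "Cross-org response; customer comms and SLA tracking",
    }

    def response_for(bucket: str, sev: int) -> str:
        # Incidents that cross buckets (a Security issue that also hurts
        # Availability) don't fit cleanly -- hence an 'ambiguous' bucket.
        return RESPONSE_BY_BUCKET.get((bucket, sev), "ambiguous: route to a human to triage")

    print(response_for("security", 1))
    print(response_for("security-and-availability", 1))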

A few organizations go so far as to have a fully separate process for the 'High' and 'Critical' urgencies of events, maybe calling them Disaster Recovery events instead of Incidents. DR events need to be rare, which means that process isn't as well exercised as Incident response. However, a separate process makes it abundantly clear that certain urgencies and scopes require different process overall. More on this in a later blog-post.

Other orgs handle the outlier problem differently, taking such events out of incidents and into another process altogether. Longer-flow problems (the Low urgency above) get called something like a Code Yellow, after the Google practice, with a Code Red reserved for the Critical + C-level pairing: the long-flow, big problems.

Honeycomb took the bucketing idea one step further and dropped urgency and level of response entirely, focusing instead on incident type. A process like this still needs ways to manage urgency and response-scope differences, but those are handled at a layer below the incident automation. In my opinion, a setup like this works best when Engineering is around Dunbar's Number or less in size, allowing informal relationships to carry a lot of weight. Companies with deeper management chains, and thus more engineers, will need more formalism to determine cross-org interaction and prioritization.

Another approach is to go super broad with your pick-one list, and make it apply to everyone. While this approach disambiguates pretty well between the SEV 1 highest-urgency problems and the SEV 2 urgent-but-not-pants-on-fire ones, it's less good at disambiguating SEV 3 and SEV 4 incidents. Those incidents tend to have only local scope, so local definitions will prevail, meaning only locals will know how to correctly categorize issues.


There are several simple answers for this problem, but each simplification has its own problems. Your job is to pick the problems your org will put up with.

  • How much informal structure can you rely on? The smaller the org, the more one size is likely to fit all.
  • Do you need to interoperate with a separate incident response process, perhaps an acquisition or a parent company?
  • How often do product-local vs global incidents happen? For one product companies, these are the same thing. For companies that are truly multi-product, this distinction matters. The answer here influences how items on your pick-one list are dealt with, and whether incident reporters are likely to file cross-product reports.
  • Does your incident automation platform allow decision support in its reporting workflow? Think of a next, next, next, done wizard where each screen asks clarifying questions. Helpful for folk who are not sure how a given area wants their incidents marked up, less helpful for old hands who know exactly what needs to go in each field. A rough sketch of the idea follows this list.
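Here's that wizard idea as a plain terminal prompt rather than any specific platform's workflow builder. The clarifying questions are my own phrasing, mapped onto the urgency scale from earlier.

    def ask(question: str) -> bool:
        return input(question + " [y/N] ").strip().lower() == "y"

    def suggest_urgency() -> str:
        # Walk the reporter from the scariest condition down to the default.
        if ask("Is a regulated disclosure or investor-notification clock running?"):
            return "Critical"
        if ask("Does this need around-the-clock response?"):
            return "High"
        if ask("Can this ride the normal sprint or quarterly planning process?"):
            return "Planning"
        if ask("Does the fix have long lead times to develop or validate?"):
            return "Low"
        return "Medium"  # business-hours response; can preempt sprint work

    if __name__ == "__main__":
        print("Suggested urgency:", suggest_urgency())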

SysAdmins have no trouble making big lists of what can go wrong and what we're doing to stave that off a little longer. The tricky problem is pushing large organizations to take a harder look at systemic risks and take them seriously. I mean, the big companies have to have disaster recovery (DR) plans for compliance reasons, but there are a lot of differences between box-ticking DR plans and comprehensive DR plans.

Any company big enough to get past the "running out of money is the biggest disaster" phase has probably spent some time thinking about what to do if things go wrong. But how do you, the engineer in the room, get the deciders to think about disasters in productive ways?

The really big disasters are obvious:

  • The datacenter catches fire after a hurricane
  • The Region goes dark due to a major earthquake
  • Pandemic flu means 60% of the office is offline at the same time
  • An engineer or automation accidentally:
    • Drops all the tables in the database
    • Deletes all the objects out of the object store
    • Destroys all the clusters/servlets/pods
    • Deconfigures the VPN
  • The above happens and you find your backups haven't worked in months

All obvious stuff, and building to deal with them will let you tick the box for compliance DR. Cool.

But there are other disasters, the sneaky ones that make you think and take a hard look at process and procedures in a way that the "oops we lost everything of [x] type" disasters generally don't.

  • An attacker subverts your laptop management software (Jamf, Intune, etc.) and pushes a cryptolocker to all employee laptops
  • 30% of your application secrets got exposed through a server side request forgery (SSRF) attack
  • Nefarious personages get access to your continuous integration environment and inject trojans into your dependency chains
  • A key third party, such as your payment processor, gets ransomwared and goes offline for three weeks
  • A Slack/Teams bot got subverted and has been feeding internal data to unauthorized third parties for months

The above are all kinda "security" disasters, and that's my point. SysAdmins sometimes think of these, but even we are guilty of not having the right mental models to rattle them off the tops of our heads when asked. Asking about disasters like the ones in this list should start conversations that generally don't happen. Or you get the bad case: people shrug and say "that's Security's problem, not ours," which is a sign you have a toxic reliability culture.

Security-type disasters have a phase that merely technical disasters lack: how do we restore trust in production systems? In technical disasters, you can start recovery as soon as you've detected the disaster. For security disasters, recovery has to wait until the attacker has been evicted, which can take a while. This security delay means key recovery concepts like Recovery Time and Recovery Point Objectives (RTO/RPO) will be subtly different.
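As a back-of-envelope illustration with invented numbers: the effective recovery time for a security disaster includes an eviction phase that a purely technical disaster skips, which is why the same RTO target means different things for the two.

    # Hypothetical timeline, in hours. A technical disaster can start recovery
    # at detection; a security disaster waits on scoping and attacker eviction.
    detection_to_start = 1
    eviction           = 36
    restore_and_verify = 6

    technical_recovery = detection_to_start + restore_and_verify
    security_recovery  = detection_to_start + eviction + restore_and_verify

    print(f"technical disaster: ~{technical_recovery} hours to recover")
    print(f"security disaster:  ~{security_recovery} hours to recover")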

If you're trying to knock loose some ossified DR thinking, these security type disasters can crack open new opportunities to make your job safer.

24/7 availability and oncall

There is another meme going around OpsTwitter the past few days. This is a familiar refrain in discussions about on-call and quality of life. But the essence is:

If you need 24/7 availability, you also need follow-the-sun support. That way any crisis falls in someone's daytime, regular work day.

I agree; this is the standard you need to judge your solution against. However, this solution has some assumptions baked into it. Here are a few:

  • You have three teams operating 8 timezones apart from their neighbors (or two teams spanning 12)
  • No one set of employment laws spans 24 timezones, so these teams will each be under different labor and national holiday laws.
  • Each timezone needs an on-call rotation.
  • The minimum viable on-call rotation per timezone is 3 people, but 6 is far more friendly to the people supporting the site.
  • For staffing reasons, your global on-call team needs 9 to 18 people on it (or 6 to 12 for a 12-timezone spread); the arithmetic is sketched after this list.
  • Due to the timezone spread, the teams will have minimal coordination with each other. What coordination there is will involve one team being on a video-call at o-dark-thirty.
  • You need enough work to keep 9 to 18 people busy in addition to their fire-watch duties.
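The staffing arithmetic from that list, as a quick sketch; the three-person minimum and six-person comfortable rotation sizes are the assumptions doing all the work here.

    # Rotation sizes per timezone: 3 people is the bare minimum, 6 is humane.
    MIN_PER_ROTATION, COMFY_PER_ROTATION = 3, 6

    for teams, spread in ((3, "8 timezones apart"), (2, "12 timezones apart")):
        low, high = teams * MIN_PER_ROTATION, teams * COMFY_PER_ROTATION
        print(f"{teams} teams ({spread}): {low} to {high} people on the global rotation")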

You know who can pull that off? Really big companies.

You know who can't pull that off? Companies employing in a single labor market, such as the US.

I mean, Guam is a US holding (UTC+10). Theoretically, if you had a team in Guam and a team in New York City (UTC-4) you would have a 14 hour offset between them. You could sort of make this work while staying inside the US tax and legal domains, but you're reliant on the technical talent base of Guam, which has a population a bit smaller than Des Moines, Iowa. Colonialism means people will think about hiring in Ireland or India before Guam. Realistically, to do this you need to go international.

Most smaller companies won't go international; there's way too much paperwork involved at a time when you're supposed to be lean and fast.

I have worked with follow-the-sun exactly once in my career. We had Ops teams in the US East Coast, Poland, and China. It wasn't a true 8/8/8 split, but it was enough of a split that "after hours maintenance" always happened in someone's daytime. It was pretty dang nice. Then we had a layoff round and the Poland office went away. And we fired our Chinese Ops folk to save money, which meant we were waking the US staff up at o-dark-thirty to do maintenance.


I'm conflicted on this advice. On the surface, I totally get the sentiment: keep the annoying shit in everyone's daytime and don't force people to work overnights.

As an industry, we have history with split-shifting and incident response. The night operator used to be a common feature of any company with a computer, the person (or team of people) responsible for loading/unloading tapes, swapping paper for the printer, collating and packaging print-jobs, making sure the batch-jobs ran, calling the SYSOP when things smelled off, and a bunch of other now-forgotten tasks. Most organizations have gotten rid of the night operator for a lot of reasons. The two biggest being:

  1. We've mostly automated the job out of existence. Tapes (if tapes are still in use) are handled by robots. Print-jobs now show up as a PDF in your email. Batch-schedulers are really fancy now, so getting those batch-jobs run is highly automated. Monitoring systems keep track of way more things than we could track in the night operator era.
  2. No one wants to work overnights. Like, no one. At least not enough to easily find a replacement when the one person who does like it decides to leave/retire.

(The second point hit WWU while I was there)

As an industry we no longer have a tradition of doing shift-work. The robust expectation is that we'll have a day-time job and go home in the evenings. If you offer me an overnight job at +30% pay, I'll take it for a while, but I'm still job-hunting for a real daytime job. Not sustainable, which is why on-call is how we're solving the one night operator task we couldn't automate out of existence: incident response.

Everyone needs some way to do incident response, even if they're 15 people with a big idea and a website -- far too small to be doing follow-the-sun rotations. Are they supposed to make it clear that they only guarantee availability during certain hours? I think there are some legs in that idea, but the site will be negatively compared with the site next door that offers 24/7 availability (at the cost of little sleep for their few engineers).

Forcing change to the idea that Ops-type work is always done with a pager attached and unknown extra hours will take a shit-ton of work. Sea changes like that don't happen naturally. We cross-faded from night operators to on-call rotations due to the changing nature of the role: there wasn't enough work to do on the 11pm to 7am shift to keep someone fully occupied, so we tacked those duties onto the 7am-3pm crew (who now work a nicer 9am to 5pm schedule).

The only way to break the need for on-call for Ops-type roles is to stop making availability promises when you're not staffed to support it with people responding as part of their normal working hours. If your support-desk isn't answering the phone, site availability shouldn't be promised.

It's that or unionizing the entire sector.