Incident response programs

Honeycomb had a nice post describing how they dropped a priority list of incident severities in favor of an attribute list. Their list is still a pick-one list, but instead of using a 1-4 SEV scale, they're using a list of types like "ambiguous," "security," and "internal." The post goes into some detail about the problems with a unified list across a large organization, and the different response-level needs of different types of incidents. All very true.

A good incident response program needs to be approachable by anyone in the company, meaning anyone opening an incident should have a reasonable chance of picking the right attributes. The incident automation industry (tools such as PagerDuty's Jeli and the Rootly platform) has settled on a pick-one list for severity, sometimes with support for additional fields. Unless a company is looking to home-build its own incident automation for creating Slack channels, managing the post-incident review process, and tracking remediation action items, these de facto conventions constrain the options available to an incident response program.

As Honeycomb pointed out, there are two axes that need to be captured by "severity": urgency and level of response. I propose the following pair of attributes:

Urgency

  1. Planning: the problem can be addressed through normal sprint or quarterly planning processes.
  2. Low: the problem has long lead times to either develop or validate the solution, where higher urgency would result in a lot of human resources stuck in wait loops.
  3. Medium: the problem can be addressed in regular business-hours operations; waiting overnight or over a weekend won't make things worse. Can preempt sprint-level deliverable targets without question.
  4. High: the problem needs around-the-clock response and can preempt quarterly deliverable targets without question.
  5. Critical: the problem requires investor notification or other regulated public disclosure, and likely affects annual planning. Rare by definition.

Level of response

  1. Individual: The person who broke it can revert/fix it without much effort, and impact blast-radius is limited to one team. Post-incident review may not be needed beyond the team level.
  2. Team: A single team can manage the full response, such as an issue with a single service. Impact blast radius is likely one team. Post-incident review at the peer-team level.
  3. Peer team: A group of teams in the same department are involved in response due to interdependencies or the nature of the event. Impact blast-radius is clearly multi-team. Post-incident review at the peer-team level, and higher up the org-chart if the management chain is deep enough for it.
  4. Cross-org: Major incident territory, where the issue cuts across more than one functional group. These are rare. Impact blast-radius may be whole-company, but likely whole-product. Post-incident review will be global.
  5. C-level: A C-level executive needs to run it because the response is whole-company in scope. Will involve multiple post-incident reviews.

Is Private? Yes/No - If yes, only the people involved in the response are notified of the incident and its updates. Useful for Security- and Compliance-type incidents, where broad discoverability is actively harmful. Some incidents qualify as Material Non-Public Information, which matters to publicly traded companies.
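
To make the shape of this concrete, here is a minimal sketch of the two attributes plus the privacy flag as a data model, in Python. The class and field names are my own illustration and don't map to any particular incident tool's schema.

```python
from dataclasses import dataclass
from enum import Enum


class Urgency(Enum):
    PLANNING = 1  # handled through normal sprint/quarterly planning
    LOW = 2       # long lead times; more urgency would just create wait loops
    MEDIUM = 3    # business-hours response; can preempt sprint-level targets
    HIGH = 4      # around-the-clock response; can preempt quarterly targets
    CRITICAL = 5  # investor notification / regulated disclosure territory


class ResponseLevel(Enum):
    INDIVIDUAL = 1  # the person who broke it can revert it
    TEAM = 2        # a single team manages the full response
    PEER_TEAM = 3   # several teams in one department coordinate
    CROSS_ORG = 4   # major incident; cuts across functional groups
    C_LEVEL = 5     # whole-company response run by an executive


@dataclass
class Incident:
    title: str
    urgency: Urgency
    response_level: ResponseLevel
    is_private: bool = False  # restrict visibility to responders (security, MNPI)
```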

The combinatorics give 5 × 5 = 25 pairs, or 50 if you include Is Private, which makes for an unwieldy pick-one list. However, like stellar types, there is a kind of main sequence of pairs that are more common, with problematic outliers that make simple solutions a troublesome fit. Let's look at a few pairs that are on the main sequence of event types:

  • Planning + Individual: Probably a feature-flag had to be rolled back real quick. Spend some time digging into the case. Incidents like this sometimes get classified "bug" instead of "incident."
  • Low + Team: Such as a Business Intelligence failure, where revenue attribution was discovered to be incorrect for a new feature, and time is needed to back-correct issues and validate against expectations. May also be classified as "bug" instead of "incident."
  • Medium + Team: Probably the most common incident type that doesn't get classified as a "bug," these are the highway-verge grass fires of the incident world: small in scope, over quickly, and one team can deal with them.
  • Medium + Peer Team: Much like the previous but involving more systems in scope. Likely requires coordinated response between multiple teams to reach a solution. These teams work together a lot, by definition, so it would be a professional and quick response.
  • High + Cross-org: A platform system had a failure that affected how application code responds to platform outages, leading to a complex, multi-org response. The response might include renegotiating SLAs between platform and customer-facing systems. Remediating the Log4J vulnerability, which requires touching every usage of Java in the company, inclusive of vendored usage, also counts as this kind of incident.
  • Critical + Cross-org: An event like the Log4J vulnerability, and the Security org has evidence that security probes found something. The same remediation response as the previous, but with added "reestablish trust in the system" work on top of it, and working on regulated customer notices.
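
If this model were wired into a reporting form, the main-sequence pairs could be offered as one-click presets while the full matrix stays available for the outliers. A hypothetical sketch in Python; the preset keys and descriptions are just my shorthand for the bullets above, not anything a vendor ships:

```python
# Illustrative presets: common (urgency, response level) pairs offered as
# one-click choices in a reporting form. The full 5x5 matrix stays reachable
# through the two separate fields for everything else.
MAIN_SEQUENCE_PRESETS = {
    ("planning", "individual"): "Quick feature-flag revert; often filed as a bug",
    ("low", "team"): "Slow-burn data correction with long validation lead times",
    ("medium", "team"): "Routine single-team incident",
    ("medium", "peer team"): "Coordinated response across closely related teams",
    ("high", "cross-org"): "Platform failure or Log4J-style remediation",
    ("critical", "cross-org"): "Active exploitation plus regulated customer notices",
}
```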

Six of 25 combinations. But some of the others are still viable, even if they don't look plausible on the surface. Let's look at a few:

  • Critical + Team: A bug is found in SOX reporting that suggests incorrect data was reported to stock-holders. While the C-levels are interested, they're not in the response loop beyond the 'stakeholder' role and being the signature that stock-holder communications will be issued under.
  • Low + Cross-org: Rapid retirement of a deprecated platform system, forcing the teams still using the old system to crash-migrate to the new one.
  • Planning + Cross-org: The decision to retire a platform system is made as part of an incident, and migrations are inserted into regular planning.

How is an organization supposed to build a usable pick-one list from this mess? This is hard work!

Some organizations solve this by bucketing incidents using another field, and allowing the pick-one list to mean different things based on what that other field says. A Security SEV1 gets a different scale of response than a Revenue SEV1, which in turn gets a different type of response than an Availability SEV1. Systems like this have problems with incidents that cross buckets, such as a Security issue that also affects Availability. It's for this reason that Honeycomb has an 'ambiguous' bucket.
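
To illustrate why the bucket approach gets awkward, imagine the severity scale resolving to different response expectations per bucket. A minimal sketch, with made-up bucket names and expectations rather than anyone's real policy:

```python
# Illustrative only: the same SEV level implies a different response
# depending on the bucket the reporter picks.
RESPONSE_MATRIX = {
    ("security", 1): "CISO-led response, legal on standby, private channel",
    ("revenue", 1): "Finance and engineering leads paged, exec stakeholders looped in",
    ("availability", 1): "Full on-call escalation, public status page update",
    # A security issue that also takes the product down straddles two rows,
    # which is how "ambiguous" buckets end up existing.
    ("ambiguous", 1): "Escalate to a human to decide which playbooks apply",
}


def expected_response(bucket: str, sev: int) -> str:
    """Look up the response expectation for a (bucket, SEV) pair."""
    return RESPONSE_MATRIX.get((bucket, sev), "undefined: time to amend the matrix")
```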

A few organizations go so far as to have a fully separate process for the 'High' and 'Critical' urgencies of events, maybe calling them Disaster Recovery events instead of Incidents. DR events need to be rare, which means that process isn't as well exercised as Incident response. However, a separate process makes it abundantly clear that certain urgencies and scopes require different process overall. More on this in a later blog-post.

Other orgs handle the outlier problem differently, taking those events out of incidents and into another process altogether. Longer-flow problems (the Low urgency above) get called something like a Code Yellow, after the Google practice, while Critical + C-level events get a Code Red to handle big, long-running problems.

Honeycomb took the bucketing idea one step further and dropped urgency and level of response entirely, focusing instead on incident type. A process like this still needs ways to manage urgency and response-scope differences, but this is being handled at a layer below incident automation. In my opinion, a setup like this works best when Engineering is around Dunbar's Number or less in size, allowing informal relationships to carry a lot of weight. Companies with deeper management chains, and thus more engineers, will need more formalism to determine cross-org interaction and prioritization.

Another approach is to go super broad with your pick-one list and make it apply to everyone. While this approach disambiguates pretty well between SEV 1 (highest urgency) and SEV 2 (urgent, but not pants-on-fire urgent), it's less good at disambiguating SEV 3 and SEV 4 incidents. Those incidents tend to have only local scope, so local definitions will prevail, meaning only locals will know how to correctly categorize issues.


There are several simple answers to this problem, but each simplification has its own problems. Your job is to pick the problems your org will put up with.

  • How much informal structure can you rely on? The smaller the org, the more one size is likely to fit all.
  • Do you need to interoperate with a separate incident response process, perhaps an acquisition or a parent company?
  • How often do product-local vs global incidents happen? For one-product companies, these are the same thing. For companies that are truly multi-product, this distinction matters. The answer here influences how items on your pick-one list are dealt with, and whether incident reporters are likely to file cross-product reports.
  • Does your incident automation platform allow decision support in its reporting workflow? Think of a next, next, next, done wizard, where each screen asks clarifying questions. Helpful for folks who are not sure how a given area wants their incidents marked up, less helpful for old hands who know exactly what needs to go in each field. A rough sketch of what such a flow could compute follows below.
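
A decision-support flow along those lines doesn't need to be elaborate; a couple of clarifying questions can land a reporter on a sensible attribute pair. A minimal sketch of the idea in Python, with questions and thresholds invented for illustration rather than taken from any vendor's workflow:

```python
def suggest_attributes(
    can_wait_overnight: bool,
    needs_regulated_disclosure: bool,
    teams_affected: int,
) -> tuple[str, str]:
    """Map a few wizard answers onto a suggested (urgency, response level) pair."""
    if needs_regulated_disclosure:
        urgency = "critical"
    elif not can_wait_overnight:
        urgency = "high"
    else:
        urgency = "medium"

    if teams_affected <= 1:
        level = "team"
    elif teams_affected <= 3:
        level = "peer team"
    else:
        level = "cross-org"

    # A suggestion for the reporter to confirm, not an automatic verdict.
    return urgency, level
```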