Back in November I posted about how to categorize your incidents using the pick-one list common across incident automation platforms. In that post I said:
A few organizations go so far as to have a fully separate process for the 'High' and 'Critical' urgencies of events, maybe calling them Disaster Recovery events instead of Incidents. DR events need to be rare, which means that process isn't as well exercised as Incident response. However, a separate process makes it abundantly clear that certain urgencies and scopes require different process overall. More on this in a later blog-post.
This is the later blog post.
The SaaS industry as a whole has been referring to the California Fire Command model (now known as the Incident Command System) for inspiration on handling technical incidents. The basic structure is familiar to any SaaS engineer:
- There is an Incident Commander who is responsible for running the whole thing, including post-incident processes
- There is a Technical Lead who is responsible for the technical response
There may be additional roles depending on organizational specifics:
- A Business Manager who is responsible for the customer-facing response
- A Legal Manager who is responsible for anything to do with legal
- A Security Lead who is responsible for heading security investigations
Again, familiar. But truly large incidents put stress on this model. In a given year, the vast majority of incidents an engineering organization experiences are the grass-fire variety that a team of four can handle in under 30 minutes. What happens when a truly major event strikes?
The example I'm using here is a private information disclosure by a hostile party using a compromised credential. Someone not employed by the company dumped a database they shouldn't have had access to, and that database involved data that requires disclosure in the case of compromise. Given this, we already know some of the workstreams that incident response will be doing once this activity is discovered:
- Investigatory work to determine where else the attacker gained access and to fully define the scope of what leaked
- Locking down the infrastructure to close the holes used by the attacker for the identified access
- Cycling/retiring credentials possibly exposed to the attacker
- Generating and delivering the notifications required by regulation
- Technical remediation work to lock down any exploited code vulnerabilities
An antiseptic list, but a scary one. The moment the company officially notices a breach of private information, legislation worldwide starts timers on when privacy regulators or the public need to be informed. For a profit-driven company, this means admitting fault in public, which is something none of them do lightly given the lawsuits that will result. For publicly traded companies, stockholder notifications will also need to be generated. Incidents like this look very little like the availability SLA-breach SEVs that happen 2-3 times a month in different systems.
Based on the rubric I showed back in November, an incident of this type is of Critical urgency due to the regulated timelines, and will require either Cross-Org or C-level response depending on the size of the company. What's more, the need to figure out where the attacker went blocks later stages of response, so this will be a 24-hour operation that likely runs for several days. No one person can safely stay awake for 4+ days straight.
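To make that concrete, here is a minimal sketch of how a rubric like that could be encoded. Every field name and threshold below is a hypothetical illustration, not the configuration of any real incident platform, and the "small company goes straight to C-level" mapping is an assumption on my part:

```python
# Hypothetical sketch: classifying an incident from pick-one rubric fields.
# None of these names come from a real platform; they only illustrate the idea.
from dataclasses import dataclass


@dataclass
class Incident:
    summary: str
    regulated_deadline: bool      # e.g. breach-notification timers are running
    customer_data_exposed: bool
    org_size: str                 # "small" or "large", a stand-in for company size


def urgency(incident: Incident) -> str:
    """Regulated timelines push an incident straight to Critical."""
    if incident.regulated_deadline:
        return "Critical"
    return "High" if incident.customer_data_exposed else "Medium"


def response_tier(incident: Incident) -> str:
    """Critical incidents need Cross-Org or C-level response, depending on size (assumed mapping)."""
    if urgency(incident) == "Critical":
        return "C-level" if incident.org_size == "small" else "Cross-Org"
    return "Team"


breach = Incident(
    summary="Database dumped via compromised credential",
    regulated_deadline=True,
    customer_data_exposed=True,
    org_size="large",
)
print(urgency(breach), response_tier(breach))   # -> Critical Cross-Org
```

The exact fields matter less than the shape: the regulated deadline alone is enough to force the highest urgency, before anyone has even sized the technical work.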
The Incident Command System defines three types of command structure:
- Single command - where one person is running the whole show
- Unified command - where multiple jurisdictions are involved and need to coordinate; this also provides shift changes by rotating who serves as Operations Chief (what SaaS folk call the Technical Lead)
- Area command - where multiple incidents are part of a larger complex; the Area Commander supports each incident's Incident Commander
Incidents of the scale of our private information breach lean into the Area Command style for a few reasons. First and foremost, there are discrete workstreams that need to be executed by different groups, such as the security review to isolate scope, building regulated notifications, and cycling credentials. All those workstreams need people to run them, and those workstream leads need to report to incident command. That looks a lot like Area Command to me.
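Sketched as data, the shape is the important part: multiple workstream leads reporting up to a rotating set of Area Commanders, with the scoping work blocking the notification work. All role and workstream names below are made up for illustration:

```python
# Hypothetical sketch of an Area Command style structure for this breach.
# Role and workstream names are illustrative only.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Workstream:
    name: str
    lead: str                              # workstream lead reporting to incident command
    blocked_by: List[str] = field(default_factory=list)


@dataclass
class AreaCommand:
    commanders: List[str]                  # 2-4 people rotating to cover 24-hour operations
    workstreams: List[Workstream]

    def ready_to_start(self, done: set) -> List[str]:
        """Workstreams whose blocking dependencies have already completed."""
        return [w.name for w in self.workstreams
                if all(dep in done for dep in w.blocked_by)]


response = AreaCommand(
    commanders=["AC shift 1", "AC shift 2", "AC shift 3"],
    workstreams=[
        Workstream("Scope the attacker's access", lead="Security Lead"),
        Workstream("Lock down infrastructure", lead="Infra Lead"),
        Workstream("Cycle exposed credentials", lead="Platform Lead"),
        Workstream("Regulated notifications", lead="Legal Manager",
                   blocked_by=["Scope the attacker's access"]),
    ],
)

print(response.ready_to_start(done=set()))
# Everything except the notifications workstream can start immediately;
# notifications wait on the scoping work, which is part of why this runs for days.
```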
If your daily incident experience is 4-7 person team responses, how ready are you to be involved in an Area Command style response? Not at all.
If you've been there for years and have seen a few multi-org responses in your time, how ready are you to handle an Area Command style response? Better; you might make a good workstream lead.
One thing the Incident Command System makes clear is that Area Commanders do not have an operational role, meaning they're not involved in the technical remediation. Their job is coordination, logistics, and high-level decision making across response areas. For our pretend SaaS company, a good Area Commander will be:
- Someone who has experience with incidents involving legal response
- Someone who has experience with large security responses, because the most likely incidents of this size are security-related
- Someone who has experience with incidents involving multiple workstreams requiring workstream leaders
- Someone who has experience communicating with C-Levels and has their respect
- Two to four of these people, in order to safely staff a 24-hour response over multiple days
Is your company equipped to handle this scale of response?
In many cases, probably not. Companies handle incidents of this type in a few different ways. As I mentioned in the earlier post, some categorize problems like this as a disaster instead of an incident and invoke a different process. This has the advantage of making it clear that the response for these is different, at the cost of having far fewer people familiar with the response methods. You make up for the lack of in situ, learn-by-doing training by regularly re-certifying key leaders on the process.
Other companies extend the existing incident response process on the fly rather than risk having a separate process that will get stale. This works so long as you have some people around who kind of know what they're doing and can herd others into the right shape. After the second disaster of this scale, though, people will start talking about how to formalize procedures.
Whichever way your company goes, start thinking about this. Unless you're working for the hyperscalers, incidents of this response scope are going to be rare, which means day-to-day incident work won't keep the skills fresh. You need to schedule quarterly time to train, practice, and certify your Area Commanders and workstream leads. This will speed up response overall, because less time will be spent arguing over command and feedback structures.