The Department of Government Efficiency, Musk's vehicle, made news by "discovering" the General Services Administration uses tapes, and plans to save $1M by switching to something else (disks, or cloud-based storage). Long-time readers of this blog may remember I used to talk a lot about storage and tape backup. Guess it's time to get my antique Storage Nerd hat out of the closet (this is my first storage post since 2013) to explain why tape is still relevant in an era of 400Gb backbone networks and 30TB SMR disks.

The SaaS revolution has utterly transformed the office automation space. The job I had in 2005, in the early years of this blog, now only exists in small pockets. So many office systems have been SaaSified that the old problems I used to blog about around backups and storage tech are much less pressing in the modern era. Where those problems remain are places with decades of old file data, starting in the mid-to-late 1980s, that is still being hauled around. Even when I was still doing this work in the late 2000s, the needle was shifting to large arrays of cheap disks replacing tape arrays.

Where you still see tape in use is in offices with policies requiring "off-site" or "offline" storage of key office data. A lot of that is also done on disk these days, but some offices have kept their tape libraries. I suspect a lot of what DoGE found was in this category of offices retaining tape infrastructure. Is disk cheaper here? Marginally; the true savings will be much less than the $1M headline rate.

But there is another area where tape continues to be the economical option, and it's another area DoGE is going to run into: large scientific datasets.

To explain why, I want to use a contrasting example: A vacation picture you took on an iPhone in 2011, put into Dropbox, shared twice, and haven't looked at in 14 years. That file has followed you to new laptops and phones, unseen, unloved, but available. A lot goes into making sure it's available.

All the big object-stores like S3, and file-sync-and-share services (like Dropbox, Box, MS live, Google Drive, Proton Drive, etc) use a common architecture, because that architecture has proven reliable at avoiding visible data-loss:

  • Every uploaded file is split into 4KB blocks (the size is relevant to disk technology, which I'm not going into here)
  • Each block is written between 3 and 7 times to disk in a given datacenter or region, the exact replication factor changes based on service and internal realities
  • Each block is replicated to more than one geographic region as a disaster resilience move, generally at least 2, often 3 or more

The end result of the above is that the 1MB vacation picture is written to disk 6 to 14 different times. The nice thing about the above is you can lose an entire rack-row of a datacenter and not lose data; you might lose 2 of your 5 copies of a given block, but you have 3 left to rebuild, and your other region still has full copies.

But I mentioned this 1MB file has been kept online for 14 years. Assuming an average disk life-span of 5 years, each block has been migrated to new hardware 3 times in those years. That means each 4KB block of that file has been resident on between 24 and 56 hard drives; or more, if your provider replicates to more than 2 discrete geographic regions. Those drives have been spinning and using power (and therefore requiring cooling) the entire time.
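Here's a back-of-the-envelope sketch of that arithmetic in Python; the block size, replication factors, and migration count come from the description above, and everything else is illustrative:

    import math

    FILE_SIZE_KB = 1024      # the 1MB vacation photo
    BLOCK_SIZE_KB = 4
    MIGRATIONS = 3           # hardware refreshes over 14 years at a ~5-year drive lifespan

    blocks = math.ceil(FILE_SIZE_KB / BLOCK_SIZE_KB)    # 256 blocks

    for copies_per_region, regions in [(3, 2), (7, 2)]:
        live_copies = copies_per_region * regions        # 6 to 14 copies on disk at once
        drives_touched = live_copies * (1 + MIGRATIONS)  # 24 to 56 drives over the years
        print(f"{blocks} blocks, {live_copies} live copies, "
              f"~{drives_touched} drives touched per block")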

These systems go to all of this effort because they need every file to be available all the time, when you need it, where you need it, as fast as possible. If a person in that vacation photo retires, and you suddenly need that picture for the Retirement Montage at their going away party, you don't want to wait hours for it to come off tape. You want it now.

Contrast this to a scientific dataset. Once the data has stopped being used for Science! it can safely be archived until someone else needs to use it. This is the use-case behind AWS S3 Glacier: you pay a lot less for storing data, so long as you're willing to accept delays measurable in hours before you can access it. This is also the use-case where tape shines.

A lab gets done chewing on a dataset sized at 100TB, which is pretty chonky for 2011. They send it to cold storage. Their IT section dutifully copies the 100TB dataset onto LTO-5 tapes at 1.5TB per tape, for a stack of 67 tapes, and removes the dataset from their disk-based storage arrays.

Time passes, as with the Dropbox-style data. LTO drives can read media from one or two generations prior. Assuming the lab IT section keeps up on tape technology, it would be the advent of LTO-7 in 2015 that prompts a great restore-and-rearchive effort of all LTO-5 and earlier media. LTO-7 can do 6TB per tape, for a much smaller stack of 17 tapes.

LTO-8 changed this, with only a one-generation lookback. So when LTO-8 comes out in 2017 with a 12TB capacity, a restore/rearchive effort runs again, shrinking our stack of tapes from 17 to 9. LTO-9 comes out in 2021 with 18TB per tape, and that stack reduces to 6 tapes to hold 100TB.
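The tape counts above are just ceiling division against native (uncompressed) LTO capacities; a quick sketch:

    import math

    DATASET_TB = 100
    LTO_NATIVE_TB = {"LTO-5": 1.5, "LTO-7": 6, "LTO-8": 12, "LTO-9": 18}

    for generation, capacity_tb in LTO_NATIVE_TB.items():
        tapes = math.ceil(DATASET_TB / capacity_tb)
        print(f"{generation}: {tapes} tapes")   # 67, 17, 9, 6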

All in all, our cold dataset had to relocate to new media three times, same as the disk-based stuff. However, keeping stacks of tape in a climate-controlled room is vastly cheaper than a room of powered, spinning disk. The actual reality is somewhat different: the few data-archive people I know mention they do great restore/rearchive runs about every 8 to 10 years, largely driven by changes in drive connectivity (SCSI, SATA, FibreChannel, Infiniband, SAS, etc), OS and software support, and corporate purchasing cycles. Keeping old drives around for as long as possible is fiscally smart, so the true number of recopy events for our example dataset is likely one.

So another lab wants to use that dataset and puts in a request. A day later, the data is on a disk-array for usage. Done. Carrying costs for that data in the intervening 14 years are significantly lower than the always available model of S3 and Dropbox.

Tape: still quite useful in the right contexts.

Applied risk management

I've been in the tech industry for an uncomfortable number of years, and I've been doing resilience planning the whole time. You know, when and how often to take backups, segueing into worrying about power diversity, things like that. My last two years at Dropbox gave me exposure to how that works when you have multiple datacenters. It gets complex, and there are enough moving parts that you can actually build models of expected failure rates in a given year to help you prioritize remediation and prep work.

Meanwhile, everyone in the tech-disaster industry peeks over the shoulders of environmental disaster recoveries like hurricanes and earthquakes. You can learn a lot by watching the pros. I've talked about some of what we learned before; mostly it has been procedural in nature.

Since then, the United States elected a guy who wants to be dictator, and a Congress that seems willing to let it happen. For those of us in the disliked minority of the moment, we're facing concerted efforts to roll back our ability to exist in public. That's risk. Below the fold I talk about how I apply the techniques I learned from IT risk management to assess my own risks. It turns out a risk model for "dictatorship in America" can't rely on prior art the way a risk model for "datacenter going offline" can; the latter has plenty of prior art, and even incident rates, to factor in.

Blog Questions Challenge 2025

Thanks to Ben Cotton for sharing.

Why did you start blogging in the first place?

I covered a lot of that in 20 years of this nonsense from about a year ago. The quick version is I was charged with creating a "Static pages from your NetWare home directory" project and needed something to test with, so here we are. That version was done with Blogger before the Google acquisition, when they still supported publish-by-ftp (which I also had to set up as part of the same project).

What platform are you using to manage your blog, and why do you use it?

When Blogger got rid of the publish-by-ftp method, I had to move. I came to my own domain and went looking for blogging software. On advice from an author I like, I kept the slashdot effect in mind, so I wanted to be sure that an order of magnitude more traffic for an article wouldn't melt the server it was on. So I wanted something relatively lightweight, which at the time was Movable Type. Wordpress required database hits for every page view, which didn't seem to scale.

I stuck with it because Movable Type continues to do the job quite well, and remains ergonomic for me. I turned off comments a while ago, as comment spam was a nightmare that needed constant attention to keep under control. Movable Type now requires a thousand dollars a year for a subscription, which pencils out to about $125 per blog post at my current posting rate. Not worth it.

Have you blogged on other platforms before?

Like just about everyone my age, I was on Livejournal. I don't remember if this blog or LJ came first, and I'm not going to go check. I had another blog on Blogger for a while, about local politics. It has been lost to time, though it is still visible on archive.org if you know where to look for it.

How do you write your posts?

Most are spur of the moment. I have a topic, and time, and remember I can be long-form about it. Once in a while I'll get into something on social media and realize I need actual wordcount to do it justice, so I do it here instead. The advent of twitter absolutely slowed down my posting rate here!

Once I have the words in, I schedule a post for a few days hence.

When do you feel most inspired to write?

As with all writers, it comes when it comes. Sometimes I set out goals and I stick to them. But blogging hasn't been a focus of mine for a long time, so it's entirely whim. I do know I need an hour or so of mostly uninterrupted time to get my thoughts in order, which is hard to come by without arranging for it.

Do you normally publish immediately after writing, or do you let it simmer a bit?

As mentioned above, I use scheduled-post. Typically for 9am, unless I've got something spicy and don't care. That's rare; I've learned that posting spicy takes absolutely needs a cooling-off period. I've pulled posts after writing them because I realized they didn't actually need to get posted; I merely needed to write them.

What's your favorite post on your blog?

That's changed a lot over the years as I've changed.

  • For a long time, I was proud of my Know your IO series from 2010. That was prompted by a drop-by conversation with one of our student workers who had a question about storage technology. I infodumped for most of an hour, and realized I had a blog series. This is still linked from my sidebar on the right.
  • From recent history, the post Why I don't like Markdown in a git repo as documentation is a still-accurate distillation of why I seriously dislike this reflexive answer to workplace knowledge sharing.
  • This post about the lost history of why you wait for the first service pack before deploying anything is me bringing old-timer points of view to newer audiences. The experiences in this post are drawn directly from where I was working in 2014-2015. Yes, Virginia, people still do ship shrink-wrapped software for Enterprise distros. Some of you are painfully aware of this.

I'm not stopping blogging any time soon. At some point the dependency chain for Movable Type will rot and I'll have to port to something else, probably a static site generator. I believe I'm spoiled for choice in that domain.

Back in November I posted about how to categorize your incidents using the pick-one list common across incident automation platforms. In that post I said:

A few organizations go so far as to have a fully separate process for the 'High' and 'Critical' urgencies of events, maybe calling them Disaster Recovery events instead of Incidents. DR events need to be rare, which means that process isn't as well exercised as Incident response. However, a separate process makes it abundantly clear that certain urgencies and scopes require different process overall. More on this in a later blog-post.

This is the later blog post.

The SaaS industry as a whole has been referring to the California Fire Command (now known as the Incident Command System) model for inspiration on handling technical incidents. The basic structure is familiar to any SaaS engineer:

  • There is an Incident Commander who is responsible for running the whole thing, including post-incident processes
  • There is a Technical Lead who is responsible for the technical response

There may be additional roles available depending on organizational specifics:

  • A Business Manager who is responsible for the customer-facing response
  • A Legal Manager who is responsible for anything to do with legal
  • A Security Lead who is responsible for heading security investigations

Again, familiar. But truly large incidents put stress on this model. In a given year, the vast majority of incidents experienced by an engineering organization will be the grass-fire variety that can be handled by a team of four people in under 30 minutes. What happens when a truly major event hits?

The example I'm using here is a private information disclosure by a hostile party using a compromised credential. Someone not employed by the company dumped a database they shouldn't have had access to, and that database involved data that requires disclosure in the case of compromise. Given this, we already know some of the workstreams that incident response will be doing once this activity is discovered:

  • Investigatory work to determine where else the attacker got access and to fully define the scope of what leaked
  • Locking down the infrastructure to close the holes used by the attacker for the identified access
  • Cycling/retiring credentials possibly exposed to the attacker
  • Regulated notification generation and execution
  • Technical remediation work to lock down any exploited code vulnerabilities

An antiseptic list, but a scary one. The moment the company officially notices a breach of private information, legislation world-wide starts timers on when privacy regulators or the public need to be informed. For a profit-driven company, this is admitting fault in public, which is something none of them do lightly due to the lawsuits that will result. For publicly traded companies, stockholder notification will also need to be generated. Incidents like this look very little like the availability SLA-breach SEVs that happen 2-3 times a month in different systems.

Based on the rubric I showed back in November, an incident of this type is of Critical urgency due to regulated timelines, and will require either Cross-org or C-level response depending on the size of the company. What's more, the need to figure out where the attacker went blocks later stages of response, so this response will actually be a 24-hour operation and likely run several days. No one person can safely stay awake for 4+ days straight.

The Incident Command Process defines three types of command structure:

  • Solitary command - where one person is running the whole show
  • Unified command - where multiple jurisdictions are involved and they need to coordinate, and also to provide shift changes through rotating who is the Operations Chief (what SaaS folk call the Technical Lead)
  • Area command - where multiple incidents are part of a larger complex, the Area Commander supports each Incident Command

Incidents of the scale of our private information breach lean into the Area Command style for a few reasons. First and foremost, there are discrete workstreams that need to be executed by different groups, such as the security review to isolate scope, building regulated notifications, and cycling credentials. All those workstreams need people to run them, and those workstream leads need to report to incident command. That looks a lot like Area Command to me.

If your daily incident experience is 4-7 person team responses, how ready are you to be involved in an Area Command style response? Not at all.

If you've been there for years and have seen a few multi-org responses in your time, how ready are you to handle Area Command style response? Better, you might be a good workstream lead.

One thing the Incident Command Process makes clear is that Area Commanders do not have an operational role, meaning they're not involved in the technical remediation. Their job is coordination, logistics, and high-level decision making across response areas. For our pretend SaaS company, a good Area Commander will be:

  • Someone who has experience with incidents involving legal response
  • Someone who has experience with large security response, because the most likely incidents of this size are security related
  • Someone who has experience with incidents involving multiple workstreams requiring workstream leaders
  • Someone who has experience communicating with C-Levels and has their respect
  • And you need two to four such people in order to safely staff a 24-hour response for multiple days

Is your company equipped to handle this scale of response?

In many cases, probably not. Companies handle incidents of this type a few different ways. As I mentioned in the earlier post, some categorize problems like this as a disaster instead of an incident and invoke a different process. This has the advantage of making it clear the response for these is different, at the cost of having far fewer people familiar with the response methods. You make up for the lack of in-situ, learn-by-doing training by regularly re-certifying key leaders on the process.

Other companies extend the existing incident response process on the fly rather than risk having a separate process that will get stale. This works so long as you have some people around who kind of know what they're doing and can herd others into the right shape. Though, after the second disaster of this scale, people will start talking about how to formalize procedures.

Whichever way your company goes, start thinking about this. Unless you're working for the hyperscalers, incidents of this response scope are going to be rare. This means you need to schedule quarterly time to train, practice, and certify your Area Commanders and workstream leads. This will speed up response time overall, because less time will be spent arguing over command and feedback structures.

"Columnar databases store data in columns, not rows," says the definition. I made a passing reference to the technology in Software Telemetry, but didn't spend any time on what they are and how they can help telemetry and observability. Over the last six months I worked on converting a centralized logging flow based on Elasticsearch, Kibana, and Logstash, to one based on Logstash and Apache Spark. For this article, I'll be using examples from both methods to illustrate what columnar databases let you do in telemetry systems.

How Elasticsearch is columnar (or not)

First of all, Elasticsearch isn't exactly columnar, but it can fake it to a point. You use Elasticsearch when you need full indexing and tokenization of every field in order to accelerate query-time performance. Born as it was in the early part of the 2010s, Elasticsearch accepts ingestion-side complexity in order to optimize read-side performance. There is a reason that if you have a search field in your app, there is a good chance Elasticsearch or OpenSearch is involved in the business logic. While Elasticsearch is "schema-less," schema still matters, and there are clear limits to how many fields you can add to an Elasticsearch index.

Each Elasticsearch index or datastream has defined fields. Fields can be defined at index/datastream creation, or configured to auto-create on first use. Both are quite handy in telemetry contexts. Each document in an index or datastream has a reference for every defined field, even if the contents of that field are null. If you have 30K fields and one document has only 19 fields defined, the rest will still exist on the document but be nulled, which in turn makes that 19-field document rather larger than the same document in an index/datastream with only 300 defined fields.
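As a rough illustration, here is how an index with a few pre-defined fields plus auto-created ("dynamic") fields might be set up against the Elasticsearch REST API; the index name and field names here are hypothetical:

    import requests

    ES = "http://localhost:9200"        # assumed local cluster
    INDEX = "app-logs-000001"           # hypothetical index name

    mapping = {
        "mappings": {
            "dynamic": True,            # fields not listed below auto-create on first use
            "properties": {
                "@timestamp": {"type": "date"},
                "message":    {"type": "text"},     # tokenized for full-text search
                "service":    {"type": "keyword"},  # exact-match field
            },
        }
    }
    requests.put(f"{ES}/{INDEX}", json=mapping).raise_for_status()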

Larger average document size slows down search for everything in general, due to the number and size of field-indexes the system has to keep track of. This also balloons index/datastream size, which has operational impacts when it comes to routine operations like patching and maintenance. As I mentioned in Software Telemetry, Elasticsearch's cardinality problem manifests in number of fields, not in unique values in each field.

If you are willing to complicate your ingestion pipeline, carefully crafting telemetry shape and ingesting into multiple index/datastreams so that similarly shaped telemetry is bucketed together, you can mitigate some of the above problems. Create an alias to use as your search endpoint, and populate the alias with the index/datastreams of your various buckets. Elasticsearch is smart enough to know where to search, which lets you bucket your field-count cardinality problems in ways that will perform faster and save space. However, this is clearly adding complexity that you have to manage yourself.
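A minimal sketch of that alias trick, again with hypothetical index names, using the _aliases endpoint:

    import requests

    ES = "http://localhost:9200"   # assumed local cluster

    actions = {
        "actions": [
            {"add": {"index": "logs-nginx-2024.06",  "alias": "logs-search"}},
            {"add": {"index": "logs-app-2024.06",    "alias": "logs-search"}},
            {"add": {"index": "logs-worker-2024.06", "alias": "logs-search"}},
        ]
    }
    requests.post(f"{ES}/_aliases", json=actions).raise_for_status()
    # Searches against /logs-search/_search now fan out to every backing index.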

How Apache Spark is columnar

Spark is pretty clearly columnar, which is why it's the de facto platform of choice for Business Intelligence operations. You know, telemetry for business ops rather than engineering ops. A table defined in Spark (and most of its backing stores, like Parquet or Hive) can have arbitrary columns defined in it. Data for each column is stored in separate files, which means a query like the following, building a histogram of log-entries per hour, "COUNT timestamp GROUP BY hour(timestamp)", is extremely efficient: the system only needs to look at a single file out of thousands.
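In PySpark, that histogram might look like the sketch below (the dataset path and column name are hypothetical); only the timestamp column's data actually gets read:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import hour

    spark = SparkSession.builder.appName("log-histogram").getOrCreate()
    logs = spark.read.parquet("s3://example-bucket/centralized-logs/")  # assumed layout

    histogram = (
        logs.groupBy(hour("timestamp").alias("hour"))
            .count()
            .orderBy("hour")
    )
    histogram.show(24)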

Columnar databases have to do quite a bit of read-time and ingestion-time optimization to truly perform fast, which demonstrates some of the tradeoffs of the style. Where Elasticsearch trades ingestion-time complexity to speed up read-time performance, columnar databases tilt the needle more towards read-time complexity in order to optimize overall resource usage. In short, columnar databases have better scaling profiles than something like Elasticsearch, but as a result of the changed priorities they don't query as fast. This is a far easier trade-off to make in 2024 than it was in 2014!

Columnar databases also don't tokenize the way Elasticsearch does. Have a free-text field that you want to do sub-string searches on? Elasticsearch is built from the bolts out to make that search as fast as possible. Columnar databases, on the other hand, do all of the string walking and searching at query-time instead of pulling the values out of some b-trees.

Where Elasticsearch suffers performance issues when field-count rises, Spark only encounters this problem if the query is designed to encounter it through use of "select *" or similar constructs. The files hit by the query will only be the ones for columns referenced in the query! Have a table with 30K columns in it? So long as you query right, it should perform quite well; the 19-defined-fields-in-a-row problem shouldn't be a problem so long as you're only referencing one of those 19 fields/columns.

Why columnar is neat

A good centralized logging system can stand in for both metrics and traces, and in large part can do so because the backing databases for centralized logging are often columnar or columnar-like. There is nothing stopping you from creating metric_name and metric_value fields in your logging system, and building a bunch of metrics-type queries using those rows.
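For instance, a metrics-style query over log rows carrying those fields might look like this sketch (field names like metric_name and request_duration_ms are illustrative, and percentile_approx assumes a reasonably recent Spark):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    logs = spark.read.parquet("s3://example-bucket/centralized-logs/")  # assumed layout

    # p95 latency per hour, computed straight from log rows
    p95_by_hour = (
        logs.where(F.col("metric_name") == "request_duration_ms")
            .groupBy(F.window("timestamp", "1 hour"))
            .agg(F.percentile_approx("metric_value", 0.95).alias("p95_ms"))
    )
    p95_by_hour.show(truncate=False)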

As for emulating tracing, this isn't done through OpenTelemetry; this is done old-school, through hacking. Chapter 5 in Software Telemetry covered how the Presentation Stage uses correlation identifiers:

"A correlation identifier is a string or number that uniquely identifies that specific execution or workflow."

Correlation identifiers allow you to build the charts that tracing systems like Jaeger, Tempo, and Honeycomb are known for. There is nothing stopping you from creating an array-of-strings field named "span_id" where you dump the span-stack for each log-line. Want to see all the logs for a given Span? Here you are. Given a sophisticated enough visualization engine, you can even emulate the waterfall diagrams in dedicated tracing platforms.
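Pulling "all the logs for a given span" out of such a field is then a single filter; a sketch, with a made-up span identifier:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    logs = spark.read.parquet("s3://example-bucket/centralized-logs/")  # assumed layout

    span = "4fd0b6131f19f39a"  # hypothetical span identifier
    (logs.where(F.array_contains("span_id", span))
         .orderBy("timestamp")
         .show(truncate=False))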

The reason we haven't used columnar databases for metrics systems has to do with cost. If you're willing to accept cardinality limits, a purpose-built metrics system can store a far greater number of metrics for the same amount of money than a columnar database can. However, the biggest companies are already using columnar datastores for engineering metrics, and nearly all companies are using columnar for business metrics.

But if you're willing to spend the extra resources to use a columnar-like datastore for metrics, you can start answering questions like "how many 5xx response-codes did accounts with the Iridium subscription encounter on October 19th?" Traditional metrics systems would consider subscription-type too highly cardinal, where columnar databases shrug and move on.
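That question maps onto one short query; a sketch in Spark SQL, with made-up column names and date:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.read.parquet("s3://example-bucket/centralized-logs/").createOrReplaceTempView("logs")

    spark.sql("""
        SELECT count(*) AS iridium_5xx
        FROM logs
        WHERE status_code BETWEEN 500 AND 599
          AND subscription_tier = 'Iridium'
          AND to_date(timestamp) = DATE '2024-10-19'
    """).show()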

What this means for the future of telemetry and observability

Telemetry over the last 60 years of computing has gone from digging through the SYSLOG printout from one of your two servers, to digging through /var/log/syslog, to the creation of dedicated metrics systems, to the creation of tracing techniques. Every decade's evolution of telemetry has been constrained by the compute and storage performance envelope available to the average system operator.

  • The 1980s saw the proliferation of multi-server architectures as the old mainframe style went out of fashion, so centralized logging had to involve the network. NFS shares for Syslog.
  • The 1990s saw the first big scale systems recognizable as such by people in 2024, and the beginnings of analytics on engineering data. People started sending their web-logs direct to relational databases, getting out of the "tail and grep" realm and into something that kinda looks like metrics if you squint. Distributed processing got its start here, though hardly recognizable today.
  • The 2000s saw the first bespoke metrics systems and protocols, such as statsd and graphite. This era also saw the SaaS revolution begin, with Splunk being a big name in centralized logging, and NewRelic gaining traction for web-based metrics. Distributed processing got more involved, and at the end of the decade the big companies like Google and Microsoft lived and breathed these systems. Storage was still spinning disk, with some limited SSD usage in niche markets.
  • The 2010s saw the first tracing systems and the SaaS revolution ate a good chunk of the telemetry/observability space. The word observability entered wide usage. Distributed processing ended the decade as the default stance for everything, including storage. Storage bifurcated into bulk (spinning disk) and performance (SSD) tiers greatly reducing cost.

We're part way through the 2020s, and it's already clear to me that columnar databases are probably where telemetry systems are going to end up by the end of the decade. Business intelligence is already using them, so most of our companies have them in our infrastructure already. Barriers to adoption are going to be finding ways to handle the different retention and granularity requirements of what we now call the three pillars of observability:

  • Metrics need visibility going back years, and are aggregated not sampled. Observability systems doing metrics will need to allow multi-year retention somehow.
  • Tracing retention is 100% based on cost and sample-rate, which should improve over the decade.
  • Centralized logging is like tracing in that retention is 100% based on cost. True columnar stores scale more economically than Elasticsearch-style databases, which increases retention. How sample rate affects retention is less clear, and would have to involve some measure of aggregation to remain viable over time.

Having columnar databases at the core allows a convergence of the pillars of observability. How far we get in convergence over the next five years remains to be seen, and I look forward to finding out.

Incident response programs

Honeycomb had a nice post where they describe dropping a priority list of incident severities in favor of an attribute list. Their list is still a pick-one list; but instead of using a 1-4 SEV scale, they're using a list of types like "ambiguous," "security," and "internal." The post goes into some detail about the problems with a unified list across a large organization, and the different response-level needs of different types of incidents. All very true.

A good incident response program needs to be approachable by anyone in the company, meaning anyone looking to open an incident should have reasonable success at picking the attributes right. The incident automation industry, tools such as PagerDuty's Jeli and the Rootly platform, has settled on a pick-one list for severity, sometimes with support for additional fields. Unless a company is looking to home-build their own incident automation for creating Slack channels, managing the post-incident review process, and tracking remediation action items, these de facto conventions constrain the options available to an incident response program.

As Honeycomb pointed out, there are two axes that need to be captured by "severity": urgency, and level of response. I propose the following pair of attributes:

Urgency

  1. Planning: the problem can be addressed through normal sprint or quarterly planning processes.
  2. Low: the problem has long lead times to either develop or validate the solution, where higher urgency would result in a lot of human resources stuck in wait loops.
  3. Medium: the problem can be addressed in regular business-hours operations; waiting overnight or a weekend won't make things worse. Can preempt sprint-level deliverable targets without question.
  4. High: the problem needs around-the-clock response and can preempt quarterly deliverable targets without question.
  5. Critical: the problem requires investor notification or other regulated public disclosure, and likely affects annual planning. Rare by definition.

Level of response

  1. Individual: The person who broke it can revert/fix it without much effort, and impact blast-radius is limited to one team. Post-incident review may not be needed beyond the team level.
  2. Team: A single team can manage the full response, such as an issue with a single service. Impact blast radius is likely one team. Post-incident review at the peer-team level.
  3. Peer team: A group of teams in the same department are involved in response due to interdependencies or the nature of the event. Impact blast-radius is clearly multi-team. Post-incident review at the peer-team level, and higher up the org-chart if the management chain is deep enough for it.
  4. Cross-org: Major incident territory, where the issue cuts across more than one functional group. These are rare. Impact blast-radius may be whole-company, but likely whole-product. Post-incident review will be global.
  5. C-level: A high-level executive needs to run it because the response is whole-company in scope. Will involve multiple post-incident reviews.

Is Private? Yes/No - If yes, only the people involved in the response are notified of the incident and updates. Useful for Security and Compliance type incidents, where discoverability is actually bad. Some incidents qualify as Material Non-Public Information, which matters to publicly traded companies.
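If you were wiring these attributes into home-grown incident tooling, the shape might be something like this sketch (names and structure are mine, not taken from any particular platform):

    from dataclasses import dataclass
    from enum import Enum

    class Urgency(Enum):
        PLANNING = 1
        LOW = 2
        MEDIUM = 3
        HIGH = 4
        CRITICAL = 5

    class ResponseLevel(Enum):
        INDIVIDUAL = 1
        TEAM = 2
        PEER_TEAM = 3
        CROSS_ORG = 4
        C_LEVEL = 5

    @dataclass
    class IncidentClassification:
        urgency: Urgency
        response_level: ResponseLevel
        is_private: bool = False   # restrict visibility for Security/Compliance events

    # The common "grass fire" pairing discussed below:
    grass_fire = IncidentClassification(Urgency.MEDIUM, ResponseLevel.TEAM)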

The combinatorics give 5×5=25 pairs, or 50 if you include Is Private, which makes for an unwieldy pick-one list. However, like stellar types, there is a kind of main sequence of pairs that are more common, with problematic outliers that make simple solutions a troublesome fit. Let's look at a few pairs that are on the main sequence of event types:

  • Planning + Individual: Probably a feature-flag had to be rolled back real quick. Spend some time digging into the cause. Incidents like this sometimes get classified "bug" instead of "incident."
  • Low + Team: Such as a Business Intelligence failure, where revenue attribution was discovered to be incorrect for a new feature, and time is needed to back-correct issues and validate against expectations. May also be classified as "bug" instead of "incident."
  • Medium + Team: Probably the most common incident type that doesn't get classified as a "bug," these are the highway verge grass fires of the incident world; small in scope, over quick, one team can deal with it.
  • Medium + Peer Team: Much like the previous but involving more systems in scope. Likely requires coordinated response between multiple teams to reach a solution. These teams work together a lot, by definition, so it would be a professional and quick response.
  • High + Cross-org: A platform system had a failure that affected how application code responds to platform outages, leading to a complex, multi-org response. Response would include possibly renegotiating SLAs between platform and customer-facing systems. Also, remediating the Log4J vulnerability, which requires touching every usage of Java in the company, inclusive of vendored usage, counts as this kind of incident.
  • Critical + Cross-org: An event like the Log4J vulnerability, and the Security org has evidence that security probes found something. The same remediation response as the previous, but with added "reestablish trust in the system" work on top of it, and working on regulated customer notices.

Six of 25 combinations. But some of the others are still viable, even if they don't look plausible on the surface. Let's look at a few:

  • Critical + Team: A bug is found in SOX reporting that suggests incorrect data was reported to stock-holders. While the C-levels are interested, they're not in the response loop beyond the 'stakeholder' role and being the signature that stock-holder communications will be issued under.
  • Low + Cross-org: Rapid retirement of a deprecated platform system, forcing the teams still using the old system to crash-migrate to the new one.
  • Planning + Cross-org: The decision to retire a platform system is made as part of an incident, and migrations are inserted into regular planning.

How is an organization supposed to build a usable pick-one list from this mess? This is hard work!

Some organizations solve this by bucketing incidents using another field, and allowing the pick-one list to mean different things based on what that other field says. A Security SEV1 gets a different scale of response than a Revenue SEV1, which in turn gets a different type of response than an Availability SEV1. Systems like this have problems with incidents that cross buckets, such as a Security issue that also affects Availability. It's for this reason that Honeycomb has an 'ambiguous' bucket.

A few organizations go so far as to have a fully separate process for the 'High' and 'Critical' urgencies of events, maybe calling them Disaster Recovery events instead of Incidents. DR events need to be rare, which means that process isn't as well exercised as Incident response. However, a separate process makes it abundantly clear that certain urgencies and scopes require different process overall. More on this in a later blog-post.

Other orgs handle the outlier problem differently, taking these events out of incidents and into another process altogether. Longer-flow problems (the Low urgency above) get called something like a Code Yellow, after the Google practice, while the Critical + C-level combination gets a Code Red to handle long-flow, big problems.

Honeycomb took the bucketing idea one step further and dropped urgency and level of response entirely, focusing instead on incident type. A process like this still needs ways to manage urgency and response-scope differences, but this is being handled at a layer below incident automation. In my opinion, a setup like this works best when Engineering is around Dunbar's Number or less in size, allowing informal relationships to carry a lot of weight. Companies with deeper management chains, and thus more engineers, will need more formalism to determine cross-org interaction and prioritization.

Another approach is to go super broad with your pick-one list, and make it apply to everyone. While this approach disambiguates pretty well between the SEV 1 highest-urgency problems and the SEV 2 urgent-but-not-pants-on-fire ones, it's less good at disambiguating SEV 3 and SEV 4 incidents. Those incidents tend to only have local scope, so local definitions will prevail, meaning only locals will know how to correctly categorize issues.


There are several simple answers for this problem, but each simplification has its own problem. Your job is to pick the problems your org will put up with.

  • How much informal structure can you rely on? The smaller the org, the more one size is likely to fit all.
  • Do you need to interoperate with a separate incident response process, perhaps an acquisition or a parent company?
  • How often do product-local vs global incidents happen? For one product companies, these are the same thing. For companies that are truly multi-product, this distinction matters. The answer here influences how items on your pick-one list are dealt with, and whether incident reporters are likely to file cross-product reports.
  • Does your incident automation platform allow decision support in its reporting workflow? Think of a next, next, next, done wizard where each screen asks clarifying questions. Helpful for folk who are not sure how a given area wants their incidents marked up, less helpful for old hands who know exactly what needs to go in each field.

Rust and the Linux kernel

One of the kernel maintainers made social waves by bad-mouthing Rust and the project to rebuild the Linux kernel in Rust. The idea of rebuilding the kernel in "Rust, the memory-safe language" rather than "the C in CVE stands for C/C++" makes a whole lot of sense. However, there is more to a language than how memory-safe it is and whether a well-known engineer calls it a "toy" language.

One of the products offered by my employer is written in Elixir, which is built on top of Erlang. Elixir had an 8-or-so-month period of fame, which is when the decision to write that product was made. We picked Elixir because the Erlang engine gives you a lot of concurrency and async processing relatively easily. And it worked! That product was a beast on relatively little CPU. We had a few cases of 10x usage from customers, and it just scaled up, no muss no fuss.

Where the problems with the product came wasn't in the writing, but in the maintaining and productionizing. Some of the issues we've had over the years, many of which got better as Elixir as an ecosystem matured:

  • The ability to make a repeatable build, needed for CI systems
  • Dependency management in modules
  • Observability ecosystem support, such as OpenTelemetry SDKs
  • Build tooling support usable by our CI systems
  • Maturity of the module ecosystem, meaning we had to DIY certain tasks that our other main product never had to bother with. Or the modules that exist only covered 80% of the use-cases.
  • Managing Erlang VM startup during deploys

My opinion is that the dismissiveness from this particular Linux Kernel Maintainer had to do with this list. The Linux kernel and module ecosystem is massive, with highly complex build processes spanning many organizations, and regression testing frameworks to match. Ecosystem maturity matters way more for CI, regression, and repeatable build problems than language maturity.

Rust has something Elixir never had: durable mindshare. Yeah, the kernel rebuild process has taken many years, and has many years to go. Durable mindshare means that engineers are sticking with it, instead of chasing the next hot new memory safe language.

SysAdmins have no trouble making big lists of what can go wrong and what we're doing to stave that off a little longer. The tricky problem is pushing large organizations to take a harder look at systemic risks and take them seriously. I mean, the big companies have to have disaster recovery (DR) plans for compliance reasons; but there are a lot of differences between box-ticking DR plans and comprehensive DR plans.

Any company big enough to get past the "running out of money is the biggest disaster" phase has probably spent some time thinking about what to do if things go wrong. But how do you, the engineer in the room, get the deciders to think about disasters in productive ways?

The really big disasters are obvious:

  • The datacenter catches fire after a hurricane
  • The Region goes dark due to a major earthquake
  • Pandemic flu means 60% of the office is offline at the same time
  • An engineer or automation accidentally:
    • Drops all the tables in the database
    • Deletes all the objects out of the object store
    • Destroys all the clusters/servlets/pods
    • Deconfigures the VPN
  • The above happens and you find your backups haven't worked in months

All obvious stuff, and building to deal with them will let you tick the box for compliance DR. Cool.

But there are other disasters, the sneaky ones that make you think and take a hard look at process and procedures in a way that the "oops we lost everything of [x] type" disasters generally don't.

  • An attacker subverts your laptop management software (JAMF, InTune, etc) and pushes a cryptolocker to all employee laptops
  • 30% of your application secrets got exposed through a server side request forgery (SSRF) attack
  • Nefarious personages get access to your continuous integration environment and inject trojans into your dependency chains
  • A key third party, such as your payment processor, gets ransomwared and goes offline for three weeks
  • A Slack/Teams bot got subverted and has been feeding internal data to unauthorized third parties for months

The above are all kinda "security" disasters, and that's my point. SysAdmins sometimes think of these, but even we are guilty of not having the right mental models to rattle these off the top of our head when asked. Asking about disasters like this list should start conversations that generally don't happen. Or you get the bad case: people shrug and say "that's Security's problem, not ours," which is a sign you have a toxic reliability culture.

Security-type disasters have a phase that merely technical disasters lack: how do we restore trust in production systems? In technical disasters, you can start recovery as soon as you've detected the disaster. For security disasters recovery has to wait until the attacker has been evicted, which can take a while. This security delay means key recovery concepts like Recovery Time and Recovery Point Objectives (RTO/RPO) will be subtly different.

If you're trying to knock loose some ossified DR thinking, these security type disasters can crack open new opportunities to make your job safer.

I've now spent over a decade teaching how alarms are supposed to work (specific, actionable, with the appropriate urgency) and even wrote a book on how to manage metrics systems. One topic I was repeatedly asked to cover in the book, but declined because the topic is big enough for its own book, is how to do metrics right. The desire for an expert to lay down how to do metrics right comes from a number of directions:

  • No one ever looked at ours in a systematic way and our alerts are terrible [This is asking about alerts, not metrics; but they still were indirectly asking about metrics]
  • We keep having incidents and our metrics aren't helping, how do we make them help?
  • Our teams have so many alarms important ones are getting missed [Again, asking about alerts]
  • We've half assed it, and now we're getting a growth spurt. How do we know what we should be looking for?

People really do conflate alarms/alerts with metrics, so any discussion about "how do we do metrics better" is often a "how do we do alarms better" question in disguise. As for the other two points, where people have been using vibes to pick metrics and that's no longer scaling, we actually do have a whole lot of advice; you have a whole menu of "golden signals" to pick from depending on how your application is shaped.

That's only sort of why I'm writing this.

In the mathematical construct of Site Reliability Engineering, where everything is statistics and numerical analysis, metrics are easy. Track the things that affect availability, regularly triage your metrics to ensure continued relevance, and put human processes into place to make sure you're not burning out your worker-units. But the antiseptic concept of SRE only exists in a few places; the rest of us have to pollute the purity of math with human emotions. Let me explain.

Consider your Incident Management process. There are certain questions that commonly arise when people are doing the post incident reviews:

  • Could we have caught this before release? If so, what sort of pre-release checks should we add to catch this earlier?
  • Did we learn about this from metrics or customers? If customers, what metrics do we need to add to catch this earlier? If metrics, what processes or alarms should we tune to catch this earlier?
  • Could we have caught this before the feature flag rolled out to the Emerald users? Do we need to tune the alarm thresholds to catch issues like this in groups with less feature-usage before the high value customers on Emerald plans?

And so on. Note that each question asks about refining or adding metrics. Emotionally, metrics represent anxieties. Metrics are added to catch issues before they hurt us again. Metrics are retained because they're tracking something that used to hurt us and might hurt again. This makes removing metrics hard; the people involved remember why certain metrics are present and intuitively know they need tracking, which means emotion says keep them.

Metrics are scar tissue, and removing scar tissue is hard, bloody work. How do you reduce the number of metrics, while also not compromising your availability goals? You need the hard math of SRE to work down those emotions, but all it takes is one Engineering Manager to say "this prevented a SEV, keep it" to blow that effort up. This also means you'll have much better luck with a metric reformation effort if teams are already feeling the pinch of alert fatigue or your SaaS metric provider bills are getting big enough that the top of the company is looking at metric usage to reduce costs.

Sometimes, metrics feed into Business Intelligence. That's less about scar tissue and more about optimizing your company's revenue operations. Such metrics are less likely to lead to rapid-response on-call rotations, but still can lead to months long investigations into revenue declines. That's a different but related problem.

I could write a book about making your metrics suck less, but that book by necessity has to cover a lot of human-factors issues and has to account for the role of Incident Management in metrics sprawl. Metrics are scar tissue, keep that in mind.

In a Slack I'm on someone asked a series of questions that boil down to:

Our company has a Reliability team, but another team is ignoring SLA/SLO obligations. What can SRE do to fix this?

I got most of the way through a multi-paragraph answer before noticing my answer was, "This isn't SRE's job, it's management's job." I figured a blog post might help explain this stance better.

The genius behind the Site Reliability Engineer concept at Google is that they figured out how to make service uptime and reliability matter to business management. The mathematical framework behind SRE is all about quantifying risk and quantifying impact, which allows quantifying lost revenue, possibly even quantifying lost sales opportunity. All this quantifying falls squarely into the management mindset of "you can't manage what you can't measure," crossed with the subtext of "if I can't measure it, it's an outside dependency I can ignore." SRE is all about making uptime and reliability a business problem worth spending management cycles on.

In the questioner's case we already have some signal that their management has integrated SRE concepts into management practice:

  • They have a Reliability team, which only happens if someone in management believes reliability is important enough to devote dedicated headcount and a manager to.
  • They have Service Level Agreement and Service Level Objective concepts in place
  • Those SLA/SLO obligations apply to more teams than the Reliability team itself, indicating there is at least some management push to distribute reliability thinking outside of the dedicated Reliability team.

The core problem the questioner is running into is that this non-compliant team is getting away with ignoring SLA/SLO obligations, and the answer to "what can SRE do to fix this" is to be found in why and how that team is getting away with it. Management is all about making trade-off decisions against competing priorities; clearly something else is taking higher priority than compliance with SLA/SLO practices. What are those more important priorities, and are they in alignment with upper management's priorities?

As soon as you start asking questions along the lines of "what can a mere individual contributor do to make another manager pay attention to their own manager," you have identified a pathological power imbalance. The one tool you have is "complain to the higher level manager to make them aware of the non-compliance," and hope that higher level manager will do the needful things. If that higher level manager does not do the needful things, the individual contributor is kind of out of luck.

Under their own authority, that is. In the case of the questioner, there is a Reliability team with a manager. This means there is someone in the management chain who officially cares about this stuff, and can raise concerns higher up the org-chart. Non-compliance with policy is supposed to be a management problem, and should have management solutions. The fact the policy in question was put in place due to SRE thinking is relevant, but not the driving concern here.


The above works for organizations that are hierarchical, which implies deeper management chains. If you count the number of managers between the VP of Engineering and the average engineer and that number is between 1.0 and 2.5, you probably have a short enough org-chart to talk directly to the team in question for education (bridging the org-chart, to use Dr. Westrum's term). If the org-chart is deeper than 2.5 managers, you're better served going through the org-chart to solve this particular problem.

But if you're in a short org-chart company and that other team is still refusing to comply with SLA/SLO policies, you're kind of stuck complaining to the VP of Engineering and hoping that individual forces alignment through some method. If the VPofE doesn't, that is a clear signal that Reliability is not as important to management as you thought, and you should go back to the fundamentals of making the case for prioritizing SRE practices generally.