The tech community needs so much repair

This Bluesky thread from Cat Hicks is on the money:

There actually has not been enough reflection in the tech community about DOGE
all I see are "they're not engineers" "that's not what engineering is" I want a thick good really real piece to sink my teeth in about all the parts of this that ARE "like engineering"
I would write it except I want to stay safe
Engineering can be so much better than this and more than this. It's absurd how much we're resigned to this entire area of human work being held hostage by this kind of culture.
I do not believe that the majority of the software people I have worked with all these years sit around thinking, "I want cancer research to be destroyed." We have mountains to climb no doubt, we are trapped in some bad cultures no doubt, but I do not believe this of you for one moment.
Our tech community needs SO much more repair, processing, and collective reflection. The people who worked to support government and science were so betrayed by other groups. The people in flashy big tech cos feeling like their values were betrayed. Just so much repair needed here

I absolutely agree that our tech community needs quite a lot of collective reflection, processing, and work on repair. For a variety of reasons I've done a fair amount of casual research into recovering from trauma of various types. Some of this comes from having lived through the dot-com, Great Recession, and profitability-crunch contractions in the tech market, and some from good old-fashioned life experience outside of the workplace. Trauma is trauma, and we have some pretty good ideas about what chronic (ongoing, persistent) trauma does versus acute (single traumatic event) trauma.

Acute trauma: sudden-death layoff.

Chronic trauma: never being allowed to stay on one project long enough to get something good for your performance review, making you constantly fear the next layoff will have your name on the list.

Each of these affects the body and mind differently, and when entire populations are subjected to chronic traumas you get population-level reactions. When those chronic traumas are structural, as is the case with management culture across US big tech, remediation becomes next to impossible. When individuals in this chronically traumatized population can't fix it, you get three big trauma-responses:

  • Cynicism. I can't fix it, that's just how it is. If you set your expectations low enough, you get to be happily surprised once in a while! It's great.
  • Heroism. Rally your fellow workers to overthrow the corrupt system! Who is with me? If not, I'll do what I can alone.
  • Trauma harder as a way of life. Obviously, we're in a cut-throat system which means I need to cut throats. QED.

Big-tech management likes workers in the trauma-harder category, is somewhat tolerant of cynics, and is designed from the org-chart out to prevent the heroes from getting anywhere useful and to redirect their energies in positive directions like burnishing the company's reputation among diversity hires. Do this for a few decades and you have an entire population of highly educated workers who have been trained their entire careers to look at the next sprint/month/quarter's deliverable for their team and kinda ignore what the rest of the company is doing. How your quarter's project to reduce stream latency by 15% at the p95 level relates to the efficiency of data-analysis in ICE isn't always obvious; don't look up or you might find out.

So take these workers who live in this pit of oppression every day for their day-jobs, and put them into an open-source community for their fun-time activity. What happens then?

  • The cynics contribute as they're able, expecting corporate malfeasance to show up at any point, often seeing it when it isn't there.
  • The heroes go about building a community that actually is healthy for a change! Whew.
  • The trauma-harder crew perpetuate corporate-style power structures because that's how tech works, accidentally reinforcing the cynics and frustrating the heroes.

Fixing this sort of thing requires so much work.

  • The cynics need to be shown, by repairing the structural injustices, that their defensive pessimism is no longer warranted
  • The heroes need to understand they're not alone and are being listened to through an effort of collective reflection
  • The trauma-harder crew needs to realize that alternate structures are viable, and to understand, through extensive processing, the damage they've endured in the existing system

You can't do this overnight; it will take a revolution of some kind. Some revolutions are slow, like the heroes gaining governmental support, letting unionization creep to higher and higher numbers until union contracts dominate worker terms and conditions rather than Radford salary reports. Some revolutions come quickly, like whole industries getting nationalized after a socialist junta and remodeled away from oligarchic control. Some come generationally, like China overtaking the US in big-tech exports, forcing the oligarchs to look elsewhere to stay fat.

Our tech community needs SO much more repair, processing, and collective reflection.

The desperation of Windows

It is no secret I'm a long-time, now-former Windows system administrator. The first Windows I professionally administered was Windows NT, and I was with it up through Windows 2008 (and a touch of 2012). I ran into an observation today that made me go Hmm, and that leads to blogposts.

@xgranade The common complaint is that you need to be a professional software developer to use Linux. But to use Windows 11 (and have it be usable) you need to be a professional sysadmin who uses terms like "group policy".

https://aus.social/@natarasee/115501020317612539

Because this is spot on. I'd argue Linux is usable by non-devs these days, but you still need a tolerance for fiddling and non-standard UI. The Windows side is extremely true. Windows in a corporate context is way more tolerable than Windows in a home context, because the corporate context has a group of grumpy Windows sysadmins setting new Group Policy every time a security or feature release comes out, to turn down the suck. Those grumpy sysadmins are as grumpy at Microsoft pulling this consent-violating shit as you are, and Windows lets you centrally shut it off (in a corporate context).
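For the home-context folks without a Group Policy admin: a lot of those knobs are just registry values under HKLM\SOFTWARE\Policies, which you can set yourself. A minimal sketch in Python follows. The DisableWindowsConsumerFeatures policy is a real, documented one (aimed mainly at Enterprise/Education editions); running this needs an elevated prompt, and a proper domain would push it as an actual GPO rather than a script:

```python
# Sketch: set the documented DisableWindowsConsumerFeatures policy directly.
# Group Policy settings like this land as registry values under
# HKLM\SOFTWARE\Policies. Requires admin rights; a domain would deliver
# this via a real GPO instead of a script.
import winreg

key_path = r"SOFTWARE\Policies\Microsoft\Windows\CloudContent"
with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
    # 1 = suppress the preinstalled-app and suggestion promptware
    winreg.SetValueEx(key, "DisableWindowsConsumerFeatures", 0,
                      winreg.REG_DWORD, 1)
```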

Windows has been losing desktop market-share to Apple for years, and the old "Wintel" cash-cow they used to enjoy is not milking as much as it used to. When software makers see flagging revenue and soft user demand, it's time to do demand forcing! And demand forcing leads to shittier experiences as the "use more software!" message gets ever more aggressive.

Long-time followers of this blog have seen enough of this industry to know the cycle when they see it.

  • Darling product stops being darling for whatever reason. Competition, flagging significance, major incident spoiling user trust, private equity takeover, whatever.
  • The product's Product org has to make number go up in spite of all this, so it jacks renewal prices.
  • Renewal prices can only be jacked so far before growth reverses, so Product has to ship new features to justify the increases.
  • Bad uptake of new features means features get added to base plans to justify jacking the price.
  • Bad uptake of the now-baseline features means more aggressive prompting of those features to drive up Monthly Active Users metrics.
  • Repeat

Do this for enough years and you get the sclerotic Windows 11, full of demand-forcing promptware that pisses your customers off.

On Fediverse, Paul Cantrell, a CompSci professor at Macalester in St. Paul, Minnesota, posted the following list:

Here’s the lightning sketch of Paul’s Treatise Against Efficiency that I’ve never written:

1. Efficiency is asymptotically inefficient: as costs approach zero, the cost of further reducing them approaches infinity.
2. Efficiency prioritizes the measurable over the difficult-to-measure.
3. Efficiency prioritizes what those in power see (or imagine) over on-the-ground reality.
4. Following from 2 and 3, efficiency reduces the amount and quality of information flowing into a human system.
5. Efficiency foments institutional inflexibility.
6. By removing slack, efficiency causes small failures to cascade more readily and increases the risk of catastrophic failure.
7. Following from 4, 5, and 6, efficiency trades small costs for massive risks: from failures, from missed opportunities, and from inability to adjust.
8. Efficiency, when pushed, strangles the emergent phenomena that in the long term create all new things of value.
9. Thus, although it can be a by-product of evolution, efficiency as a goal in itself strangles evolution.
10. Efficiency as a goal strangles joy.

It turns out I do have time to write an article about that, below the fold.

ServerFault spam issues

For the maybe three of you still there, the StackExchange sites have been experiencing a multi-month spam wave from a certain group of spammers who've built automation for our sites. SuperUser gets it harder than ServerFault, but we both get hit. So I did some diving today to figure out just how bad the floods are.

For 2025:

  • Non-deleted questions: 2512
  • Deleted questions (mostly spam): 9748
  • One specific spam type: 6863

Yes, we're getting way more spam than legitimate questions right now (nearly four deleted questions for every legitimate one), and have all year.

The Metasmoke people are doing a lot to make sure you rarely see it, but it's still a burden on the moderation staff due to the need to delete the spam-users. That deletion helps the StackExchange anti-spam systems identify bad IPs and other signals, which increases the burden on spam-campaigners like these.

I wrote a weird little book: Software Telemetry. I'm still getting royalties, so thank you all for buying, but this book does not easily fit into 2025 concepts of "observability engineering," so I want to talk about my goals and how the book still fits.

At the base, I ended up writing a book for Platform teams looking to deliver internally deployed observability systems. That's not quite what I had in mind when I started writing in 2020, but that's where it lives five years later. My actual goal was to write a book usable by people in the SaaS industry, but also in businesses where internally developed software is used mainly by internal users. That non-SaaS population often gets ignored in book targeting, and I wanted something that would let those people feel seen in a way that reading yet another Observability for Cloud Systems book wouldn't. In 2025, this book is a Platform book.

In 2019 and early 2020, when I was working with Manning on the title and terms, the word "observability" came up. It seems hard to remember in 2025, but "observability" was still a vague term that didn't yet have industry consensus behind it. OpenTelemetry was a thing at the time, but the "metrics" leg of OTel was still in beta, and "logs" was merely roadmapped. In 2025 there are debates around whether the fourth pillar is profiles, performance traces, or errors (which could be stack-dumps or a category of logs). If we had decided to use "observability" instead of "telemetry" the book might have sold better, but "telemetry" works better for me because observability is a practice built on top of telemetry signals. I wasn't writing a book about practice, I was writing about herding signals.

Herding signals, not interpreting them.

In 2025, most of that herding is supposed to happen through OpenTelemetry. And where it isn't OTel, the signals are being herded through other systems like Apache Spark. This is the industry consensus: instrument your code, add attributes in the emitters and collectors, change your vendors as you need to, build dashboards in your vendor's platform. A rewrite of Software Telemetry would reference OTel far more often than I did, but I would still make sure to mention non-OTel styles, because OTel isn't actually supported (or in some cases isn't a good fit) in certain environments, like network telemetry.
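To make "industry consensus" concrete, here's a minimal sketch of that loop using the OpenTelemetry Python SDK. The SDK calls are the real opentelemetry-sdk API, but the service name, span name, and attribute are invented for illustration, and the console exporter stands in for whatever vendor OTLP exporter you'd actually ship to:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Instrument your code: declare who is emitting.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))

# The "change your vendors as you need to" step: swap ConsoleSpanExporter
# for your vendor's OTLP exporter and nothing upstream changes.
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout.cart")

# Add attributes in the emitter; collectors can enrich further downstream.
with tracer.start_as_current_span("add_to_cart") as span:
    span.set_attribute("cart.item_count", 3)
```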

Whatever the API format of the signals being herded, platform engineers need to know the fundamentals of how telemetry systems operate, and that's what I wrote about. But I also wrote about storing those signals, which is something OpenTelemetry deliberately leaves as a detail for the implementer. As I extensively wrote about, storing signals and creating a reporting interface is a hard enough part of telemetry that you can build a business around it. In fact, the Observability Tools market in 2025 is valued at around $2.75 billion, and every player in it would love for you to use OTel to send them data to store and present.

In the language of my book, OpenTelemetry is an early, shipping-stage technology: early because it has no role in storage. OTel arguably has a role in the emitting stage too, through explicit markup in the code itself. OpenTelemetry's impact on the presentation stage is mostly in tagging and attribute schemas and how they get represented in storage. Observability needs to consider every stage, but also the SRE-guide problems of figuring out what to instrument, to which markup standards, following which procedures to ensure reliability. Observability sits on top of telemetry.

One of the consistent comments I got during the pre-publication reviews was: "I want to know what to track."

My answer was simple: that's not the book I'm writing.

This book is for you, the growth engineer tasked with taking a Kafka topic (or group of topics) of logging data, sent there by OTel, and transforming it in the big Databricks instance with all the other business data.

This book is for you, the network engineer tasked with extracting network metrics out of a proprietary system, so you can chart network things in the main engineering dashboarding platform.

This book is for you, the security engineer tasked with extracting security event data out of a cloud provider to put into the SIEM system.

This book is for you, the project manager who has just been given a digital transformation project to revitalize how all the internally developed apps will produce telemetry, and how engineers will observe the system.

A sentiment crossed Fediverse recently, in the vein of "RSS was peak social media, change my mind". The original post was from https://hachyderm.io/@Daojoan@mastodon.social and is quoted below:

RSS never tracked you.
Email never throttled you.
Blogs never begged for dopamine.
The old web wasn’t perfect.
But it was yours.

https://mastodon.social/@Daojoan/114587431688413845

I was there for the rise and fall of blogging, so the rest of this post is me overthinking this particular post.

The Department of Government Efficiency, Musk's vehicle, made news by "discovering" the General Services Administration uses tapes, and plans to save $1M by switching to something else (disks, or cloud-based storage). Long-time readers of this blog may remember I used to talk a lot about storage and tape backup. Guess it's time to get my antique Storage Nerd hat out of the closet (this is my first storage post since 2013) to explain why tape is still relevant in an era of 400Gb backbone networks and 30TB SMR disks.

The SaaS revolution has utterly transformed the office automation space. The job I had in 2005, in the early years of this blog, only exists in small pockets anymore. So many office systems have been SaaSified that the old problems I used to blog about, around backups and storage tech, are much less pressing in the modern era. Where those problems persist are places with decades of old file data, starting in the mid-to-late 1980s, that is still being hauled around. Even when I was still doing this in the late 2000s, the needle was shifting to large arrays of cheap disks replacing tape arrays.

Where you still see tape being used is in offices with policies for "off-site" or "offline" storage of key office data. A lot of that is also done on disk these days, but some offices kept their tape libraries. The InfoSec space is keen to point out that you can't crypto-locker an offline tape, so offline tape is a useful tool in recovering from a ransomware incident. I suspect a lot of what DOGE found was in this category of offices retaining tape infrastructure. Is disk cheaper here? Marginally; the true savings will be much less than the $1M headline rate.

But there is another area where tape continues to be the economical option, and it's another area DOGE is going to run into: large scientific datasets.

To explain why, I want to use a contrasting example: A vacation picture you took on an iPhone in 2011, put into Dropbox, shared twice, and haven't looked at in 14 years. That file has followed you to new laptops and phones, unseen, unloved, but available. A lot goes into making sure it's available.

All the big object-stores like S3, and file-sync-and-share services (like Dropbox, Box, MS live, Google Drive, Proton Drive, etc) use a common architecture because this architecture has been proven to be reliable at avoiding visible data-loss:

  • Every uploaded file is split into 4KB blocks (the size is relevant to disk technology, which I'm not going into here)
  • Each block is written between 3 and 7 times to disk in a given datacenter or region; the exact replication factor varies by service and internal realities
  • Each block is replicated to more than one geographic region as a disaster resilience move, generally at least 2, often 3 or more

The end result of the above is that the 1MB vacation picture is written to disk 6 to 14 different times. The nice thing about the above is you can lose an entire rack-row of a datacenter and not lose data; you might lose 2 of your 5 copies of a given block, but you have 3 left to rebuild, and your other region still has full copies.

But I mentioned this 1MB file has been kept online for 14 years. Assuming an average disk life-span of 5 years, each block has been migrated to new hardware roughly 3 times in those years. Meaning each 4KB block of that file has been resident on somewhere between 24 and 56 hard drives; or more, if your provider replicates to more than 2 discrete geographic regions. Those drives have been spinning and using power (and therefore requiring cooling) the entire time.
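A back-of-envelope sketch of that arithmetic, using the replication factors and 5-year drive lifespan assumed above (real providers don't publish their actual figures, so treat these as illustrative):

```python
YEARS_ONLINE = 14
DRIVE_LIFESPAN = 5
migrations = round(YEARS_ONLINE / DRIVE_LIFESPAN)  # ~3 hardware migrations
drives_per_copy = 1 + migrations                   # original write + migrations

for per_region, regions in [(3, 2), (7, 2)]:       # low and high replication cases
    copies = per_region * regions                  # copies written per 4KB block
    print(f"{copies} copies per block -> {copies * drives_per_copy} drives touched")

# 6 copies per block -> 24 drives touched
# 14 copies per block -> 56 drives touched
```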

These systems need to go to all of this effort because they need to be sure that all files are available all the time, when you need them, where you need them, as fast as possible. If a person in that vacation photo retires, and you suddenly need that picture for the Retirement Montage at their going-away party, you don't want to wait hours for it to come off tape. You want it now.

Contrast this to a scientific dataset. Once the data has stopped being used for Science! it can safely be archived until someone else needs to use it. This is the use-case behind AWS S3 Glacier: you pay a lot less for storing data, so long as you're willing to accept delays measurable in hours before you can access it. This is also the use-case where tape shines.

A lab gets done chewing on a dataset sized at 100TB, which is pretty chonky for 2011. They send it to cold storage. Their IT section dutifully copies the 100TB dataset onto LTO-5 tapes at 1.5TB per tape, for a stack of 67 tapes, and removes the dataset from their disk-based storage arrays.

Time passes, as with the Dropbox-style data. LTO drives can read media from one or two generations prior. Assuming the lab's IT section keeps up on tape technology, it would be the advent of LTO-7 in 2015 that would prompt a great restore-and-rearchive effort for all LTO-5 and older media. LTO-7 can do 6TB per tape, for a much smaller stack of 17 tapes.

LTO-8 changed this, with only a one-generation lookback. So when LTO-8 comes out in 2017 with a 12TB native capacity, a restore/rearchive effort runs again, changing our stack of tapes from 17 to 9. LTO-9 comes out in 2021 with 18TB per tape, and that stack reduces to 6 tapes to hold 100TB.
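The tape-stack arithmetic, for anyone who wants to check my division (capacities are the native, uncompressed figures):

```python
import math

DATASET_TB = 100
lto_native_tb = {
    "LTO-5 (2010)": 1.5,
    "LTO-7 (2015)": 6.0,
    "LTO-8 (2017)": 12.0,
    "LTO-9 (2021)": 18.0,
}

for generation, capacity in lto_native_tb.items():
    print(f"{generation}: {math.ceil(DATASET_TB / capacity)} tapes")

# LTO-5: 67 tapes, LTO-7: 17, LTO-8: 9, LTO-9: 6
```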

All in all, our cold dataset had to relocate to new media three times, same as the disk-based stuff. However, keeping stacks of tape in a climate-controlled room is vastly cheaper than a room of powered, spinning disk. The actual reality is somewhat different: the few data archive people I know mention they do great restore/archive runs about every 8 to 10 years, largely driven by changes in drive connectivity (SCSI, SATA, FibreChannel, Infiniband, SAS, etc), OS and software support, and corporate purchasing cycles. Keeping old drives around for as long as possible is fiscally smart, so the true number of recopy events for our example data is likely one.

So another lab wants to use that dataset and puts in a request. A day later, the data is on a disk-array for usage. Done. Carrying costs for that data in the intervening 14 years are significantly lower than the always available model of S3 and Dropbox.

Tape: still quite useful in the right contexts.

Applied risk management

I've been in the tech industry for an uncomfortable amount of time, and I've been doing resilience planning the whole time. You know, when and how often to take backups, segueing into worrying about power diversity, things like that. My last two years at Dropbox gave me exposure to how that works when you have multiple datacenters. It gets complex, and there are enough moving parts that you can actually build models around expected failure rates in a given year, to better help you prioritize remediation and prep work.
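A minimal sketch of what those models look like at their simplest: annualize each risk as (expected events per year) times (cost per event), then rank by exposure. The scenarios and dollar figures below are invented for illustration; real models pull their rates from incident history and vendor failure data:

```python
# Annualized loss expectancy: expected events/year x cost per event.
# Scenarios and numbers are made up for illustration.
risks = [
    ("backup restore fails on test",  2.0,    15_000),
    ("single-rack power loss",        0.5,    40_000),
    ("datacenter offline > 4 hours",  0.05,  900_000),
]

# Sort by exposure, biggest first, to get a remediation priority list.
for name, rate_per_year, cost in sorted(risks, key=lambda r: -(r[1] * r[2])):
    print(f"{name:30s} expected annual loss: ${rate_per_year * cost:>9,.0f}")
```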

Meanwhile, everyone in the tech-disaster industry peers over the shoulders of environmental disaster recoveries like hurricanes and earthquakes. You can learn a lot by watching the pros. I've talked about some of what we learned; mostly it has been procedural in nature.

Since then, the United States elected a guy who wants to be dictator, and a Congress that seems willing to let it happen. For those of us in the disliked minority of the moment, we're facing concerted efforts to roll back our ability to exist in public. That's risk. Below the fold I talk about what I learned from IT risk management and how I apply those techniques to assess my own risks. It turns out building risks for "dictatorship in America" can't rely on prior art as much as risks for "datacenter going offline," which absolutely has prior art to include, and even incident rates to factor in.

Blog Questions Challenge 2025

Thanks to Ben Cotton for sharing.

Why did you start blogging in the first place?

I covered a lot of that in 20 years of this nonsense from about a year ago. The quick version is I was charged with creating a "Static pages from your NetWare home directory" project and needed something to test with, so here we are. That version was done with Blogger before the Google acquisition, when they still supported publish-by-ftp (which I also had to set up as part of the same project).

What platform are you using to manage your blog, and why do you use it?

When Blogger got rid of the publish-by-ftp method, I had to move. I came to my own domain and went looking for blogging software. On advice from an author I like, I kept the Slashdot effect in mind: I wanted to be sure that if an article got an order of magnitude more traffic than usual, it wouldn't melt the server it was on. So I wanted something relatively lightweight, which at the time was Movable Type. WordPress required database hits for every webpage, which didn't seem to scale.

I stuck with it because Movable Type continues to do the job quite well and remains ergonomic for me. I turned off comments a while ago, as that was an anti-spam nightmare I needed recency to solve. Movable Type now requires a thousand dollars a year for a subscription, which pencils out to about $125 per blog post at my current posting rate. Not worth it.

Have you blogged on other platforms before?

Like just about everyone my age, I was on LiveJournal. I don't remember if this blog or LJ came first, and I'm not going to go check. I had another blog on Blogger for a while, about local politics. It has been lost to time, though it's still visible on archive.org if you know where to look for it.

How do you write your posts?

Most are spur of the moment. I have a topic, and time, and remember I can be long-form about it. Once in a while I'll get into something on social media and realize I need actual wordcount to do it justice, so I do it here instead. The advent of Twitter absolutely slowed down my posting rate here!

Once I have the words in, I schedule a post for a few days hence.

When do you feel most inspired to write?

As with all writers, it comes when it comes. Sometimes I set out goals and I stick to them. But blogging hasn't been a focus of mine for a long time, so it's entirely whim. I do know I need an hour or so of mostly uninterrupted time to get my thoughts in order, which is hard to come by without arranging for it.

Do you normally publish immediately after writing, or do you let it simmer a bit?

As mentioned above, I use scheduled-post. Typically for 9am, unless I've got something spicy and don't care. That's rare; I've also learned that posting spicy takes absolutely needs a cooling-off period. I've pulled posts after writing them because I realized they didn't actually need to get posted, I merely needed to write them.

What's your favorite post on your blog?

That's changed a lot over the years as I've changed.

  • For a long time, I was proud of my Know your IO series from 2010. That was prompted by a drop-by conversation from one of our student workers who had a question about storage technology. I infodumped for most of an hour, and realized I had a blog series. This is still linked from my sidebar on the right.
  • From recent history, the post Why I don't like Markdown in a git repo as documentation is a still accurate distillation of why I seriously dislike this reflexive answer to workplace knowledge sharing.
  • This post about the lost history of why you wait for the first service pack before deploying anything is me bringing old-timer points of view to newer audiences. The experiences in this post are drawn directly from where I was working in 2014-2015. Yes Virginia, people still do ship shrink-wrap software to Enterprise distros. Some of you are painfully aware of this.

I'm not stopping blogging any time soon. At some point the dependency chain for Movable Type will rot and I'll have to port to something else, probably a static site generator. I believe I'm spoiled for choice in that domain.

Back in November I posted about how to categorize your incidents using the pick-one list common across incident automation platforms. In that post I said:

A few organizations go so far as to have a fully separate process for the 'High' and 'Critical' urgencies of events, maybe calling them Disaster Recovery events instead of Incidents. DR events need to be rare, which means that process isn't as well exercised as Incident response. However, a separate process makes it abundantly clear that certain urgencies and scopes require different process overall. More on this in a later blog-post.

This is the later blog post.

The SaaS industry as a whole has been referring to California's wildfire command model (which grew into the Incident Command System) for inspiration on handling technical incidents. The basic structure is familiar to any SaaS engineer:

  • There is an Incident Commander who is responsible for running the whole thing, including post-incident processes
  • There is a Technical Lead who is responsible for the technical response

There may be additional roles available depending on organizational specifics:

  • A Business Manager who is responsible for the customer-facing response
  • A Legal Manager who is responsible for anything to do with legal
  • A Security Lead who is responsible for heading security investigations

Again, familiar. But truly large incidents put stress on this model. In a given year, the vast majority of incidents experienced by an engineering organization will be the grass-fire variety that can be handled by a team of four people in under 30 minutes. What happens when a major event hits?

The example I'm using here is a private-information disclosure by a hostile party using a compromised credential. Someone not employed by the company dumped a database they shouldn't have had access to, and that database involved data that requires disclosure in the case of compromise. Given this, we already know some of the workstreams incident response will be running once this activity is discovered:

  • Investigatory work to determine what else the attacker got access to, and to fully define the scope of what leaked
  • Locking down the infrastructure to close the holes used by the attacker for the identified access
  • Cycling/retiring credentials possibly exposed to the attacker
  • Regulated notification generation and execution
  • Technical remediation work to lock down any exploited code vulnerabilities

An antiseptic list, but a scary one. The moment the company officially notices a breach of private information, legislation worldwide starts timers on when privacy regulators or the public need to be informed (GDPR's 72-hour regulator notification is the famous one). For a profit-driven company, this is admitting fault in public, which is something none of them do lightly due to the lawsuits that will result. For publicly traded companies, stockholder notification will also need to be generated. Incidents like this look very little like the availability-SLA-breach SEVs that happen 2-3 times a month in different systems.

Based on the rubric I showed back in November, an incident of this type is of Critical urgency due to the regulated timelines, and will require either a Cross-Org or C-level response depending on the size of the company. What's more, the need to figure out where the attacker went blocks later stages of response, so this response will actually be a 24-hour operation and will likely run several days. No one person can safely stay awake for 4+ days straight.

The Incident Command Process defines three types of command structure:

  • Solitary command - where one person is running the whole show
  • Unified command - where multiple jurisdictions are involved and need to coordinate, which also provides shift changes by rotating who is the Operations Chief (what SaaS folk call the Technical Lead)
  • Area command - where multiple incidents are part of a larger complex, the Area Commander supports each Incident Command

Incidents of the scale of our private-information breach lean into the Area Command style for a few reasons. First and foremost, there are discrete workstreams that need to be executed by different groups: the security review to isolate scope, building regulated notifications, and cycling credentials. All those workstreams need people to run them, and those workstream leads need to report to incident command. That looks a lot like Area Command to me.

If your daily incident experience is 4-7 person team responses, how ready are you to be involved in an Area Command style response? Not at all.

If you've been there for years and have seen a few multi-org responses in your time, how ready are you to handle Area Command style response? Better, you might be a good workstream lead.

One thing the Incident Command Process makes clear is that Area Commanders do not have an operational role, meaning they're not involved in the technical remediation. Their job is coordination, logistics, and high-level decision making across response areas. For our pretend SaaS company, a good Area Commander will be:

  • Someone who has experience with incidents involving legal response
  • Someone who has experience with large security response, because the most likely incidents of this size are security related
  • Someone who has experience with incidents involving multiple workstreams requiring workstream leaders
  • Someone who has experience communicating with C-Levels and has their respect
  • Two to four of these people, in order to safely staff a 24-hour response for multiple days (rough staffing arithmetic in the sketch below)
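The staffing math behind that last bullet, sketched out. The shift lengths are my assumption; pick whatever your people can actually sustain:

```python
import math

# A 24-hour response needs full shift coverage, plus at least one spare
# commander for handoffs, rest, and the inevitable sick day on a multi-day run.
for shift_hours in (12, 8):
    per_day = math.ceil(24 / shift_hours)
    print(f"{shift_hours}h shifts: {per_day} commanders per day, "
          f"{per_day + 1} to run it for multiple days")

# 12h shifts: 2 per day, 3 total; 8h shifts: 3 per day, 4 total
```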

Is your company equipped to handle this scale of response?

In many cases, probably not. Companies handle incidents of this type a few different ways. As I mentioned in the earlier post, some categorize problems like this as a disaster instead of an incident and invoke a different process. This has the advantage of making it clear that the response is different, at the cost of having far fewer people familiar with the response methods. You make up for the lack of in situ, learn-by-doing training by regularly re-certifying key leaders on the process.

Other companies extend the existing incident response process on the fly rather than risk having a separate process that will go stale. This works so long as you have some people around who kind of know what they're doing and can herd others into the right shape. Though after the second disaster of this scale, people will start talking about how to formalize procedures.

Whichever way your company goes, start thinking about this. Unless you're working for a hyperscaler, incidents of this response scope are going to be rare, which means day-to-day work won't build the muscle memory. So you need to schedule quarterly time to train, practice, and certify your Area Commanders and workstream leads. This will speed up response overall, because less time will be spent arguing over command and feedback structures.