Growth vs throughput mindsets

I've talked about this one to various managers over the years when the topic of work-scheduling humans comes up. There are two big frameworks for deciding who works which tickets:

  • Throughput: Assign the person who will complete the task fastest to the ticket or story. Helpdesks often work on this model, since they're frequently quite under water.
  • Growth: Assign the second most knowledgeable person to the ticket/story (or third, or fourth if you have that many people) so they can get experience solving the problem. This method solves the cross-training problem.
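The two mindsets boil down to a one-line difference in who gets picked. A toy sketch in Python (names, topics, and skill scores are all invented for illustration):

```python
# Two ticket-assignment policies over a team's per-topic familiarity scores.
def assign(ticket_topic: str, team: dict[str, dict[str, int]], mode: str) -> str:
    # Rank the team from most to least familiar with this ticket's topic.
    ranked = sorted(team, key=lambda p: team[p].get(ticket_topic, 0), reverse=True)
    if mode == "throughput":
        return ranked[0]  # most knowledgeable person resolves it fastest
    # growth: hand it to the runner-up so they build experience
    return ranked[1] if len(ranked) > 1 else ranked[0]

team = {
    "ana": {"dns": 9, "storage": 2},
    "bo":  {"dns": 5, "storage": 7},
    "cy":  {"dns": 1, "storage": 4},
}
print(assign("dns", team, "throughput"))  # ana
print(assign("dns", team, "growth"))      # bo
```

A real scheduler would also weigh workload and urgency, but the policy choice itself really is this small.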

Both approaches are valid, and which one a team takes is shaped by the incentives the team operates under. For a Helpdesk team, which is often under-staffed, throughput tends to win. For a Dev team with on-call responsibilities, growth tends to be the norm, both to make sure problems can be handled by anyone and to stop burning out your senior talent. Security teams tend to be a blend of both: ticket-based security reviews urge a throughput mindset, while incident-response prioritizes growth. For Platform teams, which tend to be involved in a lot of incidents because they run the systems that problems happen on (even when the problem wasn't actually the platform), throughput mindsets are hard to avoid.

Now add coding LLMs to the mix.

The on-label reason to use agents such as Claude is throughput: spend less time working on tasks to improve your throughput. For Engineering Directors this is a great thing, because you can get throughput advances in your reporting teams without sacrificing engineering maturity, letting you lean on the product roadmap a little harder.

I'd argue that the throughput nature of LLM agents actually sacrifices some growth at population scale in today's typical tech companies. Moving from a generative mindset to a revisionary one, from writing code to reviewing and editing generated code, means engineers spend less time learning problem-spaces in detail. Less time learning means more calendar time to build up true domain knowledge. Coding agents are a different thing than the decades of "just add another abstraction layer and forget about the lower level details" we've been doing since 1970. It is true that most engineers in SaaS companies aren't spending lots of time hand-tuning assembly for their ISA, and doing so is actually very bad practice outside of extremely specific problem domains. But the difference between yet another abstraction layer and a coding agent is the difference between abstraction and synthesis.

Abstraction is a distillation of a problem-space into an API, which can be represented by gRPC protobufs, function calls, or whatever. You have inputs, expected actions, and expected results; the abstraction handles the details so you don't have to, and often someone else is responsible for updates over time.

Synthesis is building novel functionality through combining multiple things to create a new thing. Humans traditionally have been the synthesis engines of code-production, but coding agents are beginning to automate portions of this.

Writing code is synthesis. Until the advent of coding agents, your IDE would give you support through syntax highlighting and API short-cuts, reducing the cognitive load of bare syntax problems to let you focus on higher level problems like how a function should handle bad inputs. The IDE gives you support for abstractions, but leaves the synthesis to you.

Once coding agents get added in, you shift from generating code (synthesis) to reviewing automatically generated code whenever the agent creates something for you. This is when cognitive biases come into play. Remember the mindset thing I opened this post with? An engineer under throughput stress is going to be less diligent about checking generated code in detail, and will shift more of their domain learning to troubleshooting test failures and incident response. Less domain knowledge will be developed during the initial writing. An engineer with no throughput stress, perhaps doing some for-fun coding, is more likely to take a fine-toothed comb to generated code to learn what it is trying to do, why it works the way it does, and what best practices it seems to be following; in this throughput-free case the coding agent is a driver of growth.

Coding agents end up magnifying the biases present in the team's environment: they enhance positive feedback loops. The advertised reason to adopt coding agents is to increase developer throughput! That throughput bias is baked into the technology and its marketing, which means organizations requiring coding-agent use need to take steps to add a counterbalancing negative feedback loop, or risk large-scale degradation in code quality and incident severity. Use of these agents makes an organization more sensitive to growth/throughput bias shifts prompted by org-chart and quarterly-goal changes.

Combine the bias-enhancing quality of coding agents with the industry-wide retraction in US-based jobs for economic reasons, and you should be seeing a general retraction in overall growth mindsets among US-based engineers.

The tech community needs so much repair

This Bluesky thread from Cat Hicks is on the money:

There actually has not been enough reflection in the tech community about DOGE
all I see are "they're not engineers" "that's not what engineering is" I want a thick good really real piece to sink my teeth in about all the parts of this that ARE "like engineering"
I would write it except I want to stay safe
Engineering can be so much better than this and more than this. It's absurd how much we're resigned to this entire area of human work being held hostage by this kind of culture.
I do not believe that the majority of the software people I have worked with all these years sit around thinking, "I want cancer research to be destroyed." We have mountains to climb no doubt, we are trapped in some bad cultures no doubt, but I do not believe this of you for one moment.
Our tech community needs SO much more repair, processing, and collective reflection. The people who worked to support government and science were so betrayed by other groups. The people in flashy big tech cos feeling like their values were betrayed. Just so much repair needed here

I absolutely agree that our tech community needs quite a lot of collective reflection, processing, and work on repair. For a variety of reasons I've done a fair amount of casual research into recovering from trauma of various types. Some of this comes from having lived through the dot-com, great-recession, and profitability-crunch contractions in the tech market, and some from good old fashioned life experience outside of the workplace. Trauma is trauma, and we have some pretty good ideas about what chronic (ongoing, persistent) trauma does versus acute (single traumatic event) trauma.

Acute trauma: sudden-death layoff.

Chronic trauma: never being allowed to stay on one project long enough to get something good for your performance review, making you constantly fear the next layoff will have your name in the list.

Each of these affects the body and mind differently, and when entire populations are subjected to chronic traumas you get population-level reactions. When those chronic traumas are structural, as is the case with management culture in all of big US-tech, remediation becomes next to impossible. When individuals in this chronically traumatized population can't fix it, you get three big trauma-responses:

  • Cynicism. I can't fix it, that's just how it is. If you set your expectations low enough, you get to be happily surprised once in a while! It's great.
  • Heroism. Rally your fellow workers to overthrow the corrupt system! Who is with me? If not, I'll do what I can alone.
  • Trauma harder as a way of life. Obviously, we're in a cut-throat system which means I need to cut throats. QED.

Big-tech management likes workers in the trauma harder category, is somewhat tolerant of cynics, and is designed from the org-chart out to prevent the heroes from getting anywhere useful and to redirect their energies in positive directions like burnishing the company's reputation among diversity hires. Do this for a few decades and you have an entire population of highly educated workers who have been trained their entire careers to look at the next sprint/month/quarter's deliverable for your team and kinda ignore what the rest of the company is doing. How your quarter's project to reduce stream latency by 15% at the p95 level relates to the efficiency of data-analysis in ICE isn't always obvious, don't look up or you might find out.

So take these workers who live in this pit of oppression every day for their day-jobs, and put them into an open-source community for their fun-time activity. What happens then?

  • The cynics contribute as they're able, expecting corporate malfeasance to show up at any point, often seeing it when it isn't there.
  • The heroes go about building a community that actually is healthy for a change! Whew.
  • The trauma harder crew perpetuate corporate-style power structures because that's how tech works, accidentally reinforcing the cynics and frustrating the heroes.

Fixing this sort of thing requires so much work.

  • The cynics need to see the structural injustices repaired, to learn that their defensive pessimism is no longer necessary
  • The heroes need to understand they're not alone and are being listened to, through an effort of collective reflection
  • The trauma harder crew needs to realize that alternate structures are viable, and to understand the damage they've endured in the existing system, through extensive processing

You can't do this overnight, it will take a revolution of some kind. Some revolutions are slow, like the heroes getting somewhere with governmental support allowing unionization to creep in higher and higher numbers until union contracts dominate worker terms and conditions rather than Radford salary reports. Some revolutions come quick like whole industries getting nationalized after a socialist junta and remodeled away from oligarchic control. Some come generationally, like China overtaking the US for big-tech exports forcing the oligarchs to look elsewhere to stay fat.

Our tech community needs SO much more repair, processing, and collective reflection.

The desperation of Windows

It is no secret I'm a long-time, now-former Windows system administrator. The first Windows I professionally administered was Windows NT, and I was with it up through Windows 2008 (and a touch of 2012). I ran into an observation today that made me go Hmm, and that leads to blogposts.

@xgranade The common complaint is that you need to be a professional software developer to use Linux. But to use Windows 11 (and have it be usable) you need to be a professional sysadmin who uses terms like "group policy".

https://aus.social/@natarasee/115501020317612539

Because this is spot on. I'd argue Linux is usable by non-devs these days, though you still need a tolerance for fiddling and non-standard UI. The Windows side is extremely true. Windows in a corporate context is way more tolerable than Windows in a home context, because the corporate context has a group of grumpy Windows sysadmins setting new Group Policy every time a security or feature release comes out, to turn down the suck. Those grumpy sysadmins are as grumpy at Microsoft pulling this consent-violating shit as you are, and Windows lets you centrally shut it off (in a corporate context).

Windows has been losing desktop market-share to Apple for years, and the old "Wintel" cash-cow they used to enjoy is not milking like it used to. When software makers see flagging revenue and soft user demand, it's time for demand forcing! And demand forcing leads to shittier experiences as the "use more software!" message gets ever more aggressive.

The long time followers of this blog have seen enough of this industry to know the cycle when they see it.

  • Darling product stops being darling for whatever reason. Competition, flagging significance, major incident spoiling user trust, private equity takeover, whatever.
  • The product's Product org has to make number go up in spite of all this, so it jacks up renewal prices.
  • Renewal prices only jack up so far before growth reverses, so Product has to ship new features to justify the increases.
  • Bad uptake of new features means features get added to base plans to justify jacking the price.
  • Bad uptake of now-baseline features means more aggressive prompting of those features to drive up Monthly Active Users metrics.
  • Repeat

Do this for enough years and you get the sclerotic Windows 11, full of demand-forcing promptware that pisses your customers off.

On Fediverse, Paul Cantrell, a CompSci professor at Macalester in St. Paul, Minnesota, posted the following list:

Here’s the lightning sketch of Paul’s Treatise Against Efficiency that I’ve never written:

1. Efficiency is asymptotically inefficient: as costs approach zero, the cost of further reducing them approaches infinity.
2. Efficiency prioritizes the measurable over the difficult-to-measure.
3. Efficiency prioritizes what those in power see (or imagine) over on-the-ground reality.
4. Following from 2 and 3, efficiency reduces the amount and quality of information flowing into a human system.
5. Efficiency foments institutional inflexibility.
6. By removing slack, efficiency causes small failures to cascade more readily and increases the risk of catastrophic failure.
7. Following from 4, 5, and 6, efficiency trades small costs for massive risks: from failures, from missed opportunities, and from inability to adjust.
8. Efficiency, when pushed, strangles the emergent phenomena that in the long term create all new things of value.
9. Thus, although it can be a by-product of evolution, efficiency as a goal in itself strangles evolution.
10. Efficiency as a goal strangles joy.

It turns out I do have time to write an article about that, below the fold.

ServerFault spam issues

For the maybe three of you still there, the StackExchange sites have been experiencing a multi-month spam wave from a certain group of spammers who've built automation for our sites. SuperUser gets it harder than ServerFault, but we both get hit. So I did some diving today to figure out just how bad the floods are.

For 2025:

  • Non-deleted questions: 2512
  • Deleted questions (mostly spam): 9748
  • One specific spam type: 6863

Yes, we're getting way more spam than legitimate questions right now, and have all year.
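Penciling out the counts above (plain arithmetic on the numbers from this post, nothing queried live):

```python
# Spam share of ServerFault's 2025 question intake, using the counts above.
non_deleted = 2512    # legitimate questions that survived
deleted = 9748        # deleted questions, mostly spam
one_campaign = 6863   # a single spam type's contribution to the deletions

total = non_deleted + deleted
print(f"total asked:   {total}")                       # 12260
print(f"deleted share: {deleted / total:.0%}")         # 80%
print(f"one campaign:  {one_campaign / deleted:.0%}")  # 70% of deletions
```

Roughly four out of every five questions asked this year were spam, and a single campaign accounts for most of the deletions.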

The Metasmoke people are doing a lot to make sure you rarely see it, but it's still a burden on the moderation staff due to the need to delete the spam users. That deletion helps the StackExchange anti-spam systems identify bad IPs and other signals, which increases the cost for spam campaigns like these.

I wrote a weird little book. I'm still getting royalties, so thank you all for buying, but this book does not easily fit into 2025 concepts of "observability engineering," so I want to talk about my goals and how it still fits.

At the base, I ended up writing a book for Platform teams looking to deliver internally deployed observability systems. That's not quite what I had in mind when I started writing in 2020, but that's where it lives five years later. My actual goal was to write a book usable by people in the SaaS industry, but also in businesses where internally developed software mainly serves internal users. That non-SaaS population often gets ignored in book targeting, and I wanted something that would let these people feel seen in a way that yet another Observability for Cloud Systems book would not. In 2025, this book is a Platform book.

In 2019 and early 2020, when I was working with Manning on the title and terms, the word "observability" came up. It seems hard to remember in 2025, but "observability" was still a vague term that didn't yet have industry consensus behind it. OpenTelemetry was a thing at the time, but the "metrics" leg of OTel was still in beta, and "logs" was merely roadmapped. In 2025 there are debates around whether the fourth pillar is profiles, performance traces, or errors (which could be stack-dumps or a category of logs). If we had used "observability" instead of "telemetry" the book might have sold better, but "telemetry" works better for me because observability is a practice built on top of telemetry signals. I wasn't writing a book about practice, I was writing about herding signals.

Herding signals, not interpreting them.

In 2025, most of the herding is supposed to be done through OpenTelemetry. Where it isn't OTel, signals are being herded through other systems like Apache Spark. This is industry consensus: instrument your code, add attributes in the emitters and collectors, change vendors as you need to, build dashboards in your vendor's platform. A rewrite of Software Telemetry would reference OTel far more often than I did, but I would still mention non-OTel styles, because OTel isn't actually supported (or in some cases, a good fit) in certain environments like network telemetry.

Whatever the API format of the signals getting herded, platform engineers need to know the fundamentals of how telemetry systems operate, and that's what I wrote about. But I also wrote about storing those signals, which is something OpenTelemetry deliberately leaves as a detail for the implementer. As I extensively wrote about, storing signals and creating a reporting interface is a hard enough part of telemetry that you can build a business around it. In fact, the Observability Tools market in 2025 is valued at around $2.75 billion, and every vendor in it would love for you to use OTel to send them data to store and present.

In the language of my book, OpenTelemetry is an early shipping stage technology: early, because it has no role in storage. OTel arguably has a role in the emitting stage through explicit markup in code itself. OpenTelemetry's impact on the presentation stage is mostly in tagging and attribute schemas and how they get represented in storage. Observability needs to consider every stage, but also the SRE-guide problems of figuring out what to instrument, to which markup standards, following which procedures to ensure reliability. Observability sits on top of telemetry.
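To make that stage vocabulary concrete, here's a toy pipeline in plain Python. The stage names (emitting, shipping, storage, presentation) are the book's; the function bodies are invented for illustration and are not OTel's actual API:

```python
# A toy telemetry pipeline showing the stages the book is organized around.
# OTel standardizes roughly the emitting and shipping stages; storage and
# presentation are left to the implementer (or a vendor).
import json
import time

def emit(event: str, **attrs) -> dict:
    """Emitting stage: code produces a signal with attributes."""
    return {"ts": time.time(), "event": event, **attrs}

def ship(signal: dict, pipeline: list) -> None:
    """Shipping stage: move signals toward storage (here, an in-memory list)."""
    pipeline.append(json.dumps(signal))  # serialization happens in transit

def store(pipeline: list) -> list[dict]:
    """Storage stage: persist and index; deliberately out of scope for OTel."""
    return [json.loads(s) for s in pipeline]

def present(stored: list[dict], event: str) -> int:
    """Presentation stage: reporting, here a count of one event type."""
    return sum(1 for s in stored if s["event"] == event)

pipeline: list = []
ship(emit("login_failed", user="alice"), pipeline)
ship(emit("login_failed", user="bob"), pipeline)
ship(emit("login_ok", user="bob"), pipeline)
print(present(store(pipeline), "login_failed"))  # 2
```

Swap the list for Kafka, the JSON for protobufs, and the counter for a dashboard, and you have the shape of every telemetry system I wrote about.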

One of the consistent comments I got during the pre-publication reviews was: "I want to know what to track."

My answer was simple: that's not the book I'm writing.

This book is for you, the growth engineer tasked with taking a Kafka topic (or group of topics) of logging data, sent there by OTel, and transforming it in the big Databricks instance alongside all the other business data.

This book is for you, the network engineer tasked with extracting network metrics out of a proprietary system, so you can chart network things in the main engineering dashboarding platform.

This book is for you, the security engineer tasked with extracting security event data out of a cloud provider to put into the SIEM system.

This book is for you, the project manager who has just been given a digital transformation project to revitalize how all the internally developed apps will produce telemetry, and how engineers will observe the system.

A sentiment crossed Fediverse recently, in the vein of "RSS was peak social media, change my mind". The original post was from https://hachyderm.io/@Daojoan@mastodon.social and is quoted below:

RSS never tracked you.
Email never throttled you.
Blogs never begged for dopamine.
The old web wasn’t perfect.
But it was yours.

https://mastodon.social/@Daojoan/114587431688413845

I was there for the rise and fall of blogging, so the rest of this post is me overthinking this particular post.

The Department of Government Efficiency, Musk's vehicle, made news by "discovering" the General Services Administration uses tapes, and plans to save $1M by switching to something else (disks, or cloud-based storage). Long-time readers of this blog may remember I used to talk a lot about storage and tape backup. Guess it's time to get my antique Storage Nerd hat out of the closet (this is my first storage post since 2013) to explain why tape is still relevant in an era of 400Gb backbone networks and 30TB SMR disks.

The SaaS revolution has utterly transformed the office automation space. The job I had in 2005, in the early years of this blog, only exists in small pockets anymore. So many office systems have been SaaSified that the old problems I used to blog about, around backups and storage tech, are much less pressing in the modern era. Where those problems persist are places with decades of old file data, starting in the mid-to-late 1980s, that is still being hauled around. Even when I was still doing this in the late 2000s, the needle was shifting to large arrays of cheap disks replacing tape arrays.

Where you still see tape being used are offices with policies for "off-site" or "offline" storage of key office data. A lot of that is done on disk these days, but some offices have kept their tape libraries. The InfoSec space is keen to point out you can't crypto-locker an offline tape, so offline tape is a useful tool in recovering from a ransomware incident. I suspect a lot of what DOGE found was in this category of offices retaining tape infrastructure. Is disk cheaper here? Marginally; the true savings will be much less than the $1M headline rate.

But there is another area where tape continues to be the economical option, and it's another area DOGE is going to run into: large scientific datasets.

To explain why, I want to use a contrasting example: A vacation picture you took on an iPhone in 2011, put into Dropbox, shared twice, and haven't looked at in 14 years. That file has followed you to new laptops and phones, unseen, unloved, but available. A lot goes into making sure it's available.

All the big object-stores like S3, and file-sync-and-share services (like Dropbox, Box, MS live, Google Drive, Proton Drive, etc) use a common architecture because this architecture has been proven to be reliable at avoiding visible data-loss:

  • Every uploaded file is split into 4KB blocks (the size is relevant to disk technology, which I'm not going into here)
  • Each block is written between 3 and 7 times to disk in a given datacenter or region, the exact replication factor changes based on service and internal realities
  • Each block is replicated to more than one geographic region as a disaster resilience move, generally at least 2, often 3 or more

The end result of the above is that the 1MB vacation picture is written to disk 6 to 14 different times. The nice thing about the above is you can lose an entire rack-row of a datacenter and not lose data; you might lose 2 of your 5 copies of a given block, but you have 3 left to rebuild, and your other region still has full copies.

But I mentioned this 1MB file has been kept online for 14 years. Assuming an average disk life-span of 5 years, each block has been migrated to new hardware 3 times in those years. Meaning each 4KB block of that file has been resident on between 24 and 56 hard drives; more, if your provider replicates to more than 2 discrete geographic regions. Those drives have been spinning and using power (and therefore requiring cooling) the entire time.
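The drive-count arithmetic is worth making explicit. A sketch using this post's assumptions (3 to 7 copies per region, 2 regions, 3 fleet migrations over 14 years), counting the original write plus one fresh drive per copy per migration:

```python
def drives_hosting_block(copies_per_region: int, regions: int, migrations: int) -> int:
    """Distinct drives one block has lived on: the original write of every
    copy, plus a fresh drive for each copy every time the fleet ages out."""
    live_copies = copies_per_region * regions
    return live_copies * (1 + migrations)

# Low end: 3 copies x 2 regions. High end: 7 copies x 2 regions.
print(drives_hosting_block(3, 2, migrations=3))  # 24
print(drives_hosting_block(7, 2, migrations=3))  # 56
```

Every one of those drives was powered, spinning, and cooled for its whole service life, just in case you asked for that photo.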

These systems need to go to all of this effort because they need to be sure that all files are available all the time, when you need it, where you need it, as fast as possible. If a person in that vacation photo retires, and you suddenly need that picture for the Retirement Montage at their going away party, you don't want to wait hours for it to come off tape. You want it now.

Contrast this to a scientific dataset. Once the data has stopped being used for Science! it can safely be archived until someone else needs to use it. This is the use-case behind AWS S3 Glacier: you pay a lot less for storing data, so long as you're willing to accept delays measurable in hours before you can access it. This is also the use-case where tape shines.

A lab gets done chewing on a dataset sized at 100TB, which is pretty chonky for 2011. They send it to cold storage. Their IT section dutifully copies the 100TB dataset onto LTO-5 tapes at 1.5TB per tape, for a stack of 67 tapes, and removes the dataset from their disk-based storage arrays.

Time passes, as with the Dropbox-style data. LTO drives can read media from one or two generations prior. Assuming the lab IT section keeps up on tape technology, it would be the advent of LTO-7 in 2015 that prompts a great restore-and-rearchive effort for all LTO-5 and earlier media. LTO-7 can do 6TB per tape, for a much smaller stack of 17 tapes.

LTO-8 changed this, with only a one-generation lookback. So when LTO-8 comes out in 2017 with a 12TB native capacity, the restore/rearchive effort runs again, shrinking our stack of tapes from 17 to 9. LTO-9 comes out in 2021 with 18TB per tape, and the stack reduces to 6 tapes to hold 100TB.

All in all, our cold dataset had to relocate to new media three times, same as the disk-based stuff. However, keeping stacks of tape in a climate-controlled room is vastly cheaper than a room of powered, spinning disk. The actual reality is somewhat different: the few data-archive people I know mention they do great restore/rearchive runs about every 8 to 10 years, largely driven by changes in drive connectivity (SCSI, SATA, FibreChannel, Infiniband, SAS, etc.), OS and software support, and corporate purchasing cycles. Keeping old drives around for as long as possible is fiscally smart, so the true number of recopy events for our example dataset is likely one.
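The shrinking tape stack is just a ceiling division against each generation's native capacity. A sketch using the published native (uncompressed) LTO capacities:

```python
import math

# Published native capacities per LTO generation, in TB (compressed ~2.5x).
LTO_TB = {"LTO-5": 1.5, "LTO-6": 2.5, "LTO-7": 6.0, "LTO-8": 12.0, "LTO-9": 18.0}

def tapes_needed(dataset_tb: float, generation: str) -> int:
    """Size of the tape stack after a restore/rearchive onto one generation."""
    return math.ceil(dataset_tb / LTO_TB[generation])

for gen in ("LTO-5", "LTO-7", "LTO-8", "LTO-9"):
    print(gen, tapes_needed(100, gen))
```

Each restore/rearchive pass buys back shelf space with nothing but drive time, which is why archive shops only bother every couple of generations.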

So another lab wants to use that dataset and puts in a request. A day later, the data is on a disk-array for usage. Done. Carrying costs for that data in the intervening 14 years are significantly lower than the always available model of S3 and Dropbox.

Tape: still quite useful in the right contexts.

Applied risk management

I've been in the tech industry for an uncomfortable amount of time, but I've been doing resilience planning the whole time. You know, when and how often to take backups, segueing into worrying about power diversity, things like that. My last two years at Dropbox gave me exposure to how that works when you have multiple datacenters. It gets complex, and there are enough moving parts you can actually build models around expected failure rates in a given year to better help you prioritize remediation and prep work.

Meanwhile, everyone in the tech-disaster industry peeps over the shoulders of environmental disaster recoveries like hurricanes and earthquakes. You can learn a lot by watching the pros. I've talked about some of what we learned; mostly it has been procedural in nature.

Since then, the United States elected a guy who wants to be dictator, and a Congress who seems willing to let it happen. For those of us in the disliked minority of the moment, we're facing concerted efforts to roll back our ability to exist in public. That's risk. Below the fold I talk about how I apply what I learned from IT risk management to assess my own risks. It turns out risk models for "dictatorship in America" can't rely on prior art the way models for "datacenter going offline" can; the latter has plenty of prior art, and even incident rates to factor in.

Blog Questions Challenge 2025

Thanks to Ben Cotton for sharing.

Why did you start blogging in the first place?

I covered a lot of that in 20 years of this nonsense from about a year ago. The quick version is I was charged with creating a "Static pages from your NetWare home directory" project and needed something to test with, so here we are. That version was done with Blogger before the Google acquisition, when they still supported publish-by-ftp (which I also had to set up as part of the same project).

What platform are you using to manage your blog, and why do you use it?

When Blogger got rid of the publish-by-ftp method, I had to move. I came to my own domain and went looking for blogging software. On advice from an author I like, I kept the slashdot effect in mind: I wanted to be sure that if an article drew an order of magnitude more traffic, it wouldn't melt the server it was on. So I wanted something relatively lightweight, which at the time was Movable Type. Wordpress required database hits for every webpage, which didn't seem to scale.

I stuck with it because Movable Type continues to do the job quite well and remains ergonomic for me. I turned off comments a while ago, as that was an anti-spam nightmare I needed recency to solve. Movable Type now requires a thousand dollars a year for a subscription, which pencils out to about $125 per blog post at my current posting rate. Not worth it.

Have you blogged on other platforms before?

Like just about everyone my age, I was on Livejournal. I don't remember if this blog or LJ came first, and I'm not going to go check. I had another blog on Blogger for a while, about local politics. It has been lost to time, though is still visible on archive.org if you know where to look for it.

How do you write your posts?

Most are spur of the moment. I have a topic, and time, and remember I can be long-form about it. Once in a while I'll get into something on social media and realize I need actual wordcount to do it justice, so I do it here instead. The advent of twitter absolutely slowed down my posting rate here!

Once I have the words in, I schedule a post for a few days hence.

When do you feel most inspired to write?

As with all writers, it comes when it comes. Sometimes I set out goals and I stick to them. But blogging hasn't been a focus of mine for a long time, so it's entirely whim. I do know I need an hour or so of mostly uninterrupted time to get my thoughts in order, which is hard to come by without arranging for it.

Do you normally publish immediately after writing, or do you let it simmer a bit?

As mentioned above, I use scheduled-post, typically for 9am, unless I've got something spicy and don't care. That is rare; I've learned that posting spicy takes absolutely needs a cooling-off period. I've pulled posts after writing them because I realized they didn't actually need to get posted; I merely needed to write them.

What's your favorite post on your blog?

That's changed a lot over the years as I've changed.

  • For a long time, I was proud of my Know your IO series from 2010. That was prompted by a drop-by conversation from one of our student workers who had a question about storage technology. I infodumped for most of an hour, and realized I had a blog series. This is still linked from my sidebar on the right.
  • From recent history, the post Why I don't like Markdown in a git repo as documentation is a still accurate distillation of why I seriously dislike this reflexive answer to workplace knowledge sharing.
  • This post about the lost history of why you wait for the first service pack before deploying anything is me bringing old-timer points of view to newer audiences. The experiences in this post are drawn directly from where I was working in 2014-2015. Yes Virginia, people still do ship shrink-wrap software to Enterprise distros. Some of you are painfully aware of this.

I'm not stopping blogging any time soon. At some point the dependency chain for Movable Type will rot and I'll have to port to something else, probably a static site generator. I believe I'm spoiled for choice in that domain.