Blog Questions Challenge 2025

Thanks to Ben Cotton for sharing.

Why did you start blogging in the first place?

I covered a lot of that in 20 years of this nonsense from about a year ago. The quick version is I was charged with creating a "Static pages from your NetWare home directory" project and needed something to test with, so here we are. That version was done with Blogger before the Google acquisition, when they still supported publish-by-ftp (which I also had to set up as part of the same project).

What platform are you using to manage your blog, and why do you use it?

When Blogger got rid of the publish-by-ftp method, I had to move. I came to my own domain and went looking for blogging software. On advice from an author I like, I kept the Slashdot effect in mind and wanted to be sure that if an article got an order of magnitude more traffic than usual it wouldn't melt the server it was on. So I wanted something relatively lightweight, which at the time was Movable Type. WordPress required database hits for every page view, which didn't seem to scale.

I stuck with it because Movable Type continues to do the job quite well and remains ergonomic for me. I turned off comments a while ago, as that was an anti-spam nightmare I needed recency to solve. Movable Type now requires a thousand dollars a year for a subscription, which pencils out to about $125 per blog post at my current posting rate. Not worth it.

Have you blogged on other platforms before?

Like just about everyone my age, I was on LiveJournal. I don't remember if this blog or LJ came first, and I'm not going to go check. I had another blog on Blogger for a while, about local politics. It has been lost to time, though it's still visible on archive.org if you know where to look for it.

How do you write your posts?

Most are spur of the moment. I have a topic, and time, and remember I can be long-form about it. Once in a while I'll get into something on social media and realize I need actual wordcount to do it justice, so I do it here instead. The advent of Twitter absolutely slowed down my posting rate here!

Once I have the words in, I schedule a post for a few days hence.

When do you feel most inspired to write?

As with all writers, it comes when it comes. Sometimes I set out goals and I stick to them. But blogging hasn't been a focus of mine for a long time, so it's entirely whim. I do know I need an hour or so of mostly uninterrupted time to get my thoughts in order, which is hard to come by without arranging for it.

Do you normally publish immediately after writing, or do you let it simmer a bit?

As mentioned above, I use scheduled posts, typically for 9am, unless I've got something spicy and don't care. That's rare; I've also learned that posting spicy takes absolutely needs a cooling-off period. I've pulled posts after writing them because I realized they didn't actually need to get posted, I merely needed to write them.

What's your favorite post on your blog?

That's changed a lot over the years as I've changed.

  • For a long time, I was proud of my Know your IO series from 2010. That was prompted by a drop-by conversation with one of our student workers who had a question about storage technology. I infodumped for most of an hour, and realized I had a blog series. This is still linked from my sidebar on the right.
  • From recent history, the post Why I don't like Markdown in a git repo as documentation is a still accurate distillation of why I seriously dislike this reflexive answer to workplace knowledge sharing.
  • This post about the lost history of why you wait for the first service pack before deploying anything is me bringing old-timer points of view to newer audiences. The experiences in this post are drawn directly from where I was working in 2014-2015. Yes Virginia, people still do ship shrink-wrap software to Enterprise distros. Some of you are painfully aware of this.

I'm not stopping blogging any time soon. At some point the dependency chain for Movable Type will rot and I'll have to port to something else, probably a static site generator. I believe I'm spoiled for choice in that domain.

Back in November I posted about how to categorize your incidents using the pick-one list common across incident automation platforms. In that post I said:

A few organizations go so far as to have a fully separate process for the 'High' and 'Critical' urgencies of events, maybe calling them Disaster Recovery events instead of Incidents. DR events need to be rare, which means that process isn't as well exercised as Incident response. However, a separate process makes it abundantly clear that certain urgencies and scopes require different process overall. More on this in a later blog-post.

This is the later blog post.

The SaaS industry as a whole has been referring to the California Fire Command (now known as the Incident Command System) model for inspiration on handling technical incidents. The basic structure is familiar to any SaaS engineer:

  • There is an Incident Commander who is responsible for running the whole thing, including post-incident processes
  • There is a Technical Lead who is responsible for the technical response

There may be additional roles available depending on organizational specifics:

  • A Business Manager who is responsible for the customer-facing response
  • A Legal Manager who is responsible for anything to do with legal
  • A Security Lead who is responsible for heading security investigations

Again, familiar. But truly large incidents put stress on this model. In a given year the vast majority of incidents experienced by an engineering organization will be the grass-fire variety that can be handled by a team of four people in under 30 minutes. What happens when a major event hits?

The example I'm using here is a private information disclosure by a hostile party using a compromised credential. Someone not employed by the company dumped a database they shouldn't have had access to, and that database involved data that requires disclosure in the case of compromise. Given this, we already know some of the workstreams that incident response will be doing once this activity is discovered:

  • Investigatory work to determine where else the attacker got access to and fully define the scope of what leaked
  • Locking down the infrastructure to close the holes used by the attacker for the identified access
  • Cycling/retiring credentials possibly exposed to the attacker
  • Regulated notification generation and execution
  • Technical remediation work to lock down any exploited code vulnerabilities

An antiseptic list, but a scary one. The moment the company officially notices a breach of private information, legislation world-wide starts timers on when privacy regulators or the public need to be informed. For a profit-driven company, this is admitting fault in public, which is something none of them do lightly due to the lawsuits that will result. For publicly traded companies, stockholder notification will also need to be generated. Incidents like this look very little like an availability SLA-breach SEV of the kind that happens 2-3 times a month in different systems.

Based on the rubric I showed back in November, an incident of this type is of Critical urgency due to regulated timelines, and will require either Cross-Org or C-level response depending on the size of the company. What's more, the need to figure out where the attacker went blocks later stages of response, so this response process will actually be a 24 hour operation and likely run several days. No one person can safely stay awake for 4+ days straight.

The Incident Command Process defines three types of command structure:

  • Solitary command - where one person is running the whole show
  • Unified command - where multiple jurisdictions are involved and they need to coordinate, and also to provide shift changes through rotating who is the Operations Chief (what SaaS folk call the Technical Lead)
  • Area command - where multiple incidents are part of a larger complex, the Area Commander supports each Incident Command

Incidents of the scale of our private information breach lean into the Area Command style for a few reasons. First and foremost, there are discrete workstreams that need to be executed by different groups; such as the security review to isolate scope, building regulated notifications, and cycling credentials. All those workstreams need people to run them, and those workstream leads need to report to incident command. That looks a lot like Area Command to me.

If your daily incident experience is 4-7 person team responses, how ready are you to be involved in an Area Command style response? Not at all.

If you've been there for years and have seen a few multi-org responses in your time, how ready are you to handle an Area Command style response? Better; you might be a good workstream lead.

One thing the Incident Command Process makes clear is that Area Commanders do not have an operational role, meaning they're not involved in the technical remediation. Their job is coordination, logistics, and high-level decision making across response areas. For our pretend SaaS company, a good Area Commander will be:

  • Someone who has experience with incidents involving legal response
  • Someone who has experience with large security response, because the most likely incidents of this size are security related
  • Someone who has experience with incidents involving multiple workstreams requiring workstream leaders
  • Someone who has experience communicating with C-Levels and has their respect
  • Two to four of these people, in order to safely staff a 24-hour response running multiple days

Is your company equipped to handle this scale of response?

In many cases, probably not. Companies handle incidents of this type a few different ways. As I mentioned in the earlier post, some categorize problems like this as a disaster instead of an incident and invoke a different process. This has the advantage of making it clear the response for these is different, at the cost of having far fewer people familiar with the response methods. You make up for the lack of in situ, learn-by-doing training by regularly re-certifying key leaders on the process.

Other companies extend the existing incident response process on the fly rather than risk having a separate process that will get stale. This works so long as you have some people around who kind of know what they're doing and can herd others into the right shape. Though, after the second disaster of this scale, people will start talking about how to formalize procedures.

Whichever way your company goes, start thinking about this. Unless you're working for the hyperscalers, incidents of this response scope are going to be rare. This means you need to schedule quarterly time to train, practice, and certify your Area Commanders and workstream leads. This will speed up response time overall, because less time will be spent arguing over command and feedback structures.

"Columnar databases store data in columns, not rows," says the definition. I made a passing reference to the technology in Software Telemetry, but didn't spend any time on what they are and how they can help telemetry and observability. Over the last six months I worked on converting a centralized logging flow based on Elasticsearch, Kibana, and Logstash, to one based on Logstash and Apache Spark. For this article, I'll be using examples from both methods to illustrate what columnar databases let you do in telemetry systems.

How Elasticsearch is columnar (or not)

First of all, Elasticsearch isn't exactly columnar, but it can fake it to a point. You use Elasticsearch when you need full indexing and tokenization of every field in order to accelerate query-time performance. Born as it was in the early part of the 2010s, Elasticsearch takes on ingestion-side complexity in order to optimize read-side performance. There is a reason that if you have a search field in your app, there is a good chance that Elasticsearch or OpenSearch is involved in the business logic. While Elasticsearch is "schema-less," schema still matters, and there are clear limits to how many fields you can add to an Elasticsearch index.

Each Elasticsearch index or datastream has defined fields. Fields can be defined at index/datastream creation, or configured to auto-create on first use. Both are quite handy in telemetry contexts. Each document in an index or datastream has a reference for every defined field, even if the contents of that field are null. If you have 30K fields and one document has only 19 fields defined, the rest will still exist on the document but be nulled, which in turn makes that 19 defined-field document rather larger than the same document in an index/datastream with only 300 defined fields.

Larger average document size slows down search for everything in general, due to the number and size of field-indexes the system has to keep track of. This also balloons index/datastream size, which has operational impacts when it comes to routine operations like patching and maintenance. As I mentioned in Software Telemetry, Elasticsearch's cardinality problem manifests in number of fields, not in unique values in each field.

If you are willing to get complicated in your ingestion pipeline, carefully crafting telemetry shape and ingesting into multiple index/datastreams so that similarly shaped telemetry is bucketed together, you can mitigate some of the above problems. Create an alias to use as your search endpoint, and populate the alias with the index/datastreams of your various shards. Elasticsearch is smart enough to know where to search, which lets you bucket your field-count cardinality problems in ways that will perform faster and save space. However, this is clearly adding complexity that you have to manage yourself.
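Wiring up that alias is not much work. Here is a minimal sketch in Python using the requests module against the standard _aliases API; the index names (logs-app, logs-infra), the alias name (logs-search), and the localhost cluster URL are all hypothetical stand-ins for your own setup:

import requests

# Group two hypothetical telemetry-shape indexes behind one search alias.
actions = {
    "actions": [
        {"add": {"index": "logs-app", "alias": "logs-search"}},
        {"add": {"index": "logs-infra", "alias": "logs-search"}},
    ]
}

resp = requests.post("http://localhost:9200/_aliases", json=actions)
resp.raise_for_status()

Searches against /logs-search/_search then behave like a single endpoint, while each backing index keeps its own, smaller, field list.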

How Apache Spark is columnar

Spark is pretty clearly columnar, which is why it's the de facto platform of choice for Business Intelligence operations. You know, telemetry for business ops, not engineering ops. A table defined in Spark (and most of its backing stores, like Parquet or Hive) can have arbitrary columns defined in it. Data for each column is stored in separate files, which means a query like "COUNT timestamp GROUP BY hour(timestamp)", building a histogram of log-entries per hour, is extremely efficient: the system only needs to look at a single column's files out of thousands.
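In PySpark terms, that histogram looks something like this sketch; the Parquet path and the timestamp column name are hypothetical stand-ins for whatever your logging pipeline writes:

from pyspark.sql import SparkSession
from pyspark.sql.functions import hour

spark = SparkSession.builder.appName("log-histogram").getOrCreate()
logs = spark.read.parquet("s3://example-bucket/centralized-logs/")

# Only the files backing the timestamp column get read for this query.
histogram = (
    logs.groupBy(hour("timestamp").alias("hour_of_day"))
        .count()
        .orderBy("hour_of_day")
)
histogram.show()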

Columnar databases have to do quite a bit of read-time and ingestion-time optimization to truly perform fast, which demonstrates some of the tradeoffs of the style. Where Elasticsearch was trading ingestion-time complexity to speed up read-time performance, columnar databases tilt the needle more towards read-time complexity in order to optimize overall resource usage. In short, columnar databases have better scaling profiles than something like Elasticsearch, but they don't query as fast as a result of the changed priorities. This is a far easier trade-off to make in 2024 than it was in 2014!

Columnar databases also don't tokenize the way Elasticsearch does. Have a free-text field that you want to do sub-string searches on? Elasticsearch is built from the bolts out to make that search as fast as possible. Columnar databases, on the other hand, do all of the string walking and searching at query-time instead of pulling the values out of some b-trees.

Where Elasticsearch suffers performance issues when field-count rises, Spark only encounters this problem if the query invites it through "select *" or similar constructs. The files hit by the query will only be the ones for columns referenced in the query! Have a table with 30K columns in it? So long as you query right, it should perform quite well; the 19-defined-fields-in-a-row problem stops being a problem so long as you're only referencing one of those 19 fields/columns.
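To make that concrete, a quick sketch of the difference, using the same hypothetical table and columns as the histogram example above:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
logs = spark.read.parquet("s3://example-bucket/centralized-logs/")

# Reads only the files for the two named columns, however wide the table is.
narrow = logs.select("timestamp", "status_code").where(col("status_code") >= 500)

# Drags every column's files into the scan -- this is where 30K columns hurt.
wide = logs.select("*").where(col("status_code") >= 500)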

Why columnar is neat

A good centralized logging system can stand in for both metrics and traces, and in large part can do so because the backing databases for centralized logging are often columnar or columnar-like. There is nothing stopping you from creating metric_name and metric_value fields in your logging system, and building a bunch of metrics-type queries using those rows.
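A sketch of what those metrics-type queries can look like, assuming hypothetical metric_name and metric_value fields on a Parquet-backed log table; a classic time-bucketed average falls out of a plain groupBy:

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, window

spark = SparkSession.builder.getOrCreate()
logs = spark.read.parquet("s3://example-bucket/centralized-logs/")

# Average request latency per 5-minute bucket, computed straight from log rows.
latency = (
    logs.where(col("metric_name") == "request_latency_ms")
        .groupBy(window("timestamp", "5 minutes"))
        .agg(avg("metric_value").alias("avg_latency_ms"))
)
latency.show(truncate=False)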

As for emulating tracing, this isn't done through OpenTelemetry; it's done old-school, through hacking. Chapter 5 in Software Telemetry covered how the Presentation Stage uses correlation identifiers:

"A correlation identifier is a string or number that uniquely identifies that specific execution or workflow."

Correlation identifiers allow you to build the charts that tracing systems like Jaeger, Tempo, and Honeycomb are known for. There is nothing stopping you from creating an array-of-strings type field named "span_id" where you dump the span-stack for each log-line. Want to see all the logs for a given Span? Here you are. Given a sophisticated enough visualization engine, you can even emulate the waterfall diagrams of dedicated tracing platforms.
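"All the logs for a given span" turns into a single array-membership filter. A sketch, with a hypothetical span_id array column plus made-up service and message columns:

from pyspark.sql import SparkSession
from pyspark.sql.functions import array_contains, col

spark = SparkSession.builder.getOrCreate()
logs = spark.read.parquet("s3://example-bucket/centralized-logs/")

# Every log line whose span-stack includes the span we care about.
span_logs = (
    logs.where(array_contains(col("span_id"), "4bf92f3577b34da6"))
        .orderBy("timestamp")
        .select("timestamp", "service", "message")
)
span_logs.show(truncate=False)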

The reason we haven't used columnar databases for metrics systems has to do with cost. If you're willing to accept cardinality limits, you can store a far greater number of metrics for the same amount of money than you could in a columnar database. However, the biggest companies are already using columnar datastores for engineering metrics, and nearly all companies are using columnar for business metrics.

But if you're willing to spend the extra resources to use a columnar-like datasource for metrics, you can start answering questions like "how many 5xx response-codes did accounts with the Iridium subscription encounter on October 19th." Traditional metrics systems would consider subscription-type too highly cardinal, where columnar databases shrug and move on.
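That question is a one-line filter in a columnar store. A sketch, assuming hypothetical status_code and subscription_tier columns (and picking 2024 for the date):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.getOrCreate()
logs = spark.read.parquet("s3://example-bucket/centralized-logs/")

# 5xx responses served to Iridium-tier accounts on one day.
five_xx_count = logs.where(
    (to_date(col("timestamp")) == "2024-10-19")
    & (col("status_code") >= 500)
    & (col("subscription_tier") == "Iridium")
).count()
print(five_xx_count)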

What this means for the future of telemetry and observability

Telemetry over the last 60 years of computing has gone from digging through the SYSLOG printout from one of your two servers, to digging through /var/log/syslog, to the creation of dedicated metrics systems, to the creation of tracing techniques. Every decade's evolution of telemetry has been constrained by the compute and storage performance envelope available to the average system operator.

  • The 1980s saw the proliferation of multi-server architectures as the old mainframe style went out of fashion, so centralized logging had to involve the network. NFS shares for Syslog.
  • The 1990s saw the first big scale systems recognizable as such by people in 2024, and the beginnings of analytics on engineering data. People started sending their web-logs direct to relational databases, getting out of the "tail and grep" realm and into something that kinda looks like metrics if you squint. Distributed processing got its start here, though hardly recognizable today.
  • The 2000s saw the first bespoke metrics systems and protocols, such as statsd and graphite. This era also saw the SaaS revolution begin, with Splunk being a big name in centralized logging, and NewRelic gaining traction for web-based metrics. Distributed processing got more involved, and at the end of the decade the big companies like Google and Microsoft lived and breathed these systems. Storage was still spinning disk, with some limited SSD usage in niche markets.
  • The 2010s saw the first tracing systems and the SaaS revolution ate a good chunk of the telemetry/observability space. The word observability entered wide usage. Distributed processing ended the decade as the default stance for everything, including storage. Storage bifurcated into bulk (spinning disk) and performance (SSD) tiers greatly reducing cost.

We're partway through the 2020s, and it's already clear to me that columnar databases are probably where telemetry systems are going to end up by the end of the decade. Business intelligence is already using them, so most of our companies already have them in the infrastructure. Barriers to adoption are going to be finding ways to handle the different retention and granularity requirements of what we now call the three pillars of observability:

  • Metrics need visibility going back years, and are aggregated not sampled. Observability systems doing metrics will need to allow multi-year retention somehow.
  • Tracing retention is 100% based on cost and sample-rate, which should improve over the decade.
  • Centralized logging is like tracing in that retention is 100% based on cost. True columnar stores scale more economically than Elasticsearch-style databases, which increases retention. How sample rate affects retention is less clear, and would have to involve some measure of aggregation to remain viable over time.

Having columnar databases at the core allows a convergence of the pillars of observability. How far we get in convergence over the next five years remains to be seen, and I look forward to finding out.

Incident response programs

Honeycomb had a nice post where they describe dropping a priority list of incident severities in favor of an attribute list. Their list is still a pick-one list; but instead of using a 1-4 SEV scale, they're using a list of types like "ambiguous," "security," and "internal." The post goes into some detail about the problems with a unified list across a large organization, and the different response-level needs of different types of incidents. All very true.

A good incident response program needs to be approachable by anyone in the company, meaning anyone looking to open one should have reasonable success in picking incident attributes right. The incident automation industry, tools such as PagerDuty's Jeli and the Rootly platform, has settled on a pick-one list for severity, sometimes with support for additional fields. Unless a company is looking to home-build its own incident automation for creating Slack channels, managing the post-incident review process, and tracking remediation action items, these de facto conventions constrain the options available to an incident response program.

As Honeycomb pointed out, there are two axes that need to be captured by "severity": urgency, and level of response. I propose the following pair of attributes, with a rough data-structure sketch after the definitions:

Urgency

  1. Planning: the problem can be addressed through normal sprint or quarterly planning processes.
  2. Low: the problem has long lead times to either develop or validate the solution, where higher urgency would result in a lot of human resources stuck in wait loops.
  3. Medium: the problem can be addressed in regular business hours operations; waiting overnight or a weekend won't make things worse. Can preempt sprint-level deliverable targets without question.
  4. High: the problem needs around-the-clock response and can preempt quarterly deliverable targets without question.
  5. Critical: the problem requires investor notification or other regulated public disclosure, and likely affects annual planning. Rare by definition.

Level of response

  1. Individual: The person who broke it can revert/fix it without much effort, and impact blast-radius is limited to one team. Post-incident review may not be needed beyond the team level.
  2. Team: A single team can manage the full response, such as an issue with a single service. Impact blast radius is likely one team. Post-incident review at the peer-team level.
  3. Peer team: A group of teams in the same department are involved in response due to interdependencies or the nature of the event. Impact blast-radius is clearly multi-team. Post-incident review at the peer-team level, and higher up the org-chart if the management chain is deep enough for it.
  4. Cross-org: Major incident territory, where the issue cuts across more than one functional group. These are rare. Impact blast-radius may be whole-company, but likely whole-product. Post-incident review will be global.
  5. C-level: A high executive needs to run it because the response is whole-company in scope. Will involve multiple post-incident reviews.

Is Private? Yes/No - If yes, only the people involved in the response are notified of the incident and updates. Useful for Security and Compliance type incidents, where discoverability is actually bad. Some incidents qualify as Material Non-Public Information, which matters to companies with stocks being traded.
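For concreteness, here's a rough sketch of the proposal as a data structure. The Python names here are illustrative only, not something any incident platform actually ships:

from dataclasses import dataclass
from enum import Enum

class Urgency(Enum):
    PLANNING = 1
    LOW = 2
    MEDIUM = 3
    HIGH = 4
    CRITICAL = 5

class ResponseLevel(Enum):
    INDIVIDUAL = 1
    TEAM = 2
    PEER_TEAM = 3
    CROSS_ORG = 4
    C_LEVEL = 5

@dataclass
class Incident:
    title: str
    urgency: Urgency
    response_level: ResponseLevel
    is_private: bool = False

# One of the "main sequence" pairs discussed below: a grass-fire incident.
example = Incident(
    title="checkout latency spike",
    urgency=Urgency.MEDIUM,
    response_level=ResponseLevel.TEAM,
)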

The combinatorics indicate that 5*5=25 pairs, 50 if you include Is Private, which makes for an unwieldy pick-one list. However, like stellar types there is a kind of main sequence of pairs that are more common, with problematic outliers that make simple solutions a troublesome fit. Let's look at a few pairs that are on the main sequence of event types:

  • Planning + Individual: Probably a feature-flag had to be rolled back real quick. Spend some time digging into the case. Incidents like this sometimes get classified "bug" instead of "incident."
  • Low + Team: Such as a Business Intelligence failure, where revenue attribution was discovered to be incorrect for a new feature, and time is needed to back-correct issues and validate against expectations. May also be classified as "bug" instead of "incident."
  • Medium + Team: Probably the most common incident type that doesn't get classified as a "bug," these are the highway verge grass fires of the incident world; small in scope, over quick, one team can deal with it.
  • Medium + Peer Team: Much like the previous but involving more systems in scope. Likely requires coordinated response between multiple teams to reach a solution. These teams work together a lot, by definition, so it would be a professional and quick response.
  • High + Cross-org: A platform system had a failure that affected how application code responds to platform outages, leading to a complex, multi-org response. Response would include possibly renegotiating SLAs between platform and customer-facing systems. Also, remediating the Log4J vulnerability, which requires touching every usage of java in the company inclusive of vendored usage, counts as this kind of incident.
  • Critical + Cross-org: An event like the Log4J vulnerability, and the Security org has evidence that security probes found something. The same remediation response as the previous, but with added "reestablish trust in the system" work on top of it, and working on regulated customer notices.

Six of 25 combinations. But some of the others are still viable, even if they don't look plausible on the surface. Let's look at a few:

  • Critical + Team: A bug is found in SOX reporting that suggests incorrect data was reported to stock-holders. While the C-levels are interested, they're not in the response loop beyond the 'stakeholder' role and being the signature that stock-holder communications will be issued under.
  • Low + Cross-org: Rapid retirement of a deprecated platform system, forcing the teams still using the old system to crash-migrate to the new one.
  • Planning + Cross-org: The decision to retire a platform system is made as part of an incident, and migrations are inserted into regular planning.

How is an organization supposed to build a usable pick-one list from this mess? This is hard work!

Some organizations solve this by bucketing incidents using another field, and allowing the pick-one list to mean different things based on what that other field says. A Security SEV1 gets a different scale of response than a Revenue SEV1, which in turn gets a different type of response than an Availability SEV1. Systems like this have problems with incidents that cross buckets, such as a Security issue that also affects Availability. It's for this reason that Honeycomb has an 'ambiguous' bucket.

A few organizations go so far as to have a fully separate process for the 'High' and 'Critical' urgencies of events, maybe calling them Disaster Recovery events instead of Incidents. DR events need to be rare, which means that process isn't as well exercised as Incident response. However, a separate process makes it abundantly clear that certain urgencies and scopes require different process overall. More on this in a later blog-post.

Other orgs handle the outlier problem differently, taking these events out of incidents and into another process altogether. Longer-flow problems, the Low urgency above, get called something like a Code Yellow after the Google practice, while the Critical + C-level pairs get a Code Red to handle the long-flow big problems.

Honeycomb took the bucketing idea one step further and dropped urgency and level of response entirely, focusing instead on incident type. A process like this still needs ways to manage urgency and response-scope differences, but this is being handled at a layer below incident automation. In my opinion, a setup like this works best when Engineering is around Dunbar's Number or less in size, allowing informal relationships to carry a lot of weight. Companies with deeper management chains, and thus more engineers, will need more formalism to determine cross-org interaction and prioritization.

Another approach is to go super broad with your pick-one list, and make it apply to everyone. While this approach disambiguates pretty well between the SEV 1 highest-urgency problems and the SEV 2 urgent-but-not-pants-on-fire ones, it's less good at disambiguating SEV 3 and SEV 4 incidents. Those incidents tend to only have local scope, so local definitions will prevail, meaning only locals will know how to correctly categorize issues.


There are several simple answers for this problem, but each simplification has its own problem. Your job is to pick the problems your org will put up with.

  • How much informal structure can you rely on? The smaller the org, the more one size is likely to fit all.
  • Do you need to interoperate with a separate incident response process, perhaps an acquisition or a parent company?
  • How often do product-local vs global incidents happen? For one product companies, these are the same thing. For companies that are truly multi-product, this distinction matters. The answer here influences how items on your pick-one list are dealt with, and whether incident reporters are likely to file cross-product reports.
  • Does your incident automation platform allow decision support in its reporting workflow? Think of a next, next, next, done wizard where each screen asks clarifying questions. Helpful for folk who are not sure how a given area wants their incidents marked up, less helpful for old hands who know exactly what needs to go in each field.

Rust and the Linux kernel

One of the kernel maintainers made social waves by bad-mouthing Rust and the project to rebuild the Linux kernel in Rust. The idea of rebuilding the kernel in "Rust: the memory-safe language" rather than "the C in CVE stands for C/C++" makes a whole lot of sense. However, there is more to a language than how memory-safe it is and whether a well-known engineer calls it a "toy" language.

One of the products offered by my employer is written in Elixir, which is built on top of Erlang. Elixir had an eight-or-so-month period of fame, which is when the decision to write that product was made. We picked Elixir because the Erlang engine gives you a lot of concurrency and async processing relatively easily. And it worked! That product was a beast on relatively little CPU. We had a few cases of 10x usage from customers, and it just scaled up, no muss no fuss.

The problems with the product came not in the writing, but in the maintaining and productionizing. Some of the issues we've had over the years, many of which got better as the Elixir ecosystem matured:

  • The ability to make a repeatable build, needed for CI systems
  • Dependency management in modules
  • Observability ecosystem support, such as OpenTelemetry SDKs
  • Build tooling support usable by our CI systems
  • Maturity of the module ecosystem, meaning we had to DIY certain tasks that our other main product never had to bother with. Or the modules that exist only covered 80% of the use-cases.
  • Managing Erlang VM startup during deploys

My opinion is that the dismissiveness from this particular Linux Kernel Maintainer had to do with this list. The Linux kernel and module ecosystem is massive, with highly complex build processes spanning many organizations, and regression testing frameworks to match. Ecosystem maturity matters way more for CI, regression, and repeatable build problems than language maturity.

Rust has something Elixir never had: durable mindshare. Yeah, the kernel rebuild process has taken many years, and has many years to go. Durable mindshare means that engineers are sticking with it, instead of chasing the next hot new memory safe language.

SysAdmins have no trouble making big lists of what can go wrong and what we're doing to stave that off a little longer. The tricky problem is pushing large organizations to take a harder look at systemic risks and take them seriously. I mean, the big companies have to have disaster recovery (DR) plans for compliance reasons; but there are a lot of differences between box-ticking DR plans and comprehensive DR plans.

Any company big enough to get past the "running out of money is the biggest disaster" phase has probably spent some time thinking about what to do if things go wrong. But how do you, the engineer in the room, get the deciders to think about disasters in productive ways?

The really big disasters are obvious:

  • The datacenter catches fire after a hurricane
  • The Region goes dark due to a major earthquake
  • Pandemic flu means 60% of the office is offline at the same time
  • An engineer or automation accidentally:
    • Drops all the tables in the database
    • Deletes all the objects out of the object store
    • Destroys all the clusters/servlets/pods
    • Deconfigures the VPN
  • The above happens and you find your backups haven't worked in months

All obvious stuff, and building to deal with them will let you tick the box for compliance DR. Cool.

But there are other disasters, the sneaky ones that make you think and take a hard look at process and procedures in a way that the "oops we lost everything of [x] type" disasters generally don't.

  • An attacker subverts your laptop management software (JAMF, InTune, etc) and pushes a cryptolocker to all employee laptops
  • 30% of your application secrets got exposed through a server side request forgery (SSRF) attack
  • Nefarious personages get access to your continuous integration environment and inject trojans into your dependency chains
  • A key third party, such as your payment processor, gets ransomwared and goes offline for three weeks
  • A Slack/Teams bot got subverted and has been feeding internal data to unauthorized third parties for months

The above are all kinda "security" disasters, and that's my point. SysAdmins sometimes think of these, but even we are guilty of not having the right mental models to rattle these off the top of our head when asked. Asking about disasters like this list should start conversations that generally don't happen. Or you get the bad case: people shrug and say "that's Security's problem, not ours," which is a sign you have a toxic reliability culture.

Security-type disasters have a phase that merely technical disasters lack: how do we restore trust in production systems? In technical disasters, you can start recovery as soon as you've detected the disaster. For security disasters recovery has to wait until the attacker has been evicted, which can take a while. This security delay means key recovery concepts like Recovery Time and Recovery Point Objectives (RTO/RPO) will be subtly different.

If you're trying to knock loose some ossified DR thinking, these security type disasters can crack open new opportunities to make your job safer.

I've now spent over a decade teaching how alarms are supposed to work (specific, actionable, with the appropriate urgency) and even wrote a book on how to manage metrics systems. One topic I was repeatedly asked to cover in the book, but declined because the topic is big enough for its own book, is how to do metrics right. The desire for an expert to lay down how to do metrics right comes from a number of directions:

  • No one ever looked at ours in a systematic way and our alerts are terrible [This is asking about alerts, not metrics; but they still were indirectly asking about metrics]
  • We keep having incidents and our metrics aren't helping, how do we make them help?
  • Our teams have so many alarms important ones are getting missed [Again, asking about alerts]
  • We've half assed it, and now we're getting a growth spurt. How do we know what we should be looking for?

People really do conflate alarms/alerts with metrics, so any discussion about "how do we do metrics better" is often a "how do we do alarms better" question in disguise. As for the other two points, where people have been using vibes to pick metrics and that's no longer scaling, we actually do have a whole lot of advice; you have a whole menu of "golden signals" to pick from depending on how your application is shaped.

That's only sort of why I'm writing this.

In the mathematical construct of Site Reliability Engineering, where everything is statistics and numerical analysis, metrics are easy. Track the things that affect availability, regularly triage your metrics to ensure continued relevance, and put human processes into place to make sure you're not burning out your worker-units. But the antiseptic concept of SRE only exists in a few places; the rest of us have to pollute the purity of math with human emotions. Let me explain.

Consider your Incident Management process. There are certain questions that commonly arise when people are doing the post incident reviews:

  • Could we have caught this before release? If so, what sort of pre-release checks should we add to catch this earlier?
  • Did we learn about this from metrics or customers? If customers, what metrics do we need to add to catch this earlier? If metrics, what processes or alarms should we tune to catch this earlier?
  • Could we have caught this before the feature flag rolled out to the Emerald users? Do we need to tune the alarm thresholds to catch issues like this in groups with less feature-usage before the high value customers on Emerald plans?

And so on. Note that each question asks about refining or adding metrics. Emotionally, metrics represent anxieties. Metrics are added to catch issues before they hurt us again. Metrics are retained because they're tracking something that used to hurt us and might hurt again. This makes removing metrics hard; the people involved remember why certain metrics are present and intuitively know those things need tracking, which means emotion says to keep them.

Metrics are scar tissue, and removing scar tissue is hard, bloody work. How do you reduce the number of metrics, while also not compromising your availability goals? You need the hard math of SRE to work down those emotions, but all it takes is one Engineering Manager to say "this prevented a SEV, keep it" to blow that effort up. This also means you'll have much better luck with a metric reformation effort if teams are already feeling the pinch of alert fatigue or your SaaS metric provider bills are getting big enough that the top of the company is looking at metric usage to reduce costs.

Sometimes, metrics feed into Business Intelligence. That's less about scar tissue and more about optimizing your company's revenue operations. Such metrics are less likely to lead to rapid-response on-call rotations, but still can lead to months long investigations into revenue declines. That's a different but related problem.

I could write a book about making your metrics suck less, but that book by necessity has to cover a lot of human-factors issues and has to account for the role of Incident Management in metrics sprawl. Metrics are scar tissue, keep that in mind.

In a Slack I'm on someone asked a series of questions that boil down to:

Our company has a Reliability team, but another team is ignoring SLA/SLO obligations. What can SRE do to fix this?

I got most of the way through a multi-paragraph answer before noticing my answer was, "This isn't SRE's job, it's management's job." I figured a blog post might help explain this stance better.

The genius behind the Site Reliability Engineer concept at Google is that they figured out how to make service uptime and reliability matter to business management. The mathematical framework behind SRE is all about quantifying risk and quantifying impact, which allows quantifying lost revenue, possibly even quantifying lost sales opportunity. All this quantifying falls squarely into the management mindset of "you can't manage what you can't measure," crossed with the "if I can't measure it, it's an outside dependency I can ignore" subtext. SRE is all about making uptime and reliability a business problem worth spending management cycles on.

In the questioner's case we already have some signal that their management has integrated SRE concepts into management practice:

  • They have a Reliability team, which only happens if someone in management believes reliability is important enough to devote dedicated headcount and a manager to.
  • They have Service Level Agreement and Service Level Objective concepts in place
  • Those SLA/SLO obligations apply to more teams than the Reliability team itself, indicating there is at least some management push to distribute reliability thinking outside of the dedicated Reliability team.

The core problem the questioner is running into is that this non-compliant team is getting away with ignoring SLA/SLO stuff, and the answer to "what can SRE do to fix this" is to be found in why and how that team is getting away with the ignoring. Management is all about making trade-off decisions against competing priorities; clearly something else is becoming a higher priority than compliance with SLA/SLO practices. What are these more important priorities, and are they in alignment with upper management's priorities?

As soon as you start asking questions along the lines of "what can a mere individual contributor do to make another manager pay attention to their own manager," you have identified a pathological power imbalance. The one tool you have is "complain to the higher level manager to make them aware of the non-compliance," and hope that higher level manager will do the needful things. If that higher level manager does not do the needful things, the individual contributor is kind of out of luck.

Under their own authority, that is. In the case of the questioner, there is a Reliability team with a manager. This means there is someone in the management chain who officially cares about this stuff, and can raise concerns higher up the org-chart. Non-compliance with policy is supposed to be a management problem, and should have management solutions. The fact the policy in question was put in place due to SRE thinking is relevant, but not the driving concern here.


The above works for organizations that are hierarchical, which implies deeper management chains. If you count the number of managers between the VP of Engineering and the average engineer and that number is between 1.0 and 2.5, you probably have a short enough org-chart to talk directly to the team in question for direct education (bridging the org-chart, to use Dr. Westrum's term). If the org-chart is >2.5 managers deep, you're better served going through the org-chart to solve this particular problem.

But if you're in a short org-chart company, and that other team is still refusing to comply with SLA/SLO policies, you're kind of stuck complaining to the VP of Engineering and hoping that individual forces alignment through some method. If the VPofE doesn't, that is a clear signal that Reliability is not as important to management as you thought, and you should go back to the fundamentals of making the case for prioritizing SRE practices generally.

...will never happen more than once at a company.

I say this knowing that chunks of Germany's civil infrastructure managed to standardize on SuSE desktops, and some may still be using SuSE. Some might view this as proof it can be done; I say that Linux desktops not spreading beyond this example is proof of why it didn't happen elsewhere. The biggest reason we have the German example is that the decision was top down. Government decision making is different than corporate decision making, which is why we're not going to see the same thing, a Linux desktop (actually laptop) mandate from on high, happen more than a few times, especially in the tech industry.

It all comes down to management and why Linux laptop users are using Linux in the first place.

You see, corporate laptops (hereafter referred to as "endpoints" to match management lingo) have certain constraints placed upon them when small companies become big companies:

  • You need some form of anti-virus and anti-malware scanning, by policy
  • You need something like either a VPN or other Zero Trust ability to do "device attestation", proving the device (endpoint) is authentic and not a hacker using stolen credentials from a person
  • You need to comply with the vulnerability management process, which means some ability to scan software versions on an endpoint and report up to a dashboard.
  • The previous three points strongly imply an ability to push software to endpoints

Windows has been able to do all four points since the 1990s. Apple came somewhat later, but this is what JAMF is for.

Then there is Linux. It is technically possible to do all of the above. Some tools, like osquery, were built for Linux first because the intended use was on servers. However, there is a big problem with Linux users. Get 10 Linux users in a room, and you're quite likely to get 10 different combinations of display server (Xorg or Wayland), window manager (GNOME, KDE, i3, others), and OS package manager. You need to either support that heterogeneity or commit to building the Enterprise Linux that has one from each category and forbids the others. Enterprise Linux is the route the German example took.

Which is when the Linux users revolt, because banning their tiling window manager in favor of Xorg/Gnome ruins their flow -- and similar complaints. The Windows and Apple users forced onto Linux will grumble about their flow changing and why all their favorite apps can't be used, but at least it'll be uniform. If you support all three, you'll get the same 5% Linux users but the self-selected cranky ones who can't use the Linux they actually want. Most of that 5% will "settle" for another Linux before using Windows or Apple, but it's not the same.

And 5% Linux users puts supportability of the platform below the concentration needed to support that platform well. Companies like Alphabet are big enough that their 5% makes a supportable population. For smaller companies like Atlassian, perhaps not. Which puts Enterprise Linux in that twilight state between outright banned and just barely supported, so long as you can tolerate all the jank.

Why tcp-mss-clamp still matters

This is blogging in anger after fighting this over the weekend. Because I'm like that, I have a backup cable ISP in case my primary fiber ISP flakes out. I work from home, so the existence of internet is critical to me getting paid, and neither cell phone has good enough service to hotspot reliably. Thus, having two ISPs. It's expensive, but then so would be missing work for a week while I wait for a cable tech to come out to diagnose why their stuff isn't working.

The backup ISP hasn't been working well for a while, but the network card pointing to the second cable modem flaked out two weeks ago and that meant replacement. Which refused to pick up address info (v4 or v6) off of DHCP. Doing a hard reset from the provider side fixed the issue, but left me with the curious circumstance of:

  • I can curl from the router
  • But nothing behind it could curl.
  • A packet trace of the behind-the-router case showed the TCP handshake finishing, but the TLS handshake failing after the initial hello.

What the actual fuck.

What fixed the problem was the following policy added to my firewalld config in /etc/firewalld/policies/backuprouter.xml.

<rule>
  <tcp-mss-clamp value="1448"/>
</rule>

MSS means 'maximum segment size', which is a TCP thing indicating how much of the packet the TCP payload can occupy. For networks with a typical Maximum Transmission Unit (MTU) size of 1500, MSS is typically 1460. Networking over things like VPNs often trims the effective MTU due to VPN overhead, often to 1492, with a corresponding reduction in MSS to 1452. The tcp-mss-clamp setting tells firewalld to lock MSS to 1448; if something behind it advertises a higher value, the router rewrites the MSS option in the TCP handshake so both ends negotiate segments that conform to the clamp.

The tcp-mss-clamp setting can also be set to 'pmtu', which will cause firewalld to work out from the path MTU what the effective MSS should be, so you don't have to hard-code a value. And yet, here I am, hard-coding, because crossing my own router seems to require an extra 4 bytes. I don't know why, and that angers me. Packet traces from the router itself show an MSS of 1452 working fine, but that provably doesn't work from behind my router.

Whatever. It works now, which is what matters, and now I'm contributing this nugget back to the internet.