"Columnar databases store data in columns, not rows," says the definition. I made a passing reference to the technology in Software Telemetry, but didn't spend any time on what they are and how they can help telemetry and observability. Over the last six months I worked on converting a centralized logging flow based on Elasticsearch, Kibana, and Logstash, to one based on Logstash and Apache Spark. For this article, I'll be using examples from both methods to illustrate what columnar databases let you do in telemetry systems.

How Elasticsearch is columnar (or not)

First of all, Elasticsearch isn't exactly columnar, but it can fake it to a point. You use Elasticsearch when you need full indexing and tokenization of every field in order to accelerate query-time performance. Born as it was in the early part of the 2010s, Elasticsearch trades ingestion-side complexity for optimized read-side performance. There is a reason that if you have a search field in your app, there is a good chance that Elasticsearch or OpenSearch is involved in the business logic. While Elasticsearch is "schema-less," schema still matters, and there are clear limits to how many fields you can add to an Elasticsearch index.

Each Elasticsearch index or datastream has defined fields. Fields can be defined at index/datastream creation, or configured to auto-create on first use. Both are quite handy in telemetry contexts. Each document in an index or datastream has a reference for every defined field, even if the contents of that field are null. If you have 30K fields and one document has only 19 of them defined, the rest will still exist on the document but be nulled, which in turn makes that 19-defined-field document rather larger than the same document in an index/datastream with only 300 defined fields.
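
To make that concrete, here is a minimal sketch using the elasticsearch-py client (8.x-style keyword arguments); the index name and fields are invented, but the dynamic-mapping flag and the total-fields limit are the knobs in play.

# Hypothetical index with a few explicit fields plus dynamic field creation.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="app-logs",
    mappings={
        "properties": {
            "timestamp": {"type": "date"},
            "message": {"type": "text"},
            "status_code": {"type": "integer"},
        },
        # Fields not listed above auto-create on first use.
        "dynamic": True,
    },
    settings={
        # Elasticsearch caps fields per index; the default is 1000, and
        # raising this limit is how field sprawl gets started.
        "index.mapping.total_fields.limit": 1000,
    },
)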

Larger average document size slows down search for everything in general, due to the number and size of field-indexes the system has to keep track of. This also balloons index/datastream size, which has operational impacts when it comes to routine operations like patching and maintenance. As I mentioned in Software Telemetry, Elasticsearch's cardinality problem manifests in the number of fields, not in the number of unique values in each field.

If you are willing to get complicated in your ingestion pipeline, carefully crafting telemetry shape and ingesting into multiple index/datastreams so that similarly shaped telemetry lands in its own shard, you can mitigate some of the above problems. Create an alias to use as your search endpoint, and populate the alias with the index/datastreams of your various shards. Elasticsearch is smart enough to know where to search, which lets you bucket your field-count cardinality problems in ways that will perform faster and save space. However, this is clearly adding complexity that you have to manage yourself.
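
A sketch of the alias trick, with invented index names; the point is that the alias becomes the one search endpoint over the shape-specific shards.

# Hypothetical: one index per telemetry shape, fronted by a single alias.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

shards = ("logs-web", "logs-batch", "logs-billing")
for shard in shards:
    es.indices.create(index=shard)

# Populate the alias with the shards; Elasticsearch fans queries out to
# only the indexes behind it.
es.indices.update_aliases(
    actions=[{"add": {"index": shard, "alias": "all-logs"}} for shard in shards]
)

# Searches go to the alias, not the individual shards.
hits = es.search(index="all-logs", query={"match": {"message": "timeout"}})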

How Apache Spark is columnar

Spark is pretty clearly columnar, which is why it's the de facto platform of choice for Business Intelligence operations. You know, telemetry for business ops, not engineering ops. A table defined in Spark (and most of its backing formats and stores, like Parquet or Hive) can have arbitrary columns defined in it. Data for each column is stored in separate files, which means a query looking to build a histogram of log-entries per hour, something like "SELECT hour(timestamp), COUNT(*) ... GROUP BY hour(timestamp)", is extremely efficient: the system only needs to look at a single column's files out of thousands.
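
Here is roughly what that per-hour histogram looks like in PySpark; the storage path and column names are assumptions, not anything from a real system.

# Sketch: build a log-entries-per-hour histogram off one column's files.
from pyspark.sql import SparkSession
from pyspark.sql.functions import hour, count

spark = SparkSession.builder.appName("log-histogram").getOrCreate()
logs = spark.read.parquet("s3://example-bucket/logs/")   # hypothetical path

histogram = (
    logs.groupBy(hour("timestamp").alias("hour_of_day"))
        .agg(count("*").alias("entries"))
        .orderBy("hour_of_day")
)
histogram.show()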

Columnar databases have to do quite a bit of read-time and ingestion-time optimization to truly perform fast, which demonstrates some of the tradeoffs of the style. Where Elasticsearch trades ingestion-time complexity to speed up read-time performance, columnar databases tilt the needle more towards read-time complexity in order to optimize overall resource usage. In short, columnar databases have better scaling profiles than something like Elasticsearch, but they don't query as fast as a result of the changed priorities. This is a far easier trade-off to make in 2024 than it was in 2014!

Columnar databases also don't tokenize the way Elasticsearch does. Have a free-text field that you want to do sub-string searches on? Elasticsearch is built from the bolts out to make that search as fast as possible. Columnar databases, on the other hand, do all of the string walking and searching at query-time instead of pulling the values out of some b-trees.

Where Elasticsearch suffers performance issues when field-count rises, Spark only encounters this problem if the query is written to encounter it, through "select *" or similar constructs. The only files hit by the query are the ones for columns referenced in the query! Have a table with 30K columns in it? So long as you query right, it should perform quite well; the 19-defined-fields-in-a-row problem isn't a problem so long as you're only referencing one of those 19 fields/columns.
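
A quick sketch of column pruning at work, again with invented table and column names.

# Only the files for the referenced columns get read, however wide the table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("column-pruning").getOrCreate()
logs = spark.read.parquet("s3://example-bucket/logs/")   # hypothetical path

narrow = logs.select("timestamp", "service")   # touches two columns' files
narrow.explain()   # the physical plan's ReadSchema lists only those columns

wide = logs.select("*")   # drags every column's files into the scan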

Why columnar is neat

A good centralized logging system can stand in for both metrics and traces, and in large part can do so because the backing databases for centralized logging are often columnar or columnar-like. There is nothing stopping you from creating metric_name and metric_value fields in your logging system, and building a bunch of metrics-type queries using those rows.
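
As a hedged sketch, a metrics-type query over log rows could look like this; the column names and the five-minute bucket are illustrative.

# Aggregate metric_value rows into time buckets, metrics-system style.
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, window

spark = SparkSession.builder.appName("metrics-from-logs").getOrCreate()
logs = spark.read.parquet("s3://example-bucket/logs/")   # hypothetical path

latency = (
    logs.where(logs.metric_name == "request_latency_ms")
        .groupBy(window("timestamp", "5 minutes"))        # time-bucket the rows
        .agg(avg("metric_value").alias("avg_latency_ms"))
)
latency.show(truncate=False)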

As for emulating tracing, this isn't done through OpenTelemetry; this is done old-school, through hacking. Chapter 5 in Software Telemetry covered how the Presentation Stage uses correlation identifiers:

"A correlation identifier is a string or number that uniquely identifies that specific execution or workflow."

Correlation identifiers allow you to build the charts that tracing systems like Jaeger, Tempo, and Honeycomb are known for. There is nothing stopping you from creating an array-of-strings field named "span_id" where you dump the span-stack for each log-line. Want to see all the logs for a given span? Here you are. Given a sophisticated enough visualization engine, you can even emulate the waterfall diagrams of dedicated tracing platforms.
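
A sketch of that lookup, with a made-up span identifier; span_id is the array-of-strings column described above.

# All log lines whose span-stack contains a given span, oldest first.
from pyspark.sql import SparkSession
from pyspark.sql.functions import array_contains

spark = SparkSession.builder.appName("logs-for-span").getOrCreate()
logs = spark.read.parquet("s3://example-bucket/logs/")   # hypothetical path

span_logs = (
    logs.where(array_contains("span_id", "f9c2a7d1"))
        .orderBy("timestamp")
)
span_logs.show(truncate=False)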

The reason we haven't used columnar databases for metrics systems has to do with cost. If you're willing to accept cardinality limits, a purpose-built metrics system can store a far greater number of metrics for the same money as a columnar database. However, the biggest companies are already using columnar datastores for engineering metrics, and nearly all companies are using columnar for business metrics.

But if you're willing to spend the extra resources to use a columnar-like datastore for metrics, you can start answering questions like "how many 5xx response-codes did accounts with the Iridium subscription encounter on October 19th?" Traditional metrics systems would consider subscription-type to be too highly cardinal, where columnar databases shrug and move on.
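
That question, sketched as a query; the column names, tier name, and exact date are all illustrative assumptions.

# Count 5xx responses for one subscription tier on one day.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("5xx-by-subscription").getOrCreate()
logs = spark.read.parquet("s3://example-bucket/logs/")   # hypothetical path

count_5xx = logs.where(
    col("status_code").between(500, 599)
    & (col("subscription") == "Iridium")
    & (to_date("timestamp") == "2024-10-19")
).count()
print(count_5xx)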

What this means for the future of telemetry and observability

Telemetry over the last 60 years of computing has gone from digging through the SYSLOG printout from one of your two servers, to digging through /var/log/syslog, to the creation of dedicated metrics systems, to the creation of tracing techniques. Every decade's evolution of telemetry has been constrained by the compute and storage performance envelope available to the average system operator.

  • The 1980s saw the proliferation of multi-server architectures as the old mainframe style went out of fashion, so centralized logging had to involve the network. NFS shares for Syslog.
  • The 1990s saw the first big scale systems recognizable as such by people in 2024, and the beginnings of analytics on engineering data. People started sending their web-logs direct to relational databases, getting out of the "tail and grep" realm and into something that kinda looks like metrics if you squint. Distributed processing got its start here, though hardly recognizable today.
  • The 2000s saw the first bespoke metrics systems and protocols, such as statsd and graphite. This era also saw the SaaS revolution begin, with Splunk being a big name in centralized logging, and NewRelic gaining traction for web-based metrics. Distributed processing got more involved, and at the end of the decade the big companies like Google and Microsoft lived and breathed these systems. Storage was still spinning disk, with some limited SSD usage in niche markets.
  • The 2010s saw the first tracing systems and the SaaS revolution ate a good chunk of the telemetry/observability space. The word observability entered wide usage. Distributed processing ended the decade as the default stance for everything, including storage. Storage bifurcated into bulk (spinning disk) and performance (SSD) tiers greatly reducing cost.

We're part way through the 2020s, and it's already clear to me that columnar databases are probably where telemetry systems are going to end up by the end of the decade. Business intelligence is already using them, so most of our companies already have them in their infrastructure. Barriers to adoption are going to be finding ways to handle the different retention and granularity requirements of what we now call the three pillars of observability:

  • Metrics need visibility going back years, and are aggregated not sampled. Observability systems doing metrics will need to allow multi-year retention somehow.
  • Tracing retention is 100% based on cost and sample-rate, which should improve over the decade.
  • Centralized logging is like tracing in that retention is 100% based on cost. True columnar stores scale more economically than Elasticsearch-style databases, which increases retention. How sample rate affects retention is less clear, and would have to involve some measure of aggregation to remain viable over time.

Having columnar databases at the core allows a convergence of the pillars of observability. How far we get in convergence over the next five years remains to be seen, and I look forward to finding out.

Incident response programs

Honeycomb had a nice post describing how they dropped a priority list of incident severities in favor of an attribute list. Their list is still a pick-one list; but instead of a 1-4 SEV scale, they're using a list of types like "ambiguous," "security," and "internal." The post goes into some detail about the problems with a unified list across a large organization, and the different response-level needs of different types of incidents. All very true.

A good incident response program needs to be approachable by anyone in the company, meaning anyone looking to open an incident should have reasonable success in picking incident attributes right. The incident automation industry, tools such as PagerDuty's Jeli and the Rootly platform, has settled on a pick-one list for severity, sometimes with support for additional fields. Unless a company is looking to home-build its own incident automation for creating Slack channels, managing the post-incident review process, and tracking remediation action items, these de facto conventions constrain the options available to an incident response program.

As Honeycomb pointed out, there are two axes that need to be captured by "severity": urgency and level of response. I propose the following pair of attributes:

Urgency

  1. Planning: the problem can be addressed through normal sprint or quarterly planning processes.
  2. Low: the problem has long lead times to either develop or validate the solution, where higher urgency would result in a lot of human resources stuck in wait loops.
  3. Medium: the problem can be addressed in regular business-hours operations; waiting overnight or a weekend won't make things worse. Can preempt sprint-level deliverable targets without question.
  4. High: the problem needs around-the-clock response and can preempt quarterly deliverable targets without question.
  5. Critical: the problem requires investor notification or other regulated public disclosure, and likely affects annual planning. Rare by definition.

Level of response

  1. Individual: The person who broke it can revert/fix it without much effort, and impact blast-radius is limited to one team. Post-incident review may not be needed beyond the team level.
  2. Team: A single team can manage the full response, such as an issue with a single service. Impact blast radius is likely one team. Post-incident review at the peer-team level.
  3. Peer team: A group of teams in the same department are involved in response due to interdependencies or the nature of the event. Impact blast-radius is clearly multi-team. Post-incident review at the peer-team level, and higher up the org-chart if the management chain is deep enough for it.
  4. Cross-org: Major incident territory, where the issue cuts across more than one functional group. These are rare. Impact blast-radius may be whole-company, but likely whole-product. Post-incident review will be global.
  5. C-level: High executive needs to run it because response is whole company in scope. Will involve multiple post-incident reviews.

Is Private? Yes/No - If yes, only the people involved in the response are notified of the incident and updates. Useful for Security and Compliance type incidents, where discoverability is actually bad. Some incidents qualify as Material Non-Public Information, which matters to companies with stocks being traded.
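
For illustration only, here is what that attribute set could look like as a data model; the enum values mirror the lists above, while everything else (names, defaults, the example at the end) is invented.

# Sketch of an incident record carrying the proposed attributes.
from dataclasses import dataclass
from enum import Enum

class Urgency(Enum):
    PLANNING = 1
    LOW = 2
    MEDIUM = 3
    HIGH = 4
    CRITICAL = 5

class ResponseLevel(Enum):
    INDIVIDUAL = 1
    TEAM = 2
    PEER_TEAM = 3
    CROSS_ORG = 4
    C_LEVEL = 5

@dataclass
class Incident:
    title: str
    urgency: Urgency
    response_level: ResponseLevel
    is_private: bool = False   # if True, only responders see updates

# The "grass fire" pairing from the list below.
grass_fire = Incident("checkout latency spike", Urgency.MEDIUM, ResponseLevel.TEAM)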

The combinatorics indicate that 5*5=25 pairs, 50 if you include Is Private, which makes for an unwieldy pick-one list. However, like stellar types there is a kind of main sequence of pairs that are more common, with problematic outliers that make simple solutions a troublesome fit. Let's look at a few pairs that are on the main sequence of event types:

  • Planning + Individual: Probably a feature-flag had to be rolled back real quick. Spend some time digging into the case. Incidents like this sometimes get classified "bug" instead of "incident."
  • Low + Team: Such as a Business Intelligence failure, where revenue attribution was discovered to be incorrect for a new feature, and time is needed to back-correct issues and validate against expectations. May also be classified as "bug" instead of "incident."
  • Medium + Team: Probably the most common incident type that doesn't get classified as a "bug," these are the highway verge grass fires of the incident world; small in scope, over quick, one team can deal with it.
  • Medium + Peer Team: Much like the previous but involving more systems in scope. Likely requires coordinated response between multiple teams to reach a solution. These teams work together a lot, by definition, so it would be a professional and quick response.
  • High + Cross-org: A platform system had a failure that affected how application code responds to platform outages, leading to a complex, multi-org response. Response would include possibly renegotiating SLAs between platform and customer-facing systems. Also, remediating the Log4j vulnerability, which requires touching every usage of Java in the company inclusive of vendored usage, counts as this kind of incident.
  • Critical + Cross-org: An event like the Log4j vulnerability, and the Security org has evidence that security probes found something. The same remediation response as the previous, but with added "reestablish trust in the system" work on top of it, and working on regulated customer notices.

Six of 25 combinations. But some of the others are still viable, even if they don't look plausible on the surface. Let's look at a few:

  • Critical + Team: A bug is found in SOX reporting that suggests incorrect data was reported to stock-holders. While the C-levels are interested, they're not in the response loop beyond the 'stakeholder' role and being the signature that stock-holder communications will be issued under.
  • Low + Cross-org: Rapid retirement of a deprecated platform system, forcing the teams still using the old system to crash-migrate to the new one.
  • Planning + Cross-org: The decision to retire a platform system is made as part of an incident, and migrations are inserted into regular planning.

How is an organization supposed to build a pick-one list from this mess that is usable? This is hard work!

Some organizations solve this by bucketing incidents using another field, and allowing the pick-one list to mean different things based on what that other field says. A Security SEV1 gets a different scale of response than a Revenue SEV1, which in turn gets a different type of response than an Availability SEV1. Systems like this have problems with incidents that cross buckets, such as a Security issue that also affects Availability. It's for this reason that Honeycomb has an 'ambiguous' bucket.

A few organizations go so far as to have a fully separate process for the 'High' and 'Critical' urgencies of events, maybe calling them Disaster Recovery events instead of Incidents. DR events need to be rare, which means that process isn't as well exercised as Incident response. However, a separate process makes it abundantly clear that certain urgencies and scopes require different process overall. More on this in a later blog-post.

Other orgs handle the outlier problem differently, taking those incidents out of the incident process and into another process altogether. Longer-flow problems, the Low urgency above, get called something like a Code Yellow after the Google practice, or a Code Red at the Critical + C-level end, to handle long-flow, big problems.

Honeycomb took the bucketing idea one step further and dropped urgency and level of response entirely, focusing instead on incident type. A process like this still needs ways to manage urgency and response-scope differences, but this is being handled at a layer below incident automation. In my opinion, a setup like this works best when Engineering is around Dunbar's Number or less in size, allowing informal relationships to carry a lot of weight. Companies with deeper management chains, and thus more engineers, will need more formalism to determine cross-org interaction and prioritization.

Another approach is to go super broad with your pick-one list, and make it apply to everyone. While this approach disambiguates pretty well between the SEV 1 highest-urgency problems and the SEV 2 urgent-but-not-pants-on-fire ones, it's less good at disambiguating SEV 3 and SEV 4 incidents. Those incidents tend to only have local scope, so local definitions will prevail, meaning only locals will know how to correctly categorize issues.


There are several simple answers for this problem, but each simplification has its own problem. Your job is to pick the problems your org will put up with.

  • How much informal structure can you rely on? The smaller the org, the more one size is likely to fit all.
  • Do you need to interoperate with a separate incident response process, perhaps an acquisition or a parent company?
  • How often do product-local vs global incidents happen? For one product companies, these are the same thing. For companies that are truly multi-product, this distinction matters. The answer here influences how items on your pick-one list are dealt with, and whether incident reporters are likely to file cross-product reports.
  • Does your incident automation platform allow decision support in its reporting workflow? Think of a next, next, next, done wizard where each screen asks clarifying questions. Helpful for folk who are not sure how a given area wants their incidents marked up, less helpful for old hands who know exactly what needs to go in each field.

Rust and the Linux kernel

One of the kernel maintainers made social waves by bad-mouthing Rust and the project to rebuild the Linux kernel in Rust. The idea of rebuilding the kernel in "Rust: the memory-safe language" rather than "the C in CVE stands for C/C++" makes a whole lot of sense. However, there is more to a language than how memory-safe it is and whether a well-known engineer calls it a "toy" language.

One of the products offered by my employer is written in Elixir, which is built on top of Erlang. Elixir had an 8-or-so-month period of fame, which is when the decision to write that product was made. We picked Elixir because the Erlang engine gives you a lot of concurrency and async processing relatively easily. And it worked! That product was a beast on relatively little CPU. We had a few cases of 10x usage from customers, and it just scaled up, no muss, no fuss.

The problems with the product came not in the writing, but in the maintaining and productionizing. Some of the issues we've had over the years, many of which got better as the Elixir ecosystem matured:

  • The ability to make a repeatable build, needed for CI systems
  • Dependency management in modules
  • Observability ecosystem support, such as OpenTelemetry SDKs
  • Build tooling support usable by our CI systems
  • Maturity of the module ecosystem, meaning we had to DIY certain tasks that our other main product never had to bother with. Or the modules that exist only covered 80% of the use-cases.
  • Managing Erlang VM startup during deploys

My opinion is that the dismissiveness from this particular Linux Kernel Maintainer had to do with this list. The Linux kernel and module ecosystem is massive, with highly complex build processes spanning many organizations, and regression testing frameworks to match. Ecosystem maturity matters way more for CI, regression, and repeatable build problems than language maturity.

Rust has something Elixir never had: durable mindshare. Yeah, the kernel rebuild process has taken many years, and has many years to go. Durable mindshare means that engineers are sticking with it, instead of chasing the next hot new memory safe language.

SysAdmins have no trouble making big lists of what can go wrong and what we're doing to stave that off a little longer. The tricky problem is pushing large organizations to take a harder look at systemic risks and taking them seriously. I mean, the big companies have to have disaster recovery (DR) plans for compliance reasons; but there are a lot of differences between box-ticking DR plans and comprehensive DR plans.

Any company big enough to get past the "running out of money is the biggest disaster" phase has probably spent some time thinking about what to do if things go wrong. But how do you, the engineer in the room, get the deciders to think about disasters in productive ways?

The really big disasters are obvious:

  • The datacenter catches fire after a hurricane
  • The Region goes dark due to a major earthquake
  • Pandemic flu means 60% of the office is offline at the same time
  • An engineer or automation accidentally:
    • Drops all the tables in the database
    • Deletes all the objects out of the object store
    • Destroys all the clusters/servlets/pods
    • Deconfigures the VPN
  • The above happens and you find your backups haven't worked in months

All obvious stuff, and building to deal with them will let you tick the box for compliance DR. Cool.

But there are other disasters, the sneaky ones that make you think and take a hard look at process and procedures in a way that the "oops we lost everything of [x] type" disasters generally don't.

  • An attacker subverts your laptop management software (JAMF, InTune, etc) and pushes a cryptolocker to all employee laptops
  • 30% of your application secrets got exposed through a server side request forgery (SSRF) attack
  • Nefarious personages get access to your continuous integration environment and inject trojans into your dependency chains
  • A key third party, such as your payment processor, gets ransomwared and goes offline for three weeks
  • A Slack/Teams bot got subverted and has been feeding internal data to unauthorized third parties for months

The above are all kinda "security" disasters, and that's my point. SysAdmins sometimes think of these, but even we are guilty of not having the right mental models to rattle these off the top of our head when asked. Asking about disasters like this list should start conversations that generally don't happen. Or you get the bad case: people shrug and say "that's Security's problem, not ours," which is a sign you have a toxic reliability culture.

Security-type disasters have a phase that merely technical disasters lack: how do we restore trust in production systems? In technical disasters, you can start recovery as soon as you've detected the disaster. For security disasters recovery has to wait until the attacker has been evicted, which can take a while. This security delay means key recovery concepts like Recovery Time and Recovery Point Objectives (RTO/RPO) will be subtly different.

If you're trying to knock loose some ossified DR thinking, these security type disasters can crack open new opportunities to make your job safer.

I've now spent over a decade teaching how alarms are supposed to work (specific, actionable, with the appropriate urgency) and even wrote a book on how to manage metrics systems. One topic I was repeatedly asked to cover in the book, but declined because the topic is big enough for its own book, is how to do metrics right. The desire for an expert to lay down how to do metrics right comes from a number of directions:

  • No one ever looked at ours in a systematic way and our alerts are terrible [This is asking about alerts, not metrics; but they still were indirectly asking about metrics]
  • We keep having incidents and our metrics aren't helping, how do we make them help?
  • Our teams have so many alarms important ones are getting missed [Again, asking about alerts]
  • We've half assed it, and now we're getting a growth spurt. How do we know what we should be looking for?

People really do conflate alarms/alerts with metrics, so any discussion about "how do we do metrics better" is often a "how do we do alarms better" question in disguise. As for the other two points, where people have been using vibes to pick metrics and that's no longer scaling, we actually do have a whole lot of advice; you have a whole menu of "golden signals" to pick from depending on how your application is shaped.

That's only sort of why I'm writing this.

In the mathematical construct of Site Reliability Engineering, where everything is statistics and numerical analysis, metrics are easy. Track the things that affect availability, regularly triage your metrics to ensure continued relevance, and put human processes into place to make sure you're not burning out your worker-units. But the antiseptic concept of SRE only exists in a few places, the rest of us have to pollute the purity of math with human emotions. Let me explain.

Consider your Incident Management process. There are certain questions that commonly arise when people are doing the post incident reviews:

  • Could we have caught this before release? If so, what sort of pre-release checks should we add to catch this earlier?
  • Did we learn about this from metrics or customers? If customers, what metrics do we need to add to catch this earlier? If metrics, what processes or alarms should we tune to catch this earlier?
  • Could we have caught this before the feature flag rolled out to the Emerald users? Do we need to tune the alarm thresholds to catch issues like this in groups with less feature-usage before the high value customers on Emerald plans?

And so on. Note that each question asks about refining or adding metrics. Emotionally, metrics represent anxieties. Metrics are added to catch issues before they hurt us again. Metrics are retained because they're tracking something that used to hurt us and might hurt again. This makes removing metrics hard; the people involved remember why certain metrics are present and intuitively know they need tracking, which means emotion says keep them.

Metrics are scar tissue, and removing scar tissue is hard, bloody work. How do you reduce the number of metrics, while also not compromising your availability goals? You need the hard math of SRE to work down those emotions, but all it takes is one Engineering Manager to say "this prevented a SEV, keep it" to blow that effort up. This also means you'll have much better luck with a metric reformation effort if teams are already feeling the pinch of alert fatigue or your SaaS metric provider bills are getting big enough that the top of the company is looking at metric usage to reduce costs.

Sometimes, metrics feed into Business Intelligence. That's less about scar tissue and more about optimizing your company's revenue operations. Such metrics are less likely to lead to rapid-response on-call rotations, but still can lead to months long investigations into revenue declines. That's a different but related problem.

I could write a book about making your metrics suck less, but that book by necessity has to cover a lot of human-factors issues and has to account for the role of Incident Management in metrics sprawl. Metrics are scar tissue, keep that in mind.

In a Slack I'm on someone asked a series of questions that boil down to:

Our company has a Reliability team, but another team is ignoring SLA/SLO obligations. What can SRE do to fix this?

I got most of the way through a multi-paragraph answer before noticing my answer was, "This isn't SRE's job, it's management's job." I figured a blog post might help explain this stance better.

The genius behind the Site Reliability Engineer concept at Google is that they figured out how to make service uptime and reliability matter to business management. The mathematical framework behind SRE is all about quantifying risk, quantifying impact, and thereby quantifying lost revenue; possibly even quantifying lost sales opportunity. All this quantifying falls squarely into the management mindset of "you can't manage what you can't measure," crossed with the "if I can't measure it, it's an outside dependency I can ignore" subtext. SRE is all about making uptime and reliability a business problem worth spending management cycles on.

In the questioner's case we already have some signal that their management has integrated SRE concepts into management practice:

  • They have a Reliability team, which only happens if someone in management believes reliability is important enough to devote dedicated headcount and a manager to.
  • They have Service Level Agreement and Service Level Objective concepts in place
  • Those SLA/SLO obligations apply to more teams than the Reliability team itself, indicating there is at least some management push to distribute reliability thinking outside of the dedicated Reliability team.

The core problem the questioner is running into is that this non-compliant team is getting away with ignoring SLA/SLO stuff, and the answer to "what can SRE do to fix this" is to be found in why and how that team is getting away with it. Management is all about making trade-off decisions against competing priorities; clearly something else has become a higher priority than compliance with SLA/SLO practices. What are these more important priorities, and are they in alignment with upper management's priorities?

As soon as you start asking questions along the lines of "what can a mere individual contributor do to make another manager pay attention to their own manager," you have identified a pathological power imbalance. The one tool you have is "complain to the higher level manager to make them aware of the non-compliance," and hope that higher level manager will do the needful things. If that higher level manager does not do the needful things, the individual contributor is kind of out of luck.

Under their own authority, that is. In the case of the questioner, there is a Reliability team with a manager. This means there is someone in the management chain who officially cares about this stuff, and can raise concerns higher up the org-chart. Non-compliance with policy is supposed to be a management problem, and should have management solutions. The fact the policy in question was put in place due to SRE thinking is relevant, but not the driving concern here.


The above works for organizations that are hierarchical, which implies deeper management chains. If you count the number of managers between the VP of Engineering and the average engineer and that number is between 1.0 and 2.5, you probably have a short enough org-chart to talk to the team in question directly for direct education (bridging the org-chart, to use Dr. Westrum's term). If the org-chart is >2.5 managers deep, you're better served going through the org-chart to solve this particular problem.

But if you're in a short org-chart company, and that other team is still refusing to comply with SLA/SLO policies, you're kind of stuck complaining to the VP of Engineering and hoping that individual forces alignment through some method. If the VPofE doesn't, that is a clear signal that reliability is not as important to management as you thought, and you should go back to the fundamentals of making the case for prioritizing SRE practices generally.

The corporate Linux desktop

...will never happen more than once at a company.

I say this knowing that chunks of Germany's civil infrastructure managed to standardize on SuSE desktops, and some may still be using SuSE. Some might view this as proof it can be done; I say that Linux desktops not spreading beyond this example is proof of why it won't happen again. The biggest reason we have the German example is that the decision was top down. Government decision making is different from corporate decision making, which is why we're not going to see the same thing, a Linux desktop (actually laptop) mandate from on high, happen more than a few times; especially in the tech industry.

It all comes down to management and why Linux laptop users are using Linux in the first place.

You see, corporate laptops (hereafter referred to as "endpoints" to match management lingo) have certain constraints placed upon them when small companies become big companies:

  • You need some form of anti-virus and anti-malware scanning, by policy
  • You need something like either a VPN or other Zero Trust ability to do "device attestation", proving the device (endpoint) is authentic and not a hacker using stolen credentials from a person
  • You need to comply with the vulnerability management process, which means some ability to scan software versions on an endpoint and report up to a dashboard.
  • The previous three points strongly imply an ability to push software to endpoints

Windows has been able to do all four points since the 1990s. Apple came somewhat later, but this is what JAMF is for.

Then there is Linux. It is technically possible to do all of the above. Some tools, like osquery, were built for Linux first because the intended use was on servers. However, there is a big problem with Linux users. Get 10 Linux users in a room, and you're quite likely to get 10 different combinations of display server (Xorg or Wayland), desktop environment or window manager (GNOME, KDE, i3, others), and OS package manager. You need to either support that heterogeneity or commit to building the Enterprise Linux that has one pick from each category and forbids the others. The Enterprise Linux route is what the German example took.

Which is when the Linux users revolt, because banning their tiling window manager in favor of Xorg/GNOME ruins their flow -- and similar complaints. The Windows and Apple users forced onto Linux will grumble about their flow changing and why all their favorite apps can't be used, but at least it'll be uniform. If you support all three platforms, you'll still get the same 5% Linux users, but now they're the self-selected cranky ones who can't use the Linux they actually want. Most of that 5% will "settle" for another Linux before using Windows or Apple, but it's not the same.

And 5% Linux users puts supportability of the platform below the concentration needed to support that platform well. Companies like Alphabet are big enough that their 5% makes a supportable population. For smaller companies like Atlassian, perhaps not. Which puts Enterprise Linux in that twilight state between outright banned and just barely supported, so long as you can tolerate all the jank.

Why tcp-mss-clamp still matters

This is blogging in anger after fighting this over the weekend. Because I'm like that, I have a backup cable ISP in case my primary fiber ISP flakes out. I work from home, so the existence of internet is critical to me getting paid, and neither cell phone has good enough service to hotspot reliably. Thus, having two ISPs. It's expensive, but then so would be missing work for a week while I wait for a cable tech to come out to diagnose why their stuff isn't working.

The backup ISP hadn't been working well for a while, and then the network card pointing at the second cable modem flaked out two weeks ago, which meant a replacement. The replacement card refused to pick up address info (v4 or v6) over DHCP. Doing a hard reset from the provider side fixed that issue, but left me with the curious circumstance of:

  • I can curl from the router
  • But nothing behind it could curl.
  • A packet trace of the behind-the-router case showed the TCP handshake finishing, but the TLS handshake failing after the initial hello.

What the actual fuck.

What fixed the problem was the following rule, added to my firewalld policy in /etc/firewalld/policies/backuprouter.xml.

<!-- rich rule fragment; it lives inside the <policy> element of that file -->
<rule>
  <tcp-mss-clamp value="1448"/>
</rule>

MSS means 'maximum segment size,' which is a TCP thing indicating how much data the TCP portion of the packet can carry. For networks with a typical Maximum Transmission Unit (MTU) size of 1500, MSS is typically 1460. Networking over things like VPNs often trims the effective MTU due to VPN overhead, often to 1492, with a corresponding reduction in MSS to 1452. The tcp-mss-clamp setting tells firewalld to lock MSS to 1448; if something behind the router advertises a higher MSS in the TCP handshake, the router rewrites it down to conform to the clamp.
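
For reference, the arithmetic behind those numbers is plain header subtraction (20 bytes of IPv4 header plus 20 bytes of TCP header, assuming no options); the 1448 I ended up hard-coding implies an effective MTU of 1488.

# MSS = MTU minus IPv4 and TCP headers.
IP_HEADER = 20
TCP_HEADER = 20

for mtu in (1500, 1492, 1488):
    print(f"MTU {mtu} -> MSS {mtu - IP_HEADER - TCP_HEADER}")
# MTU 1500 -> MSS 1460
# MTU 1492 -> MSS 1452
# MTU 1488 -> MSS 1448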

The tcp-mss-clamp setting can be set to 'pmtu' which will cause firewalld to probe what the effective MTU (and by proxy MSS) number should be so you don't have to hard-code. And yet, here I am, hard-coding because crossing my own router seems to require an extra 4 bytes. I don't know why, and that angers me. Packet traces from the router itself show MSS of 1452 working fine, but that provably doesn't work from behind my router.

Whatever. It works now, which is what matters, and now I'm contributing this nugget back to the internet.

20 years of this nonsense

20 years ago today I published the first post to this blog. It was a non-sequitur post because I didn't want my first post to be "is this thing on?" or similar. I had to look up what the Exchange worm I mentioned was, and it was probably MyDoom. That was a mass-mailing worm, because this was before anti-virus was a routine component of email setups. I started a blog because I needed something to host on this new "web pages from your home directory" feature I was asked to create, and this was the first content on that project! I needed something to look at to prove it worked, and having external traffic to demonstrate alongside made the demo even better. In point of fact, this blog was originally hosted on a NetWare server running NetWare 5.1. On the internet and everything!

At first I used it a bit like how I later used Twitter: small posts, sometimes multiple times a day. Micro-blogging, in other words. So no wonder that usage went away once Twitter showed up. The original blog software was Blogger, back when it still had a "publish by FTP" feature. When Blogger announced the death of FTP publishing, that was when I moved to this domain.

You can see what the original blog looked like on Archive.org: https://web.archive.org/web/20040609170419/http://myweb.facstaff.wwu.edu/~riedesg/sysadmin1138/archive/2004_01_01_sysadmin.html Back then Blogger didn't create a page per blog-post unless you gave the post a title, which I mostly didn't. The embarrassing misspelling in the header stuck around for way too long.

My posting frequency tapered way off around 2011 for two reasons:

  1. I got a new job in the private sector, which meant that what I was working on was covered by confidentiality policies for the first time. Previously, everything I did could be revealed with the state version of a Freedom of Information Act filing. It took me quite a while to learn the art of talking about work while not talking about work.
  2. Twitter plus the death of Google Reader ended up moving my energies elsewhere.

If you take all of my blog posts and look at the middle post, that post lands somewhere in 2007, at the peak of blogging in general. This blog remains where I post my long form opinions! That isn't going to change any time soon.

Getting the world off Chrome

I'm seeing more and more folk post, "we got out of IE, we can get out of Chrome," on social media, referencing the nigh-monopoly Chrome has on the browsing experience. Unless you're using Firefox or Safari, you're using Chrome or a Chromium-derived browser. For those of you too young to remember what internet life under Internet Explorer was like, here is a short list of why it was not great:

  • Once Microsoft got the browser-share lock-in, it kind of stopped innovating in the browser. It had conquered the market, so it could pull back investment in it.
  • IE didn't follow standards. But then, Microsoft was famous for "embrace and extend," where they adopt (mostly) a standard, then Microsoftify it enough no one considers using the non-MS version of the standard.
  • If you were on a desktop platform that didn't have IE, such as Apple Macintosh, you were kinda screwed.

Google Chrome took over from IE for three big reasons:

  • They actually were standards compliant, more so than the other alt-browsers (Mozilla's browsers, Opera, and Safari)
  • They actually were trying to innovate in the browser
  • Most important: they were a megacorp with a good reputation who wanted everyone to use their browser. Mozilla and Opera were too small for that, and Apple never has been all that comfortable supporting non-Apple platforms. In classic dot-com era thinking, Google saw a dominant market player grow complacent and smelled a business opportunity.

This made Chrome far easier to develop for, and Chrome grew a reputation for being a web developer's browser. This fit nicely into Google's plan for the future, which they saw as full of web applications. Google understands what they have, and how they got there. They also understand "embrace and extend," but found a way to do it without making it proprietary the way Microsoft did: capture the standards committees.

If you capture the standards committees, meaning what you want is almost guaranteed a rubber stamp from the committee, then you get to define what industry standard is. Microsoft took a capitalist, closed-source approach to embrace and extend where the end state was a place where the only viable way to do a thing was the thing that was patent-locked into Microsoft. Google's approach is more overtly FOSSY in that they're attempting to get internet consensus for their changes, while also making it rather harder for anyone else to do what they do.

Google doesn't always win. Their "web environment integrity" proposal, which would have given web site operators far greater control over browser extensions like ad-blockers, quietly got canned recently after internet outrage. Another area that got a lot of push back from the internet was Chrome's move away from "v2 manifest" extensions, which include ad-blockers, in favor of "v3 manifest" plugins which made ad-blockers nearly impossible to write. The move from v2 to v3 was delayed a lot while Google grudgingly put in abilities for ad-blockers to continue working.

Getting off of Chrome

The circumstances that drove the world off of Internet Explorer aren't there for Chrome.

  • Chrome innovates constantly and in generally user-improving ways (so long as that improvement doesn't majorly affect ad-revenue)
  • Chrome listens, to a point, to outrage when decisions are made
  • Chrome is functionally setting web standards, but doing so through official channels with RFCs, question periods, and all that ritual
  • Chrome continues to consider web-developer experience to be a number one priority
  • Alphabet, Google's parent company, fully understands what happens when the dominant player grows complacent, they get replaced the way Google replaced Microsoft in the browser wars.

One thing has changed since the great IE to Chrome migration began, Google lost its positive reputation. The old "don't be evil" thing was abandoned a long time ago, and everyone knows it. Changes proposed by Google or Google proxies are now viewed skeptically; though, overtly bad ideas still require internet outrage to delay or prevent a proposal from happening.

That said, you lose monopolies through either laziness of the monopolist (Microsoft) or regulatory action, and I'm not seeing any signs of laziness.