...will never happen more than once at a company.

I say this knowing that chunks of Germany's civil infrastructure managed to standardize on SuSE desktops, and some may still be using SuSE. Some might view this as proof it can be done, I say that Linux desktops not spreading beyond this example is proof of why it didn't happen. The biggest reason we have the German example is because the decision was top down. Government decision making is different than corporate decision making, which is why we're not going to see the same thing happen, a Linux desktop (actually laptop) mandate from on high, more than few times; especially in the tech industry.

it all comes down to management and why Linux laptop users are using Linux in the first place.

You see, corporate laptops (hereafter referred to as "endpoints" to match management lingo) have certain constraints placed upon them when small companies become big companies:

  • You need some form of anti-virus and anti-malware scanning, by policy
  • You need something like either a VPN or other Zero Trust ability to do "device attestation", proving the device (endpoint) is authentic and not a hacker using stolen credentials from a person
  • You need to comply with the vulnerability management process, which means some ability to scan software versions on and endpoint and report up to a dashboard.
  • The previous three points strongly imply an ability to push software to endpoints

Windows has been able to do all four points since the 1990s. Apple came somewhat later, but this is what JAMF is for.

Then there is Linux. It is technically possible to do all of the above. Some tools, like osquery, were built for Linux first because the intended use was on servers. However, there is a big problem with Linux users. Get 10 Linux users in a room, and you're quite likely to get 10 different combination of display manager (xorg or wayland), window manager (gnome, kde, i3, others), and OS package manager. You need to either support heterogeneity or commit to building the Enterprise Linux that has one from each category and forbid others. Enterprise Linux is what the German example did.

Which is when the Linux users revolt, because banning their tiling window manager in favor of Xorg/Gnome ruins their flow -- and similar complaints. The Windows and Apple users forced onto Linux will grumble about their flow changing and why all their favorite apps can't be used, but at least it'll be uniform. If you support all three, you'll get the same 5% Linux users but the self-selected cranky ones who can't use the Linux they actually want. Most of that 5% will "settle" for another Linux before using Windows or Apple, but it's not the same.

And 5% Linux users puts supportability of the platform below the concentration needed to support that platform well. Companies like Alphabet are big enough the 5%  is big enough to make a supportable population. For smaller companies like Atlassian, perhaps not. Which puts Enterprise Linux in that twilight state between outright banned and just barely supported so long as you can tolerate all the jank.

Why tcp-mss-clamp still matters

This is blogging in anger after fighting this over the weekend. Because I'm like that I have a backup cable ISP in case my primary fiber ISP flakes out. I work from home, so the existence of internet is critical to me getting paid, and neither cell phone has good enough service to hotspot reliably. Thus, having two ISPs. It's expensive, but then so would be missing work for a week while I wait for a cable tech to come out to diagnose why their stuff isn't working.

The backup ISP hasn't been working well for a while, but the network card pointing to the second cable modem flaked out two weeks ago and that meant replacement. Which refused to pick up address info (v4 or v6) off of DHCP. Doing a hard reset from the provider side fixed the issue, but left me with the curious circumstance of:

  • I can curl from the router
  • But nothing behind it could curl.
  • Looking at the packet trace of the behind the router case saw the TCP handshake finish, but TLS handshake fail after the initial hello.

What the actual fuck.

What fixed the problem was the following policy added to my firewalld config in /etc/firewalld/policies/backuprouter.xml.

  <tcp-mss-clamp value="1448"/>

MSS means 'maximum segment size' which is a TCP thing indicating how much the TCP portion of the packet can occupy. For networks with a typical Maximum Transfer Unit (MTU) size of 1500, MSS is typically 1460. Networking over things like VPNs often trims the effective MTU due to VPN overhead, often to 1492 with a corresponding reduction in MSS to 1452. The tcp-mss-clamp setting is telling firewalld to lock MSS to 1448; so if something behind it requests higher, the router will rewrite (and reassemble) segments to conform to the MSS setting.

The tcp-mss-clamp setting can be set to 'pmtu' which will cause firewalld to probe what the effective MTU (and by proxy MSS) number should be so you don't have to hard-code. And yet, here I am, hard-coding because crossing my own router seems to require an extra 4 bytes. I don't know why, and that angers me. Packet traces from the router itself show MSS of 1452 working fine, but that provably doesn't work from behind my router.

Whatever. It works now, which is what matters, and now I'm contributing this nugget back to the internet.

20 years of this nonsense

20 years ago today I published the first post to this blog. It was a non-sequitur post because I didn't want my first post to be "is this thing on?" or similar. I had to look up what the Exchange worm I mentioned was, and it was probably MyDoom. That was a mass mailing worm, because this was before anti-virus was a routine component of email setups. I started a blog because I needed something to host on this new "web pages from your home directory" feature I was asked to create, and this was the first content on that project! I needed something to look at to prove it worked, and having external traffic to demonstrate along side made the demo even. In point of fact, this blog was originally hosted on a NetWare server running NetWare 5.1. On the internet and everything!

At first I used it a bit like how I later used Twitter, small posts sometimes multiple times a day. Micro-blogging in other words. So no wonder that went away once Twitter showed up. The original blog-software was on Blogger, back when they still had a "publish by FTP" feature. When blogger announced the death of FTP publishing, that was when I moved to this domain.

You can see what the original blog looked liked on Archive.org: https://web.archive.org/web/20040609170419/http://myweb.facstaff.wwu.edu/~riedesg/sysadmin1138/archive/2004_01_01_sysadmin.html Back then Blogger didn't create a page per blog-post unless you gave the post a title, which I mostly didn't. The embarrassing misspelling in the header stuck around for way too long.

My posting frequency tapered way off around 2011 for two reasons:

  1. I got a new job in the private sector, which meant that what I was working on was covered by confidentiality policies for the first time. Previously, everything I did could be revealed with the state version of a Freedom of Information Act filing. It took me quite a while to learn the art of talking about work while not talking about work.
  2. Twitter plus the death of Google Reader ended up moving my energies elsewhere.

If you take all of my blog posts and look at the middle post, that post is in 2007 somewhere; in the peak of blogging in general. This blog remains where I post my long form opinions! That isn't going to change any time soon.

Getting the world off Chrome

I'm seeing more and more folk post, "we got out of IE, we can get out of Chrome," on social media, referencing the nigh-monopoly Chrome has on the browsing experience. Unless you're using Firefox or Safari, you're using Chrome or a Chromium-derived browser. For those of you too young to remember what internet life under Internet Explorer was like, here is a short list of why it was not great:

  • Once Microsoft got the browser-share lock in, it kind of stopped innovating the browser. It conquered the market, so they could pull back investment in it.
  • IE didn't follow standards. But then, Microsoft was famous for "embrace and extend," where they adopt (mostly) a standard, then Microsoftify it enough no one considers using the non-MS version of the standard.
  • If you were on a desktop platform that didn't have IE, such as Apple Macintosh, you were kinda screwed.

Google Chrome took over from IE for three big reasons:

  • They actually were standards compliant, more so than the other alt-browsers (Mozilla's browsers, Opera, and Safari)
  • They actually were trying to innovate in the browser
  • Most important: they were a megacorp with a good reputation who wanted everyone to use their browser. Mozilla and Opera were too small for that, and Apple never has been all that comfortable supporting non-Apple platforms. In classic dot-com era thinking, Google saw a dominant market player grow complacent and smelled a business opportunity.

This made Chrome far easier to develop for, and Chrome grew a reputation for being a web developer's browser. This fit in nicely to Google's plan for the future, which they saw as full of web applications. Google understands what they have, and how they got there. They also understand "embrace and extend," but found a way to do that without making it proprietary the way Microsoft did: capture the standards committees.

If you capture the standards committees, meaning what you want is almost guaranteed a rubber stamp from the committee, then you get to define what industry standard is. Microsoft took a capitalist, closed-source approach to embrace and extend where the end state was a place where the only viable way to do a thing was the thing that was patent-locked into Microsoft. Google's approach is more overtly FOSSY in that they're attempting to get internet consensus for their changes, while also making it rather harder for anyone else to do what they do.

Google doesn't always win. Their "web environment integrity" proposal, which would have given web site operators far greater control over browser extensions like ad-blockers, quietly got canned recently after internet outrage. Another area that got a lot of push back from the internet was Chrome's move away from "v2 manifest" extensions, which include ad-blockers, in favor of "v3 manifest" plugins which made ad-blockers nearly impossible to write. The move from v2 to v3 was delayed a lot while Google grudgingly put in abilities for ad-blockers to continue working.

Getting off of Chrome

The circumstances that drove the world off of Internet Explorer aren't there for Chrome.

  • Chrome innovates constantly and in generally user-improving ways (so long as that improvement doesn't majorly affect ad-revenue)
  • Chrome listens, to a point, to outrage when decisions are made
  • Chrome is functionally setting web standards, but doing so through official channels with RFCs, question periods, and all that ritual
  • Chrome continues to consider web-developer experience to be a number one priority
  • Alphabet, Google's parent company, fully understands what happens when the dominant player grows complacent, they get replaced the way Google replaced Microsoft in the browser wars.

One thing has changed since the great IE to Chrome migration began, Google lost its positive reputation. The old "don't be evil" thing was abandoned a long time ago, and everyone knows it. Changes proposed by Google or Google proxies are now viewed skeptically; though, overtly bad ideas still require internet outrage to delay or prevent a proposal from happening.

That said, you lose monopolies through either laziness of the monopolist (Microsoft) or regulatory action, and I'm not seeing any signs of laziness.

Every time the topic of documentation comes up at work, at multiple workplaces, someone always says a variant of the following:

What we really need is markdown in a git repository. We get version control, there is a lot of tooling to make markdown work good in git, it's great

And every time I have to grit my teeth and hope I don't cause dental damage. My core complaint is that internal documentation has fundamentally different objectives than open source software documentation repositories, and pretending they're the same problem domain means we'll be re-having the documentation discussion in 18 to 24 months.

The examples of OSS projects using markdown or asciidoc as their documentation repository are many, and it works pretty well. Markdown and asciidoc are markup, which allows compilers to turn the marked up doc into rendered sites. This makes accepting contributions from the community much easier, because it follows the same merge-request workflow as code. As most OSS projects are chronically under-staffed, anything that allows reuse of process is a win. Also, markdown and asciidoc are relatively simple formats so you don't need expensive software like Adobe InDesign to make  them.

OSS project docs are focused on several jobs to be done, and questions by readers:

  • How to install the thing
  • How to configure the thing
  • How to upgrade the thing
  • How to build various workflows the thing allows you to do
  • Troubleshooting tips for the thing
  • How often to expect releases of the thing
  • How to integrate with other things, if this thing allows integration
  • How to use the thing's API
  • Where to find the thing's SDK for various languages

Corporate internal documentation repositories need to do all of the above, but generally for a much wider range of things and services. Cool, that's what standards are for. But "markdown in a git repo" goes a bit off the rails when you look at all the other types of documentation internal docs often cover:

  • On-call rotation standards and contacts
  • Pager-playbooks for the page-out alarms
  • Incident Management program procedures and definitions
  • Post incident review documents for each incident
  • Service maturity standards for being allowed in prod
  • Ownership documentation linking services to individual teams (updated or re-created after each reorg)
  • Decision docs for implementing features or updating process
  • Roadmap documentation going out three years (new docs generated quarterly)
  • How to set up your development environment
  • How to access prod, and who is allowed to access prod
  • Protocols for accessing the datacenter hardware or cloud config consoles
  • The entire software development lifecycle (SDLC) including how CI works, what tests are required when, how tests are selected for inclusion, which linters are included, and when it's allowed to ignore all that because of an emergency

And so on. The sneaky part here is that the OSS projects have many of the above as well, but they're kept in things like Google Docs, Etherpads, Wikis, Twillio, Canvases in Slack, many things that are definitely not involving the merge request workflow into git. All of these extra internal documentation repository jobs to be done greatly complicate what solutions count as viable, in large part because this huge list is actually trying to 'simplify' multiple documentation styles into a single monolithic document repository. What styles are these? Well:

  • Product documentation, describing how to install, configure, and maintain the product.
  • Process documentation, describing the ways various people-driven procedures are done, such as the incident management process and the number of review meetings that need to be held before a feature is released to production.
  • Decision documentation, which evolves over time as people work through what an ultimate decision will look like, changing their minds along the way. Post-incident review docs are of this type.
  • Responder runbooks, used by people responding to incidents to use pre-defined (and risk vetted) procedures as part of incident response.
  • Maintenance runbooks, used by operators of the system to do various things, which is often based on a combination of product and process documentation, to create a grand unified procedure in one document.

All of these documentation styles need somewhat different document lifecycles, which in turn drives need to support workflows. A document lifecycle ensures that documentation is valid, up to date, and old information is removed. Sometimes documentation is a key part of compliance with regulation or industry standard-setting bodies, which adds review steps.

  • Product documentation probably needs multi-step reviews to ensure updates are valid. Confluence is terrible for this, git is less bad. Product docs also need regular review for freshness, and pruning of no longer relevant docs.
  • Process documentation less obviously needs multi-step review. Some will, some won't. Freshness is key, since process documentation describes the how of operating the system or accessing human processes, and old docs pollute search results.
  • Decision documentation definitely does not need multi-step review, it needs to be updated by anyone involved, and may be surplus to requirements once the feature is built. In fact, these docs need to allow collaborative editing, like Etherpad or Google Docs, making them fundamentally unsuited for a git-based workflow. However, having such docs still around is occasionally useful later in time when someone tries to figure out "who thought this was a good idea, and why didn't they consider this obvious failure case?"
  • Responder runbooks also can have compliance interactions; if so, these need multi-step review for risk management decisions. If not, they're probably a per-team free for all. As is the way of responder runbooks, rare errors are nigh impossible to check for freshness so these are the least likely to be verifiably up to date.
  • Maintenance runbooks run the gamut from per team free for all to onerous multi-step review process, all depending on the risks of doing the thing and the nature of the business.

Ideally, the high lifecycle docs like product and process documentation would be in one system, with the minimal lifecycle docs like decision review and responder runbooks in another system entirely. This would allow each system to cater to the needs of the styles within, and solve more of the business' problems. I would like a two-system solution very much.


People have spent the last 25 years being trained that how you find documentation is:

  1. Look in the obvious place. If you don't find it....
  2. Search google. If that doesn't work, retry your terms. If after three tries you still haven't found it....
  3. Complain on social media.

A two doc-system solution is not well tolerated, and people will build a "universal search" engine to search both the high and low process repositories. Also, two doc systems seems like a lot of overhead. And how do you make sure the right docs go in the right system? Why not use one doc system that's sort of okay at both jobs and save money? 18 to 24 months later, discontent at how bad the "sort of okay" solution is rises and people advocate to moving to a new thing, and suggest markdown in a git repo.

I've been in Firefox a long time

I intended to write a "history of my browser usage" post as part of a longer piece on the Chrome monoculture, but this blog is nearly 20 years old and it turns out I already did a history.

I can't find when I permanently dropped SeaMonkey, but it was after 2010. I dropped SeaMonkey late 2013 (thank you stale profile directory with a date stamp) when it was clearly abandonware and I learned you actually could launch Firefox in parallel with multiple profiles (the

firefox -p --no-remote

combination was key). I stopped using multiple profiles when the Container plugin came out that did nearly everything separate profiles did. It turns out SeaMonkey is still getting updates, but it seems to be tracking the Firefox and Thunderbird Extended Service Releases.

For those of you too old to remember the original Netscape Navigator, it also came with a few tools beyond the browser:

  • The browser, of course
  • An email client, since this was before GMail and web-editors for email weren't really a thing yet
  • An HTML editor (for pre-CSS versions of editor)

The reason I liked SeaMonkey and Opera is they both still shipped with an email client. It was pretty nice, actually. I kept Opera around as my email client way past when I stopped using it for general browsing. I'm fuzzy on what I did after Opera dropped their mail client, I may have grumpily transitioned onto Gnome Evolution at that point. Also, Gmail was out and I was quite used to web-based email clients.

So yeah, I've been in Firefox for over a decade at this point.

This is a controversial take, but the phrase "it's industry standard" is over-used in technical design discussions of the internal variety.

Yes, there are some actual full up standards. Things like RFCs and ISO-standards are actual standards. There are open standards that are widely adopted, like OpenTelemetry and the Cloud Native Computing Foundation suite, but these are not yet industry standards. The phrase "industry standard" implies consensus, agreement, a uniform way of working in a specific area.

Have you seen the tech industry? Really seen it? It is utterly vast. The same industry includes such categories as:

  • Large software as a service providers like Salesforce and Outlook.com
  • Medium software as a service providers like Box.com and Dr. Chrono
  • Small software as a service providers like every bay area startup made in the last five years
  • Large embedded systems design like the entire automotive industry
  • Highly regulated industries like Health Care and Finance, where how you operate is strongly influenced by the government and similar non-tech organizations
  • The IT departments at all of the above, which is much smaller than they used to be due to the SaaS revolution, but still exist
  • Scientific computing for things like space probes, satellite base systems, and remote sensing platforms floating the oceans
  • Internal services work at companies that don't sell technology, places like UPS, Maersk, Target, and Orange County California.

The only thing the above have any kind of consensus on is "IP-based networking is better than the alternatives," and even that is a bit fragile. Such out there statements like "HTTP is a standard transport" are ones you'd think there would be consensus on, but you'd be wrong. Saying that "kubernetes patterns are industry standard" is a statement of desire, not a statement of fact.

Thing is, the Sysadmin community used this mechanic for self-policing for literal decades. Any time someone comes to the community with a problem, it has to pass a "best practices" smell test before we consider answering the question as asked; otherwise, we'll interrogate the bad decisions that lead to this being a problem in the first place. This mechanic is 100% why ServerFault has a "reasonable business practices" close reason:

Questions should demonstrate reasonable information technology management practices. Questions that relate to unsupported hardware or software platforms or unmaintained environments may not be suitable for Server Fault.

Who sets the "best practices" for the sysadmin community? It's a group consensus of the long time members, which is slightly different between each community. There are no RFCs. There are no ISO standards. The closest we get is ITIL, the IT Infrastructure Library, which we all love to criticize anyway.

Best practices, which is "industry standard" by an older name, have always been an "I know it when I see it" thing. A tool used by industry elders to shame juniors into changing habits. Don't talk to me until you level up to the base norms of our industry, pleeb; and never mind that those norms are not canonicalized anywhere outside of my head.

This is why the phrase "it's industry standard" should not be used in internal technical design conversations

This phrase is shame based policing of concepts. If something is actually a standard, people should be able to look it up and see the history of why we do it this way.

Maybe the "industry" part of that statement is actually relevant; if that's the case, say so.

  • All of the base technology our market segment run on is made by three companies, so we do what they require.
  • Our industry are startups founded in 2010-2015 by ex-Googlers, so our standard is what Google did then.
  • Our industry computerized in the 1960s and has consumers in high tech and high poverty areas, so we need to keep decades of backwards compatibility.
  • Our industry is VC-funded SaaS startups founded after 2018 in the United States, who haven''t exited yet. So we need to stay on top of the latest innovations to ensure our funding rounds are successful.
  • Our industry is dominated by on-prem Java shops, so we have to be Java as well in order to sell into this market.

These are useful, important constraints and context for people to know. The vague phrase "industry standard" does not communicate context or constraints beyond, "your solution is bad, and you should feel bad for suggesting it." Shame is not how we maintain generative cultures.

It's time to drop "it's industry standard" from regular use.

Some engineers at Google have put forth a proposal called Web-Environment-Integrity that has the open source community up in arms. The leading criticisms of this proposal are "Google wants to make DRM for websites" and "Google wants to ban ad-blockers." These are catchy headlines intended to capture attention, they're also mostly true. For the people who don't want to wade through the discourse, this post is about what WEI does and where it came from.

This story begins in the previous decade when Google put forth the "Zero Trust framework" as a way to get rid of the corporate VPN. Zero Trust was a suite of techniques to allow companies to do away with the expensive and annoying to maintain VPN. The core concept behind Zero Trust was something I didn't truly understand until a few years ago: Zero Trust adds device attestation (a machine login) in addition to the user attestation when deciding whether to grant access to a resource, which is a more robust security boundary than a separate VPN login.

In a company context, you can reasonably expect the company to own both the server and the machine employees are accessing internal resources from. Zero Trust also enabled servers to specify a minimum security level that clients must meet in order to have access granted, such as up to date browser and OS versions, as well as a valid device identifier. This ends up being a major security win because when an employee has their user credentials phished, an attacker can't immediately use those stolen credentials to do evil things; the attacker will have to somehow gain a device identity as well.

In companies that use VPNs, phishing the user's credential was often enough to allow creating a VPN connection. If that wasn't enough, Phishing the VPN certificates would do the trick. Full Zero Trust makes the attacker's job harder. Again, in a corporate context where the company reasonably owns both the client and server sides of the conversation.

Back to WEI: Web-Environment-Integrity is a proposal to bring a limited form of device attestation to public surfaces. While WEI won't have a device credential, it will have the ability to attest to OS and Browser versions, presence and state of browser plugins, among other things. In theory this allows bank website operators (for example) to ensure people have a browser within three months of latest, are not operating any plugins known to sniff or record keystrokes and clicks, and are not running plugins with known vulnerabilities.

Unlike Zero Trust, the company does not reasonably own the client side of the conversation in a WEI context. This radically changes the power dynamics between public users and private servers. Under current internet rules, both sides mutually distrust each other and slowly establish trust. Under WEI, the server sets a minimum trust boundary that's far higher than is currently possible, which gives server operators far more power in this conversation than before. A Zero Trust like level of power, in fact.

What does Web-Environment-Integrity allow server operators to do?

As it happens, WEI is a clear example of a technique or standard that needs to have a clear and well thought out answer to the question:

What can a well resourced malicious actor do with this framework? How can they use this safety tool to harm people?

Right now, we don't have those answers. The explainer goes into some detail about how to avoid tracking through WEI techniques, but overlooks the thing that has everyone posting "Google wants to ban ad-blockers" headlines. The WEI proposal allows server operators to prohibit the use of specific browser plugins when accessing the site, which gives ad-networks an in-browser way to say "turn off your ad-blocker in order to see this content."

The well resourced malicious actor here is the internet advertising industry, of which Alphabet (Google's parent company) is one of the biggest members. The proposal writers do not view code injection and people-tracking through advertising to be malicious, they see it as a perfectly legitimate way to pay for services.

"But it's not the server operator doing the banning, it's the attestor; and the attestor has no idea what's on the site!"

The WEI standard involves three parties: The browser making the request, the server hosting the site, and the 'attestor' service the server relies on to judge the integrity of the browser making the request. The "Google wants to ban ad-blockers" headline happens when an advertising-driven site uses an attestor service that considers browsers with ad-block plugins to be insecure. Technically it isn't the server making the "no ad-block" constraint, at least at the time of the request. The Server operator made that choice when they selected an attestor service that prohibits ad-block plugins.

This sort of deniability is all over the tech industry.

I ran into a pretty common attitude regarding workplace diversity the other day. It was on a Q/A site. Paraphrased, the issue is:

Q: How can we improve the diversity of the candidates we hire?

A: That actually hurts the diversity of your hiring pool, because many people see "diversity!" on a hiring page and immediately go somewhere else. Who wants to be hired to a place that'll give a job to an unqualified minority just to meet some numbers?

The mechanism this answerer was assuming, that diversity programs are quota systems, has been explicitly illegal in the US since the 1980s. The Supreme Court at the time ruled that quotas like that were what this answerer said: racism, even if it was intended to correct for systemic biases. If you find your prospective workplace is using quotas, or explicitly hiring following racialized patterns, you have strong grounds for a lawsuit.

Within the last year, news broke of a company that got into hot water by someone posting "I'm hiring for X, give me all of your non-binary, queer, women, and minority leads please!" to Twitter. The strong implication by this statement was that this company was using a racialized hiring process which is illegal.

These days hiring pipelines at large US companies are engineered to avoid getting sued, and therefore don't use quotas. To build a hiring pipeline that furthers a company's diversity goals, while also avoid getting sued, requires several things:

  • You must interview/treat everyone who applies equally.
  • You must assess each application the same way.
  • You must make your final hire/pass decision based on the merits of the application.
  • (US) You must give preference to military veterans, by law.

So far, this is an "equity" argument. But building a system to improve your workplace diversity needs a few more steps.

  • You can change the mix of your applicants by biasing where you advertise the job.
  • You can't hide the job posting and pass out application links to your preferred groups. You still need to post them on your jobs page.
  • Remove biasing language from your posting and job application process.
  • You still have to treat all applicants the same once they've applied.

Equity, in other words.

Furthermore, some companies are beginning to reframe their diversity programs towards a "meet the market" approach. In that they assess their diversity program success based on how well their employee mix matches the potential job market for their roles. If a given position is 82% male in the job market, that's the target they'll push for; not 50%.

Equity, because that's the most legally conservative option if you want to avoid lawsuits for discrimination in US courts.

Mathew Duggan wrote a blog post on June 9th titled, "Monitoring is a pain" where he goes into delicious detail around where observability, monitoring, and telemetry goes wrong inside organizations. I have a vested interest in this, and still agree. Mathew captures a sentiment I didn't highlight enough in my book, that a good observability platform for engineering tends to get enmeshed ever deeper into the whole company's data engineering motion, even though that engineering observability platform isn't resourced enough to really serve that broader goal all that well. This is a subtle point, but absolutely critical for diagnosing criticism of observability platforms.

By not designing your observability platform from the beginning for eventual integration into the overall data motion, you get a lot of hard to reduce misalignment of function, not to mention poorly managed availability assumptions. Did your logging platform hiccup for 20 minutes, thus robbing the business metrics people of 20 minutes of account sign-up/shutdown/churn metrics? All data is noisy, but data engineering folk really like it if the noise is predictable and thus can be modeled out. Did the metrics system have an unexpected reboot, which prevented the daily code-push to production because Delivery Engineering couldn't check their canary deploy metrics? Guess your metrics system is now a production critical system instead of the debugging tool you thought it was.

Data engineering folk like their data to be SQL shaped for a lot of reasons, but few observability and telemetry systems have an SQL interface. Mathew proposed a useful way to provide that:

When you have a log that must be stored for compliance or legal reasons, don't stick it into the same system you use to store every 200 - OK line. Write it to a database (ideally) or an object store outside of the logging pipeline. I've used DynamoDB for this and had it work pretty well by sticking it in an SQS pipeline -> Lambda -> Dynamo. Then your internal application can query this and you don't need to worry about log expiration with DynamoDB TTL.

Dynamo is SQL-like, so this method could work for doing business metrics things like the backing numbers for computing churn rate and monthly active users (MAU) numbers. Or tracking things like password resets, email changes, and gift-card usage for your fraud/abuse department. Or all admin-portal activity for your annual external audits.

Mathew also unintentionally called me out.

Chances are this isn't someones full-time job in your org, they just happened to pick up logging. It's not supposed to be a full-time gig so I totally get it. They installed a few Helm charts, put it behind an OAuth proxy and basically hoped for the best. Instead they get a constant flood of complaints from consumers of the logging system. "Logs are missing, the search doesn't work, my parser doesn't return what I expect".

That's how I got into this. "Go upgrade the Elasticsearch cluster for our logging system" is what started me on the road that lead to the Software Telemetry book. It worked for me because I was at a smaller company. Also our data people stuck their straw directly into the main database, rather than our logging system, which is absolutely a dodged bullet.

Mathew also went into some details around scaling up a metrics system. I spent a chapter and part of an appendix on this very topic. Mathew gives you a solid concrete example of the problem from the point of view of microservices/kubernetes/CNCF architectures. This stuff is frikkin hard, and people want to add things like account tags and commit-hash tags so they can do tight correlation work inside the metrics system; the sort of cardinality problem that many metrics systems aren't designed to support.

All in all Mathew lines out several common deficiencies in how monitoring/observability/telemetry is approached in many companies, especially growing companies:

  • All too often, what systems are in place are there because someone wanted it there and made it. Which means it was built casually, and probably not fit for purpose after other folks realize how useful they are, usage scope creeps, and it becomes mission critical.
  • These systems aren't resourced (in time, money, and people) commensurate with their importance to the organization, suggesting there are some serious misalignments in the org somewhere.
  • "Just pay a SaaS provider" works to a point, but eventually the bill becomes a major point of contention forcing compromise.
  • Getting too good at centralized logging means committing to a resource intensive telemetry style, a habit that's hard to break as you get bigger.
  • No one gets good at tracing if they're using Jaeger. Those that do, are an exception. Just SaaS it, sample the heck out of it, and prepare for annual complaining when the bill comes due.

The last point about Jaeger is a key one, though. Organizations green-fielding a new product often go with tracing only as their telemetry style, since it gives so much high quality data. At the same time, tracing is the single most expensive telemetry style. Unlike centralized logging, which has a high quality self-hosted systems in the form of ELK and Loki, tracing these days only has Jaeger. There is a whole SaaS industry who sell their product based on how much better they are than Jaeger.

There is a lot in the article I didn't cover, go read it.