SysAdmins have no trouble making big lists of what can go wrong and what we're doing to stave that off a little longer. The tricky problem is pushing large organizations to take a harder look at systemic risks and take them seriously. I mean, the big companies have to have disaster recovery (DR) plans for compliance reasons; but there are a lot of differences between box-ticking DR plans and comprehensive DR plans.

Any company big enough to get past the "running out of money is the biggest disaster" phase has probably spent some time thinking about what to do if things go wrong. But how do you, the engineer in the room, get the deciders to think about disasters in productive ways?

The really big disasters are obvious:

  • The datacenter catches fire after a hurricane
  • The Region goes dark due to a major earthquake
  • Pandemic flu means 60% of the office is offline at the same time
  • An engineer or automation accidentally:
    • Drops all the tables in the database
    • Deletes all the objects out of the object store
    • Destroys all the clusters/servlets/pods
    • Deconfigures the VPN
  • The above happens and you find your backups haven't worked in months

All obvious stuff, and building to deal with them will let you tick the box for compliance DR. Cool.

But there are other disasters, the sneaky ones that make you think and take a hard look at process and procedures in a way that the "oops we lost everything of [x] type" disasters generally don't.

  • An attacker subverts your laptop management software (JAMF, InTune, etc) and pushes a cryptolocker to all employee laptops
  • 30% of your application secrets got exposed through a server side request forgery (SSRF) attack
  • Nefarious personages get access to your continuous integration environment and inject trojans into your dependency chains
  • A key third party, such as your payment processor, gets ransomwared and goes offline for three weeks
  • A Slack/Teams bot got subverted and has been feeding internal data to unauthorized third parties for months

The above are all kinda "security" disasters, and that's my point. SysAdmins sometimes think of these, but even we are guilty of not having the right mental models to rattle them off the top of our heads when asked. Asking about disasters like the ones on this list should start conversations that generally don't happen. Or you get the bad case: people shrug and say "that's Security's problem, not ours," which is a sign you have a toxic reliability culture.

Security-type disasters have a phase that merely technical disasters lack: restoring trust in production systems. In a technical disaster, you can start recovery as soon as you've detected the disaster. In a security disaster, recovery has to wait until the attacker has been evicted, which can take a while. This security delay means key recovery concepts like Recovery Time and Recovery Point Objectives (RTO/RPO) will be subtly different, because the recovery clock doesn't really start until eviction is done.

If you're trying to knock loose some ossified DR thinking, these security type disasters can crack open new opportunities to make your job safer.

I've now spent over a decade teaching how alarms are supposed to work (specific, actionable, with the appropriate urgency) and even wrote a book on how to manage metrics systems. One topic I was repeatedly asked to cover in the book, but declined because the topic is big enough for its own book, is how to do metrics right. The desire for an expert to lay down how to do metrics right comes from a number of directions:

  • No one ever looked at ours in a systematic way and our alerts are terrible [This is asking about alerts, not metrics; but they were still indirectly asking about metrics]
  • We keep having incidents and our metrics aren't helping; how do we make them help?
  • Our teams have so many alarms that important ones are getting missed [Again, asking about alerts]
  • We've half-assed it, and now we're getting a growth spurt. How do we know what we should be looking for?

People really do conflate alarms/alerts with metrics, so any discussion about "how do we do metrics better" is often a "how do we do alarms better" question in disguise. As for the other two points, where people have been using vibes to pick metrics and that's no longer scaling, we actually do have a whole lot of advice; you have a whole menu of "golden signals" (latency, traffic, errors, and saturation being the classic four) to pick from depending on how your application is shaped.

That's only sort of why I'm writing this.

In the mathematical construct of Site Reliability Engineering, where everything is statistics and numerical analysis, metrics are easy. Track the things that affect availability, regularly triage your metrics to ensure continued relevance, and put human processes into place to make sure you're not burning out your worker-units. But the antiseptic concept of SRE only exists in a few places; the rest of us have to pollute the purity of math with human emotions. Let me explain.

Consider your Incident Management process. There are certain questions that commonly arise when people are doing the post incident reviews:

  • Could we have caught this before release? If so, what sort of pre-release checks should we add to catch this earlier?
  • Did we learn about this from metrics or customers? If customers, what metrics do we need to add to catch this earlier? If metrics, what processes or alarms should we tune to catch this earlier?
  • Could we have caught this before the feature flag rolled out to the Emerald users? Do we need to tune the alarm thresholds to catch issues like this in groups with less feature-usage before the high value customers on Emerald plans?

And so on. Note that each question asks about refining or adding metrics. Emotionally, metrics represent anxieties. Metrics are added to catch issues before they hurt us again. Metrics are retained because they're tracking something that used to hurt us and might hurt again. This makes removing metrics hard; the people involved remember why certain metrics are present and intuitively know those things need tracking, which means emotion says to keep them.

Metrics are scar tissue, and removing scar tissue is hard, bloody work. How do you reduce the number of metrics, while also not compromising your availability goals? You need the hard math of SRE to work down those emotions, but all it takes is one Engineering Manager to say "this prevented a SEV, keep it" to blow that effort up. This also means you'll have much better luck with a metric reformation effort if teams are already feeling the pinch of alert fatigue or your SaaS metric provider bills are getting big enough that the top of the company is looking at metric usage to reduce costs.

Sometimes, metrics feed into Business Intelligence. That's less about scar tissue and more about optimizing your company's revenue operations. Such metrics are less likely to lead to rapid-response on-call rotations, but they can still lead to months-long investigations into revenue declines. That's a different but related problem.

I could write a book about making your metrics suck less, but that book by necessity has to cover a lot of human-factors issues and has to account for the role of Incident Management in metrics sprawl. Metrics are scar tissue, keep that in mind.

In a Slack I'm on, someone asked a series of questions that boil down to:

Our company has a Reliability team, but another team is ignoring SLA/SLO obligations. What can SRE do to fix this?

I got most of the way through a multi-paragraph answer before noticing my answer was, "This isn't SRE's job, it's management's job." I figured a blog post might help explain this stance better.

The genius behind the Site Reliability Engineer concept at Google is that they figured out how to make service uptime and reliability matter to business management. The mathematical framework behind SRE is all about quantifying risk and quantifying impact, which allows quantifying lost revenue, possibly even quantifying lost sales opportunity. All this quantifying falls squarely into the management "you can't manage what you can't measure" mindset, crossed with the "if I can't measure it, it's an outside dependency I can ignore" subtext. SRE is all about making uptime and reliability a business problem worth spending management cycles on.
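To make all that quantifying concrete, here's a back-of-the-envelope sketch; the 99.9% target and the 30-day window are arbitrary numbers I picked for illustration, not anything from the questioner's company:

# minutes of downtime allowed by a 99.9% availability target over a 30-day month
echo "30 * 24 * 60 * (1 - 0.999)" | bc -l
# prints 43.200 -- roughly 43 minutes of error budget, a number you can
# multiply by revenue-per-minute to get something management will act on

That one number is the bridge between "the service was flaky" and "the flakiness cost us money," which is the whole trick.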

In the questioner's case we already have some signal that their management has integrated SRE concepts into management practice:

  • They have a Reliability team, which only happens if someone in management believes reliability is important enough to devote dedicated headcount and a manager to.
  • They have Service Level Agreement and Service Level Objective concepts in place
  • Those SLA/SLO obligations apply to more teams than the Reliability team itself, indicating there is at least some management push to distribute reliability thinking outside of the dedicated Reliability team.

The core problem the questioner is running into is that this non-compliant team is getting away with ignoring SLA/SLO obligations, and the answer to "what can SRE do to fix this" is to be found in why and how that team is getting away with it. Management is all about making trade-off decisions against competing priorities; clearly something else has become a higher priority than compliance with SLA/SLO practices. What are those more important priorities, and are they in alignment with upper management's priorities?

As soon as you start asking questions along the lines of "what can a mere individual contributor do to make another manager pay attention to their own manager," you have identified a pathological power imbalance. The one tool you have is "complain to the higher level manager to make them aware of the non-compliance," and hope that higher level manager will do the needful things. If that higher level manager does not do the needful things, the individual contributor is kind of out of luck.

Under their own authority, that is. In the case of the questioner, there is a Reliability team with a manager. This means there is someone in the management chain who officially cares about this stuff, and can raise concerns higher up the org-chart. Non-compliance with policy is supposed to be a management problem, and should have management solutions. The fact the policy in question was put in place due to SRE thinking is relevant, but not the driving concern here.


The above works for organizations that are hierarchical, which implies deeper management chains. Count the number of managers between the VP of Engineering and the average engineer; if that number is between 1.0 and 2.5, you probably have a short enough org-chart to talk to the team in question directly and do some education (bridging the org-chart, to use Dr. Westrum's term). If the org-chart is deeper than 2.5 managers, you're better served going through the org-chart to solve this particular problem.

But if you're in a short org-chart company, and that other team is still refusing to comply with SLA/SLO policies, you're kind of stuck complaining to the VP of Engineering and hoping that individual forces alignment through some method. If the VPofE doesn't, that is a clear signal that Reliability is not as important to management as you thought, and you should go back to the fundamentals of making the case for prioritizing SRE practices generally.

...will never happen more than once at a company.

I say this knowing that chunks of Germany's civil infrastructure managed to standardize on SuSE desktops, and some may still be using SuSE. Some might view this as proof it can be done; I say that Linux desktops not spreading beyond this example is proof of why it won't happen elsewhere. The biggest reason we have the German example is that the decision was top down. Government decision making is different from corporate decision making, which is why we're not going to see the same thing, a Linux desktop (actually laptop) mandate from on high, happen more than a few times; especially in the tech industry.

It all comes down to management and why Linux laptop users are using Linux in the first place.

You see, corporate laptops (hereafter referred to as "endpoints" to match management lingo) have certain constraints placed upon them when small companies become big companies:

  • You need some form of anti-virus and anti-malware scanning, by policy
  • You need something like either a VPN or other Zero Trust ability to do "device attestation", proving the device (endpoint) is authentic and not a hacker using stolen credentials from a person
  • You need to comply with the vulnerability management process, which means some ability to scan software versions on an endpoint and report up to a dashboard.
  • The previous three points strongly imply an ability to push software to endpoints

Windows has been able to do all four points since the 1990s. Apple came somewhat later, but this is what JAMF is for.

Then there is Linux. It is technically possible to do all of the above. Some tools, like osquery, were built for Linux first because the intended use was on servers. However, there is a big problem with Linux users. Get 10 Linux users in a room, and you're quite likely to get 10 different combinations of display server (Xorg or Wayland), window manager (GNOME, KDE, i3, others), and OS package manager. You need to either support that heterogeneity or commit to building the Enterprise Linux that picks one from each category and forbids the others. Enterprise Linux is what the German example did.
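To make the heterogeneity problem concrete, here's a rough sketch of what the vulnerability-scanning requirement looks like through osquery. Even the table you have to query changes with the package manager; the openssl package is just an arbitrary example:

# RPM-family endpoint (openSUSE, Fedora, RHEL):
osqueryi "SELECT name, version FROM rpm_packages WHERE name = 'openssl';"

# Debian-family endpoint (Debian, Ubuntu):
osqueryi "SELECT name, version FROM deb_packages WHERE name = 'openssl';"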

That Enterprise Linux route is when the Linux users revolt, because banning their tiling window manager in favor of Xorg/Gnome ruins their flow -- and similar complaints. The Windows and Apple users forced onto Linux will grumble about their flow changing and their favorite apps not being usable, but at least it'll be uniform. If you support all three platforms, you'll still get about 5% Linux users, but they'll be the self-selected cranky ones who can't use the Linux they actually want. Most of that 5% will "settle" for another Linux before using Windows or Apple, but it's not the same.

And 5% Linux users puts supportability of the platform below the concentration needed to support it well. Companies like Alphabet are big enough that 5% makes a supportable population. For smaller companies like Atlassian, perhaps not. Which puts Enterprise Linux in that twilight state between outright banned and just barely supported, so long as you can tolerate all the jank.

Why tcp-mss-clamp still matters

This is blogging in anger after fighting this over the weekend. Because I'm like that, I have a backup cable ISP in case my primary fiber ISP flakes out. I work from home, so the existence of internet is critical to me getting paid, and neither cell phone has good enough service to hotspot reliably. Thus, two ISPs. It's expensive, but then so would be missing work for a week while I wait for a cable tech to come out to diagnose why their stuff isn't working.

The backup ISP hasn't been working well for a while, but the network card pointing to the second cable modem flaked out two weeks ago, and that meant a replacement, which then refused to pick up address info (v4 or v6) over DHCP. Doing a hard reset from the provider side fixed that, but left me with the curious circumstance of:

  • I can curl from the router
  • But nothing behind it can curl.
  • A packet trace of the behind-the-router case showed the TCP handshake finishing, but the TLS handshake failing after the initial hello.

What the actual fuck.

What fixed the problem was the following rule, added to my firewalld policy in /etc/firewalld/policies/backuprouter.xml.

<!-- goes inside the <policy> element of the policy file -->
<rule>
  <tcp-mss-clamp value="1448"/>
</rule>

MSS means 'maximum segment size', which is the TCP option indicating how much data the TCP portion of a packet can carry. For networks with a typical Maximum Transmission Unit (MTU) of 1500, the MSS is typically 1460. Networking over things like VPNs often trims the effective MTU due to tunnel overhead, often to 1492, with a corresponding reduction in MSS to 1452. The tcp-mss-clamp setting tells firewalld to cap MSS at 1448; if something behind the router advertises higher, the router rewrites the MSS option in the TCP handshake so both ends send segments that fit.
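For the curious, this is the same MSS-clamping trick netfilter has offered for ages. A rough iptables equivalent (purely to show the mechanism; the rules firewalld actually generates will differ) looks like:

# rewrite the MSS option in forwarded TCP SYN packets down to 1448 bytes
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1448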

The tcp-mss-clamp setting can be set to 'pmtu' which will cause firewalld to probe what the effective MTU (and by proxy MSS) number should be so you don't have to hard-code. And yet, here I am, hard-coding because crossing my own router seems to require an extra 4 bytes. I don't know why, and that angers me. Packet traces from the router itself show MSS of 1452 working fine, but that provably doesn't work from behind my router.
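If you want to hunt for the effective path MTU yourself, ping with the don't-fragment flag set is the usual blunt instrument. 1472 bytes of ICMP payload plus 28 bytes of IP and ICMP headers makes a 1500-byte packet; any far-away host will do, 8.8.8.8 is just a convenient one:

# should pass on a clean 1500-byte MTU path
ping -M do -s 1472 -c 3 8.8.8.8
# if it fails with "message too long", shrink -s until it passes;
# the largest payload that works, plus 28, is your effective path MTU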

Whatever. It works now, which is what matters, and now I'm contributing this nugget back to the internet.

20 years of this nonsense

20 years ago today I published the first post to this blog. It was a non-sequitur post because I didn't want my first post to be "is this thing on?" or similar. I had to look up what the Exchange worm I mentioned was, and it was probably MyDoom. That was a mass-mailing worm, because this was before anti-virus was a routine component of email setups. I started the blog because I needed something to host on this new "web pages from your home directory" feature I was asked to create, and this was the first content on that project! I needed something to look at to prove it worked, and having external traffic to demonstrate alongside made the demo even better. In point of fact, this blog was originally hosted on a server running NetWare 5.1. On the internet and everything!

At first I used it a bit like how I later used Twitter: small posts, sometimes multiple times a day. Micro-blogging, in other words. So no wonder that went away once Twitter showed up. The original blog software was Blogger, back when they still had a "publish by FTP" feature. When Blogger announced the death of FTP publishing, that was when I moved to this domain.

You can see what the original blog looked like on Archive.org: https://web.archive.org/web/20040609170419/http://myweb.facstaff.wwu.edu/~riedesg/sysadmin1138/archive/2004_01_01_sysadmin.html Back then Blogger didn't create a page per blog-post unless you gave the post a title, which I mostly didn't. The embarrassing misspelling in the header stuck around for way too long.

My posting frequency tapered way off around 2011 for two reasons:

  1. I got a new job in the private sector, which meant that what I was working on was covered by confidentiality policies for the first time. Previously, everything I did could be revealed with the state version of a Freedom of Information Act filing. It took me quite a while to learn the art of talking about work while not talking about work.
  2. Twitter plus the death of Google Reader ended up moving my energies elsewhere.

If you take all of my blog posts and look at the middle post, that post lands somewhere in 2007, at the peak of blogging in general. This blog remains where I post my long-form opinions! That isn't going to change any time soon.

Getting the world off Chrome

I'm seeing more and more folk post, "we got out of IE, we can get out of Chrome," on social media, referencing the nigh-monopoly Chrome has on the browsing experience. Unless you're using Firefox or Safari, you're using Chrome or a Chromium-derived browser. For those of you too young to remember what internet life under Internet Explorer was like, here is a short list of why it was not great:

  • Once Microsoft got the browser-share lock-in, it kind of stopped innovating in the browser. It had conquered the market, so it could pull back investment.
  • IE didn't follow standards. But then, Microsoft was famous for "embrace and extend," where they adopt (mostly) a standard, then Microsoftify it enough that no one considers using the non-MS version of the standard.
  • If you were on a desktop platform that didn't have IE, such as Apple Macintosh, you were kinda screwed.

Google Chrome took over from IE for three big reasons:

  • They actually were standards compliant, more so than the other alt-browsers (Mozilla's browsers, Opera, and Safari)
  • They actually were trying to innovate in the browser
  • Most important: they were a megacorp with a good reputation who wanted everyone to use their browser. Mozilla and Opera were too small for that, and Apple never has been all that comfortable supporting non-Apple platforms. In classic dot-com era thinking, Google saw a dominant market player grow complacent and smelled a business opportunity.

This made Chrome far easier to develop for, and Chrome grew a reputation for being a web developer's browser. This fit in nicely to Google's plan for the future, which they saw as full of web applications. Google understands what they have, and how they got there. They also understand "embrace and extend," but found a way to do that without making it proprietary the way Microsoft did: capture the standards committees.

If you capture the standards committees, meaning what you want is almost guaranteed a rubber stamp from the committee, then you get to define what the industry standard is. Microsoft took a capitalist, closed-source approach to embrace and extend, where the end state was that the only viable way to do a thing was the thing patent-locked into Microsoft. Google's approach is more overtly FOSSy in that they're attempting to get internet consensus for their changes, while also making it rather harder for anyone else to do what they do.

Google doesn't always win. Their "Web Environment Integrity" proposal, which would have given web site operators far greater control over browser extensions like ad-blockers, quietly got canned recently after internet outrage. Another area that got a lot of pushback from the internet was Chrome's move away from "v2 manifest" extensions, which include ad-blockers, in favor of "v3 manifest" extensions, which made ad-blockers nearly impossible to write. The move from v2 to v3 was delayed a lot while Google grudgingly put in abilities for ad-blockers to continue working.

Getting off of Chrome

The circumstances that drove the world off of Internet Explorer aren't there for Chrome.

  • Chrome innovates constantly and in generally user-improving ways (so long as that improvement doesn't majorly affect ad-revenue)
  • Chrome listens, to a point, to outrage when decisions are made
  • Chrome is functionally setting web standards, but doing so through official channels with RFCs, question periods, and all that ritual
  • Chrome continues to consider web-developer experience to be a number one priority
  • Alphabet, Google's parent company, fully understands what happens when the dominant player grows complacent: they get replaced, the way Google replaced Microsoft in the browser wars.

One thing has changed since the great IE-to-Chrome migration began: Google lost its positive reputation. The old "don't be evil" thing was abandoned a long time ago, and everyone knows it. Changes proposed by Google or Google proxies are now viewed skeptically; though overtly bad ideas still require internet outrage to delay or prevent them.

That said, you lose monopolies through either laziness of the monopolist (Microsoft) or regulatory action, and I'm not seeing any signs of laziness.

Every time the topic of documentation comes up at work, at multiple workplaces, someone always says a variant of the following:

What we really need is markdown in a git repository. We get version control, there is a lot of tooling to make markdown work good in git, it's great

And every time I have to grit my teeth and hope I don't cause dental damage. My core complaint is that internal documentation has fundamentally different objectives than open source software documentation repositories, and pretending they're the same problem domain means we'll be re-having the documentation discussion in 18 to 24 months.

The examples of OSS projects using markdown or asciidoc as their documentation repository are many, and it works pretty well. Markdown and asciidoc are markup, which allows compilers to turn the marked-up docs into rendered sites. This makes accepting contributions from the community much easier, because it follows the same merge-request workflow as code. As most OSS projects are chronically under-staffed, anything that allows reuse of process is a win. Also, markdown and asciidoc are relatively simple formats, so you don't need expensive software like Adobe InDesign to author them.

OSS project docs are focused on several jobs to be done, and questions by readers:

  • How to install the thing
  • How to configure the thing
  • How to upgrade the thing
  • How to build various workflows the thing allows you to do
  • Troubleshooting tips for the thing
  • How often to expect releases of the thing
  • How to integrate with other things, if this thing allows integration
  • How to use the thing's API
  • Where to find the thing's SDK for various languages

Corporate internal documentation repositories need to do all of the above, but generally for a much wider range of things and services. Cool, that's what standards are for. But "markdown in a git repo" goes a bit off the rails when you look at all the other types of documentation internal docs often cover:

  • On-call rotation standards and contacts
  • Pager-playbooks for the page-out alarms
  • Incident Management program procedures and definitions
  • Post incident review documents for each incident
  • Service maturity standards for being allowed in prod
  • Ownership documentation linking services to individual teams (updated or re-created after each reorg)
  • Decision docs for implementing features or updating process
  • Roadmap documentation going out three years (new docs generated quarterly)
  • How to set up your development environment
  • How to access prod, and who is allowed to access prod
  • Protocols for accessing the datacenter hardware or cloud config consoles
  • The entire software development lifecycle (SDLC) including how CI works, what tests are required when, how tests are selected for inclusion, which linters are included, and when it's allowed to ignore all that because of an emergency

And so on. The sneaky part here is that the OSS projects have many of the above as well, but they're kept in things like Google Docs, Etherpads, wikis, Trello, and Canvases in Slack, none of which involve the merge-request workflow into git. All of these extra internal documentation jobs to be done greatly complicate which solutions count as viable, in large part because this huge list is actually trying to "simplify" multiple documentation styles into a single monolithic document repository. What styles are these? Well:

  • Product documentation, describing how to install, configure, and maintain the product.
  • Process documentation, describing the ways various people-driven procedures are done, such as the incident management process and the number of review meetings that need to be held before a feature is released to production.
  • Decision documentation, which evolves over time as people work through what an ultimate decision will look like, changing their minds along the way. Post-incident review docs are of this type.
  • Responder runbooks, used by people responding to incidents to use pre-defined (and risk vetted) procedures as part of incident response.
  • Maintenance runbooks, used by operators of the system to do various things; these are often based on a combination of product and process documentation, pulled together into a grand unified procedure in one document.

All of these documentation styles need somewhat different document lifecycles, which in turn drives the need to support different workflows. A document lifecycle ensures that documentation is valid and up to date, and that old information is removed. Sometimes documentation is a key part of compliance with regulation or industry standard-setting bodies, which adds review steps.

  • Product documentation probably needs multi-step reviews to ensure updates are valid. Confluence is terrible for this; git is less bad. Product docs also need regular review for freshness, and pruning of no-longer-relevant docs.
  • Process documentation less obviously needs multi-step review. Some will, some won't. Freshness is key, since process documentation describes the how of operating the system or accessing human processes, and old docs pollute search results.
  • Decision documentation definitely does not need multi-step review, it needs to be updated by anyone involved, and may be surplus to requirements once the feature is built. In fact, these docs need to allow collaborative editing, like Etherpad or Google Docs, making them fundamentally unsuited for a git-based workflow. However, having such docs still around is occasionally useful later in time when someone tries to figure out "who thought this was a good idea, and why didn't they consider this obvious failure case?"
  • Responder runbooks also can have compliance interactions; if so, these need multi-step review for risk management decisions. If not, they're probably a per-team free-for-all. As is the way of responder runbooks, the procedures for rare errors are nigh impossible to check for freshness, so these are the least likely to be verifiably up to date.
  • Maintenance runbooks run the gamut from per team free for all to onerous multi-step review process, all depending on the risks of doing the thing and the nature of the business.

Ideally, the high-lifecycle docs like product and process documentation would be in one system, with the minimal-lifecycle docs like decision reviews and responder runbooks in another system entirely. This would allow each system to cater to the needs of the styles within it, and solve more of the business' problems. I would like a two-system solution very much.

Except.

People have spent the last 25 years being trained that how you find documentation is:

  1. Look in the obvious place. If you don't find it....
  2. Search google. If that doesn't work, retry your terms. If after three tries you still haven't found it....
  3. Complain on social media.

A two doc-system solution is not well tolerated, and people will build a "universal search" engine to search both the high- and low-lifecycle repositories. Also, two doc systems seems like a lot of overhead. And how do you make sure the right docs go in the right system? Why not use one doc system that's sort of okay at both jobs and save money? 18 to 24 months later, discontent at how bad the "sort of okay" solution is rises, and people advocate moving to a new thing, and suggest markdown in a git repo.

I've been in Firefox a long time

I intended to write a "history of my browser usage" post as part of a longer piece on the Chrome monoculture, but this blog is nearly 20 years old and it turns out I already did a history.

I couldn't pin down when I permanently dropped SeaMonkey beyond "after 2010," but a stale profile directory with a date stamp says late 2013, when it was clearly abandonware and I learned you actually could launch Firefox in parallel with multiple profiles (the

firefox -p --no-remote

combination was key). I stopped using multiple profiles when the Container plugin came out, which did nearly everything separate profiles did. It turns out SeaMonkey is still getting updates, but it seems to be tracking the Firefox and Thunderbird Extended Support Releases.

For those of you too young to remember the original Netscape Navigator, it also came with a few tools beyond the browser:

  • The browser, of course
  • An email client, since this was before GMail and web-editors for email weren't really a thing yet
  • An HTML editor (for the pre-CSS era of HTML)

The reason I liked SeaMonkey and Opera is that they both still shipped with an email client. It was pretty nice, actually. I kept Opera around as my email client way past when I stopped using it for general browsing. I'm fuzzy on what I did after Opera dropped their mail client; I may have grumpily transitioned onto Gnome Evolution at that point. Also, Gmail was out and I was quite used to web-based email clients.

So yeah, I've been in Firefox for over a decade at this point.

This is a controversial take, but the phrase "it's industry standard" is over-used in technical design discussions of the internal variety.

Yes, there are some actual full up standards. Things like RFCs and ISO-standards are actual standards. There are open standards that are widely adopted, like OpenTelemetry and the Cloud Native Computing Foundation suite, but these are not yet industry standards. The phrase "industry standard" implies consensus, agreement, a uniform way of working in a specific area.

Have you seen the tech industry? Really seen it? It is utterly vast. The same industry includes such categories as:

  • Large software as a service providers like Salesforce and Outlook.com
  • Medium software as a service providers like Box.com and Dr. Chrono
  • Small software as a service providers like every bay area startup made in the last five years
  • Large embedded systems design like the entire automotive industry
  • Highly regulated industries like Health Care and Finance, where how you operate is strongly influenced by the government and similar non-tech organizations
  • The IT departments at all of the above, which are much smaller than they used to be due to the SaaS revolution, but still exist
  • Scientific computing for things like space probes, satellite base systems, and remote sensing platforms floating in the oceans
  • Internal services work at companies that don't sell technology, places like UPS, Maersk, Target, and Orange County, California.

The only thing the above have any kind of consensus on is "IP-based networking is better than the alternatives," and even that is a bit fragile. You would think there'd be consensus on statements like "HTTP is a standard transport," but you'd be wrong. Saying that "Kubernetes patterns are industry standard" is a statement of desire, not a statement of fact.

Thing is, the Sysadmin community has used this mechanic for self-policing for literal decades. Any time someone comes to the community with a problem, it has to pass a "best practices" smell test before we consider answering the question as asked; otherwise, we'll interrogate the bad decisions that led to this being a problem in the first place. This mechanic is 100% why ServerFault has a "reasonable business practices" close reason:

Questions should demonstrate reasonable information technology management practices. Questions that relate to unsupported hardware or software platforms or unmaintained environments may not be suitable for Server Fault.

Who sets the "best practices" for the sysadmin community? It's a group consensus of the long-time members, which differs slightly between communities. There are no RFCs. There are no ISO standards. The closest we get is ITIL, the IT Infrastructure Library, which we all love to criticize anyway.

Best practices, which are "industry standard" by an older name, have always been an "I know it when I see it" thing. A tool used by industry elders to shame juniors into changing habits. Don't talk to me until you level up to the base norms of our industry, pleeb; and never mind that those norms are not canonicalized anywhere outside of my head.

This is why the phrase "it's industry standard" should not be used in internal technical design conversations.

This phrase is shame-based policing of concepts. If something is actually a standard, people should be able to look it up and see the history of why we do it this way.

Maybe the "industry" part of that statement is actually relevant; if that's the case, say so.

  • All of the base technology our market segment runs on is made by three companies, so we do what they require.
  • Our industry are startups founded in 2010-2015 by ex-Googlers, so our standard is what Google did then.
  • Our industry computerized in the 1960s and has consumers in high tech and high poverty areas, so we need to keep decades of backwards compatibility.
  • Our industry is VC-funded SaaS startups founded after 2018 in the United States, who haven't exited yet. So we need to stay on top of the latest innovations to ensure our funding rounds are successful.
  • Our industry is dominated by on-prem Java shops, so we have to be Java as well in order to sell into this market.

These are useful, important constraints and context for people to know. The vague phrase "industry standard" does not communicate context or constraints beyond, "your solution is bad, and you should feel bad for suggesting it." Shame is not how we maintain generative cultures.

It's time to drop "it's industry standard" from regular use.