Getting the world off Chrome

I'm seeing more and more folk post, "we got out of IE, we can get out of Chrome," on social media, referencing the nigh-monopoly Chrome has on the browsing experience. Unless you're using Firefox or Safari, you're using Chrome or a Chromium-derived browser. For those of you too young to remember what internet life under Internet Explorer was like, here is a short list of why it was not great:

  • Once Microsoft got the browser-share lock-in, it kind of stopped innovating the browser. Having conquered the market, it could pull back investment in it.
  • IE didn't follow standards. But then, Microsoft was famous for "embrace and extend," where they adopt (mostly) a standard, then Microsoftify it enough that no one considers using the non-MS version of the standard.
  • If you were on a desktop platform that didn't have IE, such as Apple Macintosh, you were kinda screwed.

Google Chrome took over from IE for three big reasons:

  • They actually were standards compliant, more so than the other alt-browsers (Mozilla's browsers, Opera, and Safari)
  • They actually were trying to innovate in the browser
  • Most important: they were a megacorp with a good reputation who wanted everyone to use their browser. Mozilla and Opera were too small for that, and Apple never has been all that comfortable supporting non-Apple platforms. In classic dot-com era thinking, Google saw a dominant market player grow complacent and smelled a business opportunity.

This made Chrome far easier to develop for, and Chrome grew a reputation for being a web developer's browser. This fit nicely into Google's plan for the future, which they saw as full of web applications. Google understands what they have, and how they got there. They also understand "embrace and extend," but found a way to do that without making it proprietary the way Microsoft did: capture the standards committees.

If you capture the standards committees, meaning what you want is almost guaranteed a rubber stamp from the committee, then you get to define what the industry standard is. Microsoft took a capitalist, closed-source approach to embrace and extend, where the end state was that the only viable way to do a thing was the one patent-locked into Microsoft. Google's approach is more overtly FOSS-y in that they're attempting to get internet consensus for their changes, while also making it rather harder for anyone else to do what they do.

Google doesn't always win. Their "web environment integrity" proposal, which would have given web site operators far greater control over browser extensions like ad-blockers, quietly got canned recently after internet outrage. Another area that got a lot of pushback from the internet was Chrome's move away from "v2 manifest" extensions, which include ad-blockers, in favor of "v3 manifest" extensions, which made ad-blockers nearly impossible to write. The move from v2 to v3 was delayed repeatedly while Google grudgingly added capabilities that let ad-blockers keep working.

Getting off of Chrome

The circumstances that drove the world off of Internet Explorer aren't there for Chrome.

  • Chrome innovates constantly and in generally user-improving ways (so long as that improvement doesn't majorly affect ad-revenue)
  • Chrome listens, to a point, to outrage when decisions are made
  • Chrome is functionally setting web standards, but doing so through official channels with RFCs, question periods, and all that ritual
  • Chrome continues to consider web-developer experience to be a number one priority
  • Alphabet, Google's parent company, fully understands what happens when the dominant player grows complacent: they get replaced, the way Google replaced Microsoft in the browser wars.

One thing has changed since the great IE-to-Chrome migration began: Google lost its positive reputation. The old "don't be evil" thing was abandoned a long time ago, and everyone knows it. Changes proposed by Google or Google proxies are now viewed skeptically, though overtly bad ideas still require internet outrage to delay or prevent a proposal from happening.

That said, you lose monopolies through either laziness of the monopolist (Microsoft) or regulatory action, and I'm not seeing any signs of laziness.

Every time the topic of documentation comes up at work, at multiple workplaces, someone always says a variant of the following:

What we really need is markdown in a git repository. We get version control, there is a lot of tooling to make markdown work good in git, it's great

And every time I have to grit my teeth and hope I don't cause dental damage. My core complaint is that internal documentation has fundamentally different objectives than open source software documentation repositories, and pretending they're the same problem domain means we'll be re-having the documentation discussion in 18 to 24 months.

The examples of OSS projects using markdown or asciidoc as their documentation repository are many, and it works pretty well. Markdown and asciidoc are markup languages, which lets tooling compile the marked-up docs into rendered sites. This makes accepting contributions from the community much easier, because it follows the same merge-request workflow as code. As most OSS projects are chronically under-staffed, anything that allows reuse of process is a win. Also, markdown and asciidoc are relatively simple formats, so you don't need expensive software like Adobe InDesign to make them.
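
For what it's worth, the "compile the repo into a doc site" step really is the easy part. A minimal sketch, assuming the Python markdown package and a docs/ directory of .md files (the directory names are placeholders, not any particular project's layout):

  import pathlib
  import markdown  # pip install markdown

  SRC = pathlib.Path("docs")   # assumed repo layout: docs/**/*.md in the git repo
  OUT = pathlib.Path("site")

  for md_file in SRC.glob("**/*.md"):
      # Render each markdown file to an HTML file mirroring the source tree.
      html = markdown.markdown(md_file.read_text(encoding="utf-8"))
      target = OUT / md_file.relative_to(SRC).with_suffix(".html")
      target.parent.mkdir(parents=True, exist_ok=True)
      target.write_text(html, encoding="utf-8")

In practice people reach for MkDocs, Sphinx, Hugo, or similar rather than rolling their own, but the point stands: rendering is cheap, which is a big part of the format's appeal.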

OSS project docs are focused on several jobs to be done, and questions by readers:

  • How to install the thing
  • How to configure the thing
  • How to upgrade the thing
  • How to build various workflows the thing allows you to do
  • Troubleshooting tips for the thing
  • How often to expect releases of the thing
  • How to integrate with other things, if this thing allows integration
  • How to use the thing's API
  • Where to find the thing's SDK for various languages

Corporate internal documentation repositories need to do all of the above, but generally for a much wider range of things and services. Cool, that's what standards are for. But "markdown in a git repo" goes a bit off the rails when you look at all the other types of documentation internal docs often cover:

  • On-call rotation standards and contacts
  • Pager-playbooks for the page-out alarms
  • Incident Management program procedures and definitions
  • Post incident review documents for each incident
  • Service maturity standards for being allowed in prod
  • Ownership documentation linking services to individual teams (updated or re-created after each reorg)
  • Decision docs for implementing features or updating process
  • Roadmap documentation going out three years (new docs generated quarterly)
  • How to set up your development environment
  • How to access prod, and who is allowed to access prod
  • Protocols for accessing the datacenter hardware or cloud config consoles
  • The entire software development lifecycle (SDLC) including how CI works, what tests are required when, how tests are selected for inclusion, which linters are included, and when it's allowed to ignore all that because of an emergency

And so on. The sneaky part here is that the OSS projects have many of the above as well, but they're kept in things like Google Docs, Etherpads, Wikis, Twillio, and Canvases in Slack, none of which involve the merge-request workflow into git. All of these extra internal documentation repository jobs to be done greatly complicate what solutions count as viable, in large part because this huge list is actually trying to 'simplify' multiple documentation styles into a single monolithic document repository. What styles are these? Well:

  • Product documentation, describing how to install, configure, and maintain the product.
  • Process documentation, describing the ways various people-driven procedures are done, such as the incident management process and the number of review meetings that need to be held before a feature is released to production.
  • Decision documentation, which evolves over time as people work through what an ultimate decision will look like, changing their minds along the way. Post-incident review docs are of this type.
  • Responder runbooks, used by people responding to incidents to follow pre-defined (and risk-vetted) procedures as part of incident response.
  • Maintenance runbooks, used by operators of the system to do various things, often combining product and process documentation into a grand unified procedure in one document.

All of these documentation styles need somewhat different document lifecycles, which in turn drives the need to support different workflows. A document lifecycle ensures that documentation is valid and up to date, and that old information is removed. Sometimes documentation is a key part of compliance with regulation or industry standard-setting bodies, which adds review steps.

  • Product documentation probably needs multi-step reviews to ensure updates are valid. Confluence is terrible for this; git is less bad. Product docs also need regular review for freshness, and pruning of no-longer-relevant docs (a sketch of such a freshness check follows this list).
  • Process documentation less obviously needs multi-step review. Some will, some won't. Freshness is key, since process documentation describes the how of operating the system or accessing human processes, and old docs pollute search results.
  • Decision documentation definitely does not need multi-step review, it needs to be updated by anyone involved, and may be surplus to requirements once the feature is built. In fact, these docs need to allow collaborative editing, like Etherpad or Google Docs, making them fundamentally unsuited for a git-based workflow. However, having such docs still around is occasionally useful later in time when someone tries to figure out "who thought this was a good idea, and why didn't they consider this obvious failure case?"
  • Responder runbooks can also have compliance interactions; if so, these need multi-step review for risk management decisions. If not, they're probably a per-team free-for-all. As is the way of responder runbooks, the ones for rare errors are nigh impossible to check for freshness, so these are the least likely to be verifiably up to date.
  • Maintenance runbooks run the gamut from per-team free-for-all to onerous multi-step review process, all depending on the risks of doing the thing and the nature of the business.
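
Here is that freshness-check sketch, assuming the docs live as markdown in a git repo under docs/ and that 180 days is your staleness threshold; both the path and the threshold are placeholders, not a recommendation:

  import pathlib
  import subprocess
  import time

  STALE_AFTER_DAYS = 180   # arbitrary threshold, purely for illustration
  now = time.time()

  for doc in pathlib.Path("docs").glob("**/*.md"):
      # Ask git for the unix timestamp of the last commit that touched this file.
      stamp = subprocess.run(
          ["git", "log", "-1", "--format=%ct", "--", str(doc)],
          capture_output=True, text=True, check=True,
      ).stdout.strip()
      if stamp and (now - int(stamp)) > STALE_AFTER_DAYS * 86400:
          age_days = (now - int(stamp)) / 86400
          print(f"STALE: {doc} (last touched {age_days:.0f} days ago)")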

Ideally, the high lifecycle docs like product and process documentation would be in one system, with the minimal lifecycle docs like decision review and responder runbooks in another system entirely. This would allow each system to cater to the needs of the styles within, and solve more of the business' problems. I would like a two-system solution very much.

Except.

People have spent the last 25 years being trained that how you find documentation is:

  1. Look in the obvious place. If you don't find it....
  2. Search Google. If that doesn't work, retry your terms. If after three tries you still haven't found it....
  3. Complain on social media.

A two doc-system solution is not well tolerated, and people will build a "universal search" engine to search both the high- and minimal-lifecycle repositories. Also, two doc systems seems like a lot of overhead. And how do you make sure the right docs go in the right system? Why not use one doc system that's sort of okay at both jobs and save money? 18 to 24 months later, discontent at how bad the "sort of okay" solution is rises, people advocate moving to a new thing, and someone suggests markdown in a git repo.

I've been in Firefox a long time

I intended to write a "history of my browser usage" post as part of a longer piece on the Chrome monoculture, but this blog is nearly 20 years old and it turns out I already did a history.

The old history doesn't say when I permanently dropped SeaMonkey, only that it was after 2010. As it happens, I dropped it in late 2013 (thank you, stale profile directory with a date stamp), when it was clearly abandonware and I learned you actually could launch Firefox in parallel with multiple profiles (the

firefox -p --no-remote

combination was key). I stopped using multiple profiles when the Container plugin came out, which did nearly everything separate profiles did. It turns out SeaMonkey is still getting updates, but it seems to be tracking the Firefox and Thunderbird Extended Support Releases.

For those of you too young to remember the original Netscape Navigator, it also came with a few tools beyond the browser:

  • The browser, of course
  • An email client, since this was before Gmail and web-based email clients weren't really a thing yet
  • An HTML editor (for pre-CSS versions of HTML)

The reason I liked SeaMonkey and Opera is they both still shipped with an email client. It was pretty nice, actually. I kept Opera around as my email client way past when I stopped using it for general browsing. I'm fuzzy on what I did after Opera dropped their mail client; I may have grumpily transitioned onto Gnome Evolution at that point. Also, Gmail was out by then and I was quite used to web-based email clients.

So yeah, I've been in Firefox for over a decade at this point.

This is a controversial take, but the phrase "it's industry standard" is over-used in technical design discussions of the internal variety.

Yes, there are some actual full up standards. Things like RFCs and ISO-standards are actual standards. There are open standards that are widely adopted, like OpenTelemetry and the Cloud Native Computing Foundation suite, but these are not yet industry standards. The phrase "industry standard" implies consensus, agreement, a uniform way of working in a specific area.

Have you seen the tech industry? Really seen it? It is utterly vast. The same industry includes such categories as:

  • Large software as a service providers like Salesforce and Outlook.com
  • Medium software as a service providers like Box.com and Dr. Chrono
  • Small software as a service providers like every bay area startup made in the last five years
  • Large embedded systems design like the entire automotive industry
  • Highly regulated industries like Health Care and Finance, where how you operate is strongly influenced by the government and similar non-tech organizations
  • The IT departments at all of the above, which are much smaller than they used to be due to the SaaS revolution, but still exist
  • Scientific computing for things like space probes, satellite base systems, and remote sensing platforms floating in the oceans
  • Internal services work at companies that don't sell technology, places like UPS, Maersk, Target, and Orange County, California.

The only thing the above have any kind of consensus on is "IP-based networking is better than the alternatives," and even that is a bit fragile. You'd think a statement as basic as "HTTP is a standard transport" would have consensus behind it, but you'd be wrong. Saying that "Kubernetes patterns are industry standard" is a statement of desire, not a statement of fact.

Thing is, the sysadmin community has used this mechanic for self-policing for literal decades. Any time someone comes to the community with a problem, it has to pass a "best practices" smell test before we consider answering the question as asked; otherwise, we'll interrogate the bad decisions that led to this being a problem in the first place. This mechanic is 100% why ServerFault has a "reasonable business practices" close reason:

Questions should demonstrate reasonable information technology management practices. Questions that relate to unsupported hardware or software platforms or unmaintained environments may not be suitable for Server Fault.

Who sets the "best practices" for the sysadmin community? It's a group consensus of the long-time members, which differs slightly between communities. There are no RFCs. There are no ISO standards. The closest we get is ITIL, the IT Infrastructure Library, which we all love to criticize anyway.

Best practices, which is just "industry standard" by an older name, has always been an "I know it when I see it" thing. A tool used by industry elders to shame juniors into changing habits. Don't talk to me until you level up to the base norms of our industry, pleb; and never mind that those norms are not canonicalized anywhere outside of my head.

This is why the phrase "it's industry standard" should not be used in internal technical design conversations

This phrase is shame based policing of concepts. If something is actually a standard, people should be able to look it up and see the history of why we do it this way.

Maybe the "industry" part of that statement is actually relevant; if that's the case, say so.

  • All of the base technology our market segment runs on is made by three companies, so we do what they require.
  • Our industry is startups founded in 2010-2015 by ex-Googlers, so our standard is what Google did then.
  • Our industry computerized in the 1960s and has consumers in high tech and high poverty areas, so we need to keep decades of backwards compatibility.
  • Our industry is VC-funded SaaS startups founded after 2018 in the United States, who haven't exited yet. So we need to stay on top of the latest innovations to ensure our funding rounds are successful.
  • Our industry is dominated by on-prem Java shops, so we have to be Java as well in order to sell into this market.

These are useful, important constraints and context for people to know. The vague phrase "industry standard" does not communicate context or constraints beyond, "your solution is bad, and you should feel bad for suggesting it." Shame is not how we maintain generative cultures.

It's time to drop "it's industry standard" from regular use.

Some engineers at Google have put forth a proposal called Web-Environment-Integrity that has the open source community up in arms. The leading criticisms of this proposal are "Google wants to make DRM for websites" and "Google wants to ban ad-blockers." These are catchy headlines intended to capture attention; they're also mostly true. For the people who don't want to wade through the discourse, this post is about what WEI does and where it came from.

This story begins in the previous decade, when Google put forth the "Zero Trust framework" as a suite of techniques to let companies do away with the expensive and annoying-to-maintain corporate VPN. The core concept behind Zero Trust was something I didn't truly understand until a few years ago: Zero Trust adds device attestation (a machine login) in addition to the user attestation when deciding whether to grant access to a resource, which is a more robust security boundary than a separate VPN login.

In a company context, you can reasonably expect the company to own both the server and the machine employees are accessing internal resources from. Zero Trust also enabled servers to specify a minimum security level that clients must meet in order to have access granted, such as up to date browser and OS versions, as well as a valid device identifier. This ends up being a major security win because when an employee has their user credentials phished, an attacker can't immediately use those stolen credentials to do evil things; the attacker will have to somehow gain a device identity as well.

In companies that use VPNs, phishing the user's credential was often enough to allow creating a VPN connection. If that wasn't enough, phishing the VPN certificates would do the trick. Full Zero Trust makes the attacker's job harder. Again, this is in a corporate context where the company reasonably owns both the client and server sides of the conversation.
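
To make the "device attestation plus user attestation" idea concrete, here is a caricature of the access decision. Every name, type, and threshold here is a hypothetical stand-in (real deployments lean on an identity provider and an MDM or EDR agent); this is just the shape of the logic:

  from dataclasses import dataclass
  from typing import Optional

  MIN_OS_VERSION = (14, 0)   # hypothetical posture floor

  @dataclass
  class User:
      id: str

  @dataclass
  class Device:
      owner: str
      managed: bool
      os_version: tuple

  def verify_user_token(token: str) -> Optional[User]:
      # Stand-in for an identity provider check (OIDC, SAML, etc.).
      return User(id="alice") if token == "valid-user-token" else None

  def lookup_device(cert_fingerprint: str) -> Optional[Device]:
      # Stand-in for an MDM/device-inventory lookup keyed by a client certificate.
      inventory = {"abc123": Device(owner="alice", managed=True, os_version=(14, 2))}
      return inventory.get(cert_fingerprint)

  def allow_request(user_token: str, cert_fingerprint: str) -> bool:
      user = verify_user_token(user_token)
      device = lookup_device(cert_fingerprint)
      if user is None or device is None:
          return False                      # phished user creds alone are not enough
      if not device.managed or device.owner != user.id:
          return False                      # must be this user's company-managed device
      if device.os_version < MIN_OS_VERSION:
          return False                      # posture floor: patched OS
      return True

  print(allow_request("valid-user-token", "abc123"))   # True
  print(allow_request("valid-user-token", "unknown"))  # False: no device identity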

Back to WEI: Web-Environment-Integrity is a proposal to bring a limited form of device attestation to public surfaces. While WEI won't have a device credential, it will have the ability to attest to OS and Browser versions, presence and state of browser plugins, among other things. In theory this allows bank website operators (for example) to ensure people have a browser within three months of latest, are not operating any plugins known to sniff or record keystrokes and clicks, and are not running plugins with known vulnerabilities.

Unlike Zero Trust, the company does not reasonably own the client side of the conversation in a WEI context. This radically changes the power dynamics between public users and private servers. Under current internet rules, both sides mutually distrust each other and slowly establish trust. Under WEI, the server sets a minimum trust boundary that's far higher than is currently possible, which gives server operators far more power in this conversation than before. A Zero Trust like level of power, in fact.

What does Web-Environment-Integrity allow server operators to do?

As it happens, WEI is a clear example of a technique or standard that needs to have a clear and well thought out answer to the question:

What can a well resourced malicious actor do with this framework? How can they use this safety tool to harm people?

Right now, we don't have those answers. The explainer goes into some detail about how to avoid tracking through WEI techniques, but overlooks the thing that has everyone posting "Google wants to ban ad-blockers" headlines. The WEI proposal allows server operators to prohibit the use of specific browser plugins when accessing the site, which gives ad-networks an in-browser way to say "turn off your ad-blocker in order to see this content."

The well resourced malicious actor here is the internet advertising industry, of which Alphabet (Google's parent company) is one of the biggest members. The proposal writers do not view code injection and people-tracking through advertising to be malicious, they see it as a perfectly legitimate way to pay for services.

"But it's not the server operator doing the banning, it's the attestor; and the attestor has no idea what's on the site!"

The WEI standard involves three parties: the browser making the request, the server hosting the site, and the "attestor" service the server relies on to judge the integrity of the browser making the request. The "Google wants to ban ad-blockers" headline happens when an advertising-driven site uses an attestor service that considers browsers with ad-block plugins to be insecure. Technically it isn't the server making the "no ad-block" constraint, at least at the time of the request. The server operator made that choice when they selected an attestor service that prohibits ad-block plugins.
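
To make the three-party shape concrete, here's roughly what the server side of that conversation could look like. The attestor URL, token handling, and verdict fields are my invention for illustration; the proposal leaves those details to implementations:

  import requests  # pip install requests

  # Hypothetical attestor endpoint; in WEI the server operator picks which
  # attestor(s) to trust, which is where a "no ad-block" policy can sneak in.
  ATTESTOR_VERIFY_URL = "https://attestor.example/verify"

  def handle_page_request(attestation_token: str) -> str:
      # The browser obtained attestation_token from the attestor out of band.
      verdict = requests.post(
          ATTESTOR_VERIFY_URL,
          json={"token": attestation_token},
          timeout=5,
      ).json()

      # The server never inspects the browser itself; it only trusts the verdict.
      if not verdict.get("integrity_ok", False):
          return "403: environment failed integrity check"
      return "200: here is the ad-supported content"

The policy lives entirely in which attestor the operator chooses to trust, which is exactly the deniability described below.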

This sort of deniability is all over the tech industry.

I ran into a pretty common attitude regarding workplace diversity the other day. It was on a Q/A site. Paraphrased, the issue is:

Q: How can we improve the diversity of the candidates we hire?

A: That actually hurts the diversity of your hiring pool, because many people see "diversity!" on a hiring page and immediately go somewhere else. Who wants to be hired to a place that'll give a job to an unqualified minority just to meet some numbers?

The mechanism this answerer was assuming, that diversity programs are quota systems, has been explicitly illegal in the US since the 1980s. The Supreme Court at the time ruled that quotas like that were what this answerer said: racism, even if it was intended to correct for systemic biases. If you find your prospective workplace is using quotas, or explicitly hiring following racialized patterns, you have strong grounds for a lawsuit.

Within the last year, news broke of a company that got into hot water when someone posted "I'm hiring for X, give me all of your non-binary, queer, women, and minority leads please!" to Twitter. The strong implication of this statement was that the company was using a racialized hiring process, which is illegal.

These days hiring pipelines at large US companies are engineered to avoid getting sued, and therefore don't use quotas. To build a hiring pipeline that furthers a company's diversity goals, while also avoiding getting sued, requires several things:

  • You must interview/treat everyone who applies equally.
  • You must assess each application the same way.
  • You must make your final hire/pass decision based on the merits of the application.
  • (US) You must give preference to military veterans, by law.

So far, this is an "equity" argument. But building a system to improve your workplace diversity needs a few more steps.

  • You can change the mix of your applicants by biasing where you advertise the job.
  • You can't hide the job posting and pass out application links to your preferred groups. You still need to post them on your jobs page.
  • Remove biasing language from your posting and job application process.
  • You still have to treat all applicants the same once they've applied.

Equity, in other words.

Furthermore, some companies are beginning to reframe their diversity programs towards a "meet the market" approach. In that they assess their diversity program success based on how well their employee mix matches the potential job market for their roles. If a given position is 82% male in the job market, that's the target they'll push for; not 50%.

Equity, because that's the most legally conservative option if you want to avoid lawsuits for discrimination in US courts.

Mathew Duggan wrote a blog post on June 9th titled "Monitoring is a pain," where he goes into delicious detail about where observability, monitoring, and telemetry go wrong inside organizations. I have a vested interest in this, and still agree. Mathew captures a sentiment I didn't highlight enough in my book: a good observability platform for engineering tends to get enmeshed ever deeper into the whole company's data engineering motion, even though that engineering observability platform isn't resourced enough to really serve that broader goal all that well. This is a subtle point, but absolutely critical for diagnosing criticism of observability platforms.

By not designing your observability platform from the beginning for eventual integration into the overall data motion, you get a lot of hard to reduce misalignment of function, not to mention poorly managed availability assumptions. Did your logging platform hiccup for 20 minutes, thus robbing the business metrics people of 20 minutes of account sign-up/shutdown/churn metrics? All data is noisy, but data engineering folk really like it if the noise is predictable and thus can be modeled out. Did the metrics system have an unexpected reboot, which prevented the daily code-push to production because Delivery Engineering couldn't check their canary deploy metrics? Guess your metrics system is now a production critical system instead of the debugging tool you thought it was.

Data engineering folk like their data to be SQL shaped for a lot of reasons, but few observability and telemetry systems have an SQL interface. Mathew proposed a useful way to provide that:

When you have a log that must be stored for compliance or legal reasons, don't stick it into the same system you use to store every 200 - OK line. Write it to a database (ideally) or an object store outside of the logging pipeline. I've used DynamoDB for this and had it work pretty well by sticking it in an SQS pipeline -> Lambda -> Dynamo. Then your internal application can query this and you don't need to worry about log expiration with DynamoDB TTL.

Dynamo is at least queryable in an SQL-ish way (PartiQL exists these days), so this method could work for doing business metrics things like the backing numbers for computing churn rate and monthly active users (MAU). Or tracking things like password resets, email changes, and gift-card usage for your fraud/abuse department. Or all admin-portal activity for your annual external audits.
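
A minimal sketch of the Lambda in that SQS -> Lambda -> DynamoDB pipeline, assuming each SQS message body is a JSON audit event. The table name, retention period, and key layout are placeholders of mine, not Mathew's actual implementation:

  import json
  import time
  from decimal import Decimal

  import boto3

  # Hypothetical table with DynamoDB TTL configured on the "expires_at" attribute.
  table = boto3.resource("dynamodb").Table("compliance-audit-log")

  RETENTION_SECONDS = 365 * 86400  # keep a year, then let TTL expire the item

  def handler(event, context):
      # Standard SQS -> Lambda event shape: one entry per message in "Records".
      for record in event["Records"]:
          # parse_float=Decimal because DynamoDB rejects Python floats.
          item = json.loads(record["body"], parse_float=Decimal)
          item["expires_at"] = int(time.time()) + RETENTION_SECONDS
          table.put_item(Item=item)  # assumes the event carries the table's key attributes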

Mathew also unintentionally called me out.

Chances are this isn't someones full-time job in your org, they just happened to pick up logging. It's not supposed to be a full-time gig so I totally get it. They installed a few Helm charts, put it behind an OAuth proxy and basically hoped for the best. Instead they get a constant flood of complaints from consumers of the logging system. "Logs are missing, the search doesn't work, my parser doesn't return what I expect".

That's how I got into this. "Go upgrade the Elasticsearch cluster for our logging system" is what started me on the road that led to the Software Telemetry book. It worked for me because I was at a smaller company. Also, our data people stuck their straw directly into the main database, rather than our logging system, which is absolutely a dodged bullet.

Mathew also went into some details around scaling up a metrics system. I spent a chapter and part of an appendix on this very topic. Mathew gives you a solid concrete example of the problem from the point of view of microservices/kubernetes/CNCF architectures. This stuff is frikkin hard, and people want to add things like account tags and commit-hash tags so they can do tight correlation work inside the metrics system; the sort of cardinality problem that many metrics systems aren't designed to support.

All in all, Mathew lays out several common deficiencies in how monitoring/observability/telemetry is approached in many companies, especially growing companies:

  • All too often, what systems are in place are there because someone wanted it there and made it. Which means it was built casually, and probably isn't fit for purpose after other folks realize how useful it is, usage scope creeps, and it becomes mission critical.
  • These systems aren't resourced (in time, money, and people) commensurate with their importance to the organization, suggesting there are some serious misalignments in the org somewhere.
  • "Just pay a SaaS provider" works to a point, but eventually the bill becomes a major point of contention forcing compromise.
  • Getting too good at centralized logging means committing to a resource intensive telemetry style, a habit that's hard to break as you get bigger.
  • No one gets good at tracing if they're using Jaeger. Those that do are an exception. Just SaaS it, sample the heck out of it, and prepare for annual complaining when the bill comes due.

The last point about Jaeger is a key one, though. Organizations green-fielding a new product often go with tracing only as their telemetry style, since it gives so much high quality data. At the same time, tracing is the single most expensive telemetry style. Unlike centralized logging, which has high quality self-hosted systems in the form of ELK and Loki, tracing these days only has Jaeger. There is a whole SaaS industry that sells its products based on how much better they are than Jaeger.

There is a lot in the article I didn't cover, go read it.

The trajectory of "AI" features

I've been working for bay area style SaaS providers for a bit over a decade now, which means I've seen some product cycles come and go. Something struck me today that I want to share, and it relates to watching product decisions get made and adjusting to market pressures.

All the AI-branded features being pushed out by everyone that isn't OpenAI, Google, or Microsoft depend on one of the models from those three companies. That, or they're rebranding existing machine learning features as "AI" to catch the marketing moment. Even so, all the features getting pushed out come from the same base capability of ChatGPT-style large language models:

  • Give it a prompt, it generates text.

Okay, that's the only capability. But this one capability is driving things like:

  • Rewriting your thing to be grammatical for you!
  • Rewriting your thing to be more concise!
  • Suggesting paragraphs for your monthly newsletter!
  • Answering general knowledge questions!

That's about it. We've had about half a year of time to react to ChatGPT and start generating products based on that foundation, and the above is what we're seeing. A lot of these are one to three engineering teams worth of effort over the course of a few months, with a month or three of internal and external testing. These are the basic features of LLM-based machine learning, also known right now as AI.
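
Under the hood, most of those features reduce to wrapping a canned prompt around the user's text. A sketch of the "rewrite to be more concise" feature, assuming the OpenAI Python client; any hosted LLM API has the same shape, and the model name is only an example:

  from openai import OpenAI  # pip install openai

  client = OpenAI()  # reads OPENAI_API_KEY from the environment

  def make_concise(text: str) -> str:
      # The entire "AI feature" is a canned system prompt around the user's text.
      response = client.chat.completions.create(
          model="gpt-3.5-turbo",  # example model name, not an endorsement
          messages=[
              {"role": "system",
               "content": "Rewrite the user's text to be more concise. "
                          "Preserve meaning and tone. Return only the rewrite."},
              {"role": "user", "content": text},
          ],
      )
      return response.choices[0].message.content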

We don't yet know what features we'll have in two years. "Suggest text from a prompt" went from hot new toy to absolute commodity table-stakes in 8 months; I've never seen a capability commoditize that fast. Maybe we've fully explored what these models are capable of, but we sure haven't explored the regulatory impacts yet.

The regulatory impacts are the largest risk to these tools. We've already seen news stories of lawyers using these tools to write briefs with fictional citations, and efforts to create deep fake stories "by" prolific authors. The European Union is already considering regulation to require "cite your sources", which GPT isn't great at. Neither are GPT-based models good about the Right To Be Forgotten enshrined in EU law by GDPR.

The true innovation we'll see in coming years will be in creating models that can comply with source-citing and enable targeted excludes. That's going to take years, and training those models is going to keep GPU makers in profit margins.

Jeff Martins at The New Stack had a beautiful take-down of the SaaS provider practice of not sharing internal status, and how that affects downstream reliability programs. Jeff is absolutely right: each SaaS provider (or IaaS provider) you put in your critical path decreases the absolute maximum availability your system can provide. This also isn't helped by SaaS providers using manual processes to update status pages. We would all provide better services to customers if we shared status with each other in a timely way.

Spot on. I've felt this frustration too. I've been in the after-action review following an AWS incident when an engineering manager asked the question:

Can we set up alerting for when AWS updates their status page?

As a way to improve our speed of response to AWS issues. We had to politely let that manager down by saying we'd know there was a problem before AWS updates their page, which is entirely true. Status pages shouldn't be used for real-time alerting. This lesson was further hammered home after we gained access to AWS Technical Account Managers, and started getting the status page updates delivered 10 or so minutes early, directly from the TAMs themselves. Status Pages are corporate communications, not technical.
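
For what it's worth, the mechanics of the manager's ask are trivial, which is part of why the question keeps coming up; the problem is the lag, not the plumbing. A sketch that polls a status-page RSS feed (AWS has historically published per-service feeds; this URL and the polling interval are illustrative):

  import time

  import feedparser  # pip install feedparser

  # Illustrative feed URL, not a recommendation.
  FEED_URL = "https://status.aws.amazon.com/rss/ec2-us-east-1.rss"

  seen = set()
  while True:
      for entry in feedparser.parse(FEED_URL).entries:
          key = entry.get("id") or entry.get("link")
          if key not in seen:
              seen.add(key)
              # In real life this would page someone; by the time it fires, your own
              # instrumentation has probably been alerting for ten minutes already.
              print("Status update:", entry.get("title", "(no title)"))
      time.sleep(300)  # poll every five minutes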

That's where the problem is, status pages are corporate communication. In a system where admitting fault opens you up to expensive legal actions, corporate communication programs will optimize for not admitting fault. For Status Pages, it means they only get an outage notice after it is unavoidably obvious that an outage is already happening. Status Pages are strictly reactive. For a true real time alerting system, you need to tolerate the occasional false positive; for a corporate communication platform designed to admit fault, false positives must be avoided at all costs.

How do we fix this? How do we, a SaaS provider, enable our downstream reliability programs to get the signal they need to react to our state?

There aren't any good ways, only hard or dodgy ones.

The first and best way is to do away with US style adversarial capitalism, which reduces the risks of saying "we're currently fucking up." The money behind the tech industry won't let this happen, though.

The next best way is to provide an alert stream to customers, so long as they are willing to sign some form of "If we give this to you, then you agree to not sue us for what it tells you" amendment to the usage contract. Even that is risky, because some rights can't be waived that way.

What's left is what AWS did for us: have our Technical Account Managers or Customer Success Managers hand-notify customers that there is a problem. This takes the fault admission out of the public space, and limits it to customers who are giving us enough money to have a TAM/CSM. This is the opposite of real time, but at least you get notices faster? It's still not equivalent to instrumenting your own infrastructure for alerts, and is mostly useful for writing the after-action report.

Yeah, the SaaSification of systems is introducing certain communitarian drives; but the VC-backed capitalistic system we're in prevents us from really doing this well.

For many reasons, discussions at work have turned to recovering/handling layoffs. A coworker asked a good question:

Have you ever seen a company culture recover after things went bad?

I have to say no, I haven't. To unpack this a bit, what my coworker was referring to is something loooong time readers of my blog have seen me talk about before, budget problems. What happens to a culture when the money isn't there, and then management starts cutting jobs?

In the for-profit sector of American companies, you get a heck of a lot of trauma from the sudden-death style layoffs (or, to be technical, reductions in force, because there is no call-back mechanism in the SaaS industry). Sudden loss of coworkers with zero notice, scrambling to cover for their work, scrambling to deal with the flash reorganization that almost always comes with a layoff, scrambling to worry about whether you'll be next: it all adds up to a lot of insecurity in the workplace. Or a lack of psychological safety, and I've talked about what happens then.

This coworker was also asking about some of the other side effects of a chronic budget crunch. For publicly traded companies, you can enter this budget crunch well before actual cashflow is a problem due to the forcing effects of the stock market. Big Tech is doing a lot of layoffs right now, but nearly every one of the majors is still profitable, just not profitable enough for the stock market. Living inside one of these companies means dealing with layoffs, but also things like:

  • Travel budget getting hard to approve, as they crank down spend.
  • Back-filling departures takes much longer, if you get it at all.
  • Not getting approval to hire people to fill needs.
  • Innovation pressure: have to hit bolder targets with the same resources, and the same overhead.
  • Adding a new SaaS vendor becomes nearly impossible.
  • Changing the performance review framework to emphasize "business value" over "growth".
  • Reduced support for employee groups, like those supporting disabled and racial-minority employees.
  • Reorgs. So many reorgs.
  • Hard turns on long term strategy.

If a company hasn't had to live through this before, the culture doesn't yet bear the scars. The employees certainly do, especially those of us who were in the market in 2007-2011. Many of us left companies to get away from this kind of thing. That said, the last major down period in tech was over ten years ago; there are a lot of people in the market right now who've never seen it get actually bad for a long period of time. "Crypto Winter," the fall of many crypto-currency based companies, was as close as we've gotten as an industry.

When the above trail of suck starts happening in a company that hasn't had to live through it before, it leaves scar tissue. The scars are represented by the old-timers, who remember what upper management is like when the financial headwinds are blowing strong enough. The old salts never regain their trust of upper management, because they know first-hand that upper management are face-eating leopards.

Even if upper management turns the milk and honey taps on, brings out the bread and circuses, and undoes all the list-of-suck from above, the old-timers (which means the people in middle management or senior IC roles) will still remember upper management is capable of much worse. That changes a culture. As I talked about before, trust is never full again, it's always conditional.

So no, it'll never go back to the way it was before. New people may think so, but the old timers will know otherwise.