Recently in opinion Category

Getting the world off Chrome

I'm seeing more and more folk post, "we got out of IE, we can get out of Chrome," on social media, referencing the nigh-monopoly Chrome has on the browsing experience. Unless you're using Firefox or Safari, you're using Chrome or a Chromium-derived browser. For those of you too young to remember what internet life under Internet Explorer was like, here is a short list of why it was not great:

  • Once Microsoft got the browser-share lock-in, it largely stopped innovating the browser. It had conquered the market, so it could pull back investment in it.
  • IE didn't follow standards. But then, Microsoft was famous for "embrace and extend," where it adopted (mostly) a standard, then Microsoftified it enough that no one considered using the non-MS version of the standard.
  • If you were on a desktop platform that didn't have IE, such as Apple Macintosh, you were kinda screwed.

Google Chrome took over from IE for three big reasons:

  • They actually were standards compliant, more so than the other alt-browsers (Mozilla's browsers, Opera, and Safari)
  • They actually were trying to innovate in the browser
  • Most important: they were a megacorp with a good reputation who wanted everyone to use their browser. Mozilla and Opera were too small for that, and Apple never has been all that comfortable supporting non-Apple platforms. In classic dot-com era thinking, Google saw a dominant market player grow complacent and smelled a business opportunity.

This made Chrome far easier to develop for, and Chrome grew a reputation for being a web developer's browser. This fit nicely into Google's plan for the future, which they saw as full of web applications. Google understands what they have, and how they got there. They also understand "embrace and extend," but found a way to do that without making it proprietary the way Microsoft did: capture the standards committees.

If you capture the standards committees, meaning what you want is almost guaranteed a rubber stamp from the committee, then you get to define what the industry standard is. Microsoft took a capitalist, closed-source approach to embrace and extend, where the end state was one in which the only viable way to do a thing was the thing patent-locked into Microsoft. Google's approach is more overtly FOSSy, in that they're attempting to get internet consensus for their changes, while also making it rather harder for anyone else to do what they do.

Google doesn't always win. Their "web environment integrity" proposal, which would have given web site operators far greater control over browser extensions like ad-blockers, quietly got canned recently after internet outrage. Another area that got a lot of pushback from the internet was Chrome's move away from "v2 manifest" extensions, which include ad-blockers, in favor of "v3 manifest" extensions, which made ad-blockers nearly impossible to write. The move from v2 to v3 was delayed a lot while Google grudgingly put in abilities for ad-blockers to continue working.

Getting off of Chrome

The circumstances that drove the world off of Internet Explorer aren't there for Chrome.

  • Chrome innovates constantly and in generally user-improving ways (so long as that improvement doesn't majorly affect ad-revenue)
  • Chrome listens, to a point, to outrage when decisions are made
  • Chrome is functionally setting web standards, but doing so through official channels with RFCs, question periods, and all that ritual
  • Chrome continues to consider web-developer experience to be a number one priority
  • Alphabet, Google's parent company, fully understands what happens when the dominant player grows complacent: it gets replaced, the way Google replaced Microsoft in the browser wars.

One thing has changed since the great IE-to-Chrome migration began: Google lost its positive reputation. The old "don't be evil" thing was abandoned a long time ago, and everyone knows it. Changes proposed by Google or Google proxies are now viewed skeptically, though overtly bad ideas still require internet outrage to delay or prevent a proposal from happening.

That said, you lose monopolies through either laziness of the monopolist (Microsoft) or regulatory action, and I'm not seeing any signs of laziness.

Every time the topic of documentation comes up at work, at multiple workplaces, someone always says a variant of the following:

What we really need is markdown in a git repository. We get version control, there is a lot of tooling to make markdown work good in git, it's great

And every time I have to grit my teeth and hope I don't cause dental damage. My core complaint is that internal documentation has fundamentally different objectives than open source software documentation repositories, and pretending they're the same problem domain means we'll be re-having the documentation discussion in 18 to 24 months.

The examples of OSS projects using markdown or asciidoc as their documentation repository are many, and it works pretty well. Markdown and asciidoc are markup languages, which means compilers can turn the marked-up docs into rendered sites. This makes accepting contributions from the community much easier, because it follows the same merge-request workflow as code. As most OSS projects are chronically under-staffed, anything that allows reuse of process is a win. Also, markdown and asciidoc are relatively simple formats, so you don't need expensive software like Adobe InDesign to make them.
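
To make the mechanics concrete, here is a minimal sketch of that render step in Python, assuming the third-party Python-Markdown package and hypothetical docs/ and site/ directory names; a real project would typically run something like this (or a tool such as mkdocs or Sphinx) in CI so every merged request republishes the site.

# Render every markdown file in the repo into a static HTML site.
# Assumes: pip install markdown; docs/ and site/ are illustrative paths.
import pathlib
import markdown

SRC = pathlib.Path("docs")   # marked-up sources, version controlled in git
OUT = pathlib.Path("site")   # rendered HTML, published wherever you like
OUT.mkdir(exist_ok=True)

for doc in sorted(SRC.glob("*.md")):
    html = markdown.markdown(doc.read_text())
    (OUT / doc.with_suffix(".html").name).write_text(html)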

OSS project docs are focused on several jobs to be done, and questions by readers:

  • How to install the thing
  • How to configure the thing
  • How to upgrade the thing
  • How to build various workflows the thing allows you to do
  • Troubleshooting tips for the thing
  • How often to expect releases of the thing
  • How to integrate with other things, if this thing allows integration
  • How to use the thing's API
  • Where to find the thing's SDK for various languages

Corporate internal documentation repositories need to do all of the above, but generally for a much wider range of things and services. Cool, that's what standards are for. But "markdown in a git repo" goes a bit off the rails when you look at all the other types of documentation internal docs often cover:

  • On-call rotation standards and contacts
  • Pager-playbooks for the page-out alarms
  • Incident Management program procedures and definitions
  • Post incident review documents for each incident
  • Service maturity standards for being allowed in prod
  • Ownership documentation linking services to individual teams (updated or re-created after each reorg)
  • Decision docs for implementing features or updating process
  • Roadmap documentation going out three years (new docs generated quarterly)
  • How to set up your development environment
  • How to access prod, and who is allowed to access prod
  • Protocols for accessing the datacenter hardware or cloud config consoles
  • The entire software development lifecycle (SDLC) including how CI works, what tests are required when, how tests are selected for inclusion, which linters are included, and when it's allowed to ignore all that because of an emergency

And so on. The sneaky part here is that the OSS projects have many of the above as well, but they're kept in things like Google Docs, Etherpads, wikis, Twilio, and Canvases in Slack: places that definitely do not involve the merge-request workflow into git. All of these extra internal-documentation jobs to be done greatly complicate what solutions count as viable, in large part because this huge list is actually trying to 'simplify' multiple documentation styles into a single monolithic document repository. What styles are these? Well:

  • Product documentation, describing how to install, configure, and maintain the product.
  • Process documentation, describing the ways various people-driven procedures are done, such as the incident management process and the number of review meetings that need to be held before a feature is released to production.
  • Decision documentation, which evolves over time as people work through what an ultimate decision will look like, changing their minds along the way. Post-incident review docs are of this type.
  • Responder runbooks, used by people responding to incidents to use pre-defined (and risk vetted) procedures as part of incident response.
  • Maintenance runbooks, used by operators of the system to do various things, which is often based on a combination of product and process documentation, to create a grand unified procedure in one document.

All of these documentation styles need somewhat different document lifecycles, which in turn drives the need to support different workflows. A document lifecycle ensures that documentation is valid and up to date, and that old information is removed. Sometimes documentation is a key part of compliance with regulation or industry standard-setting bodies, which adds review steps.

  • Product documentation probably needs multi-step reviews to ensure updates are valid. Confluence is terrible for this, git is less bad. Product docs also need regular review for freshness, and pruning of no longer relevant docs.
  • Process documentation less obviously needs multi-step review. Some will, some won't. Freshness is key, since process documentation describes the how of operating the system or accessing human processes, and old docs pollute search results.
  • Decision documentation definitely does not need multi-step review, it needs to be updated by anyone involved, and may be surplus to requirements once the feature is built. In fact, these docs need to allow collaborative editing, like Etherpad or Google Docs, making them fundamentally unsuited for a git-based workflow. However, having such docs still around is occasionally useful later in time when someone tries to figure out "who thought this was a good idea, and why didn't they consider this obvious failure case?"
  • Responder runbooks also can have compliance interactions; if so, these need multi-step review for risk-management decisions. If not, they're probably a per-team free-for-all. As is the way of responder runbooks, the ones for rare errors are nigh impossible to check for freshness, so these are the least likely to be verifiably up to date.
  • Maintenance runbooks run the gamut from per-team free-for-all to onerous multi-step review process, all depending on the risks of doing the thing and the nature of the business.

Ideally, the high lifecycle docs like product and process documentation would be in one system, with the minimal lifecycle docs like decision review and responder runbooks in another system entirely. This would allow each system to cater to the needs of the styles within, and solve more of the business' problems. I would like a two-system solution very much.

Except.

People have spent the last 25 years being trained that how you find documentation is:

  1. Look in the obvious place. If you don't find it....
  2. Search google. If that doesn't work, retry your terms. If after three tries you still haven't found it....
  3. Complain on social media.

A two doc-system solution is not well tolerated, and people will build a "universal search" engine to search both the high- and low-process repositories. Also, two doc systems seems like a lot of overhead. And how do you make sure the right docs go in the right system? Why not use one doc system that's sort of okay at both jobs and save money? 18 to 24 months later, discontent at how bad the "sort of okay" solution is rises, and people advocate moving to a new thing, and suggest markdown in a git repo.

I've been in Firefox a long time

I intended to write a "history of my browser usage" post as part of a longer piece on the Chrome monoculture, but this blog is nearly 20 years old and it turns out I already did a history.

I can't pin down exactly when I permanently dropped SeaMonkey, but the stale profile directory with a date stamp says late 2013, when it was clearly abandonware and I learned you actually could launch Firefox in parallel with multiple profiles (the

firefox -p --no-remote

combination was key). I stopped using multiple profiles when the Containers extension came out, which did nearly everything separate profiles did. It turns out SeaMonkey is still getting updates, but it seems to be tracking the Firefox and Thunderbird Extended Support Releases.

For those of you too young to remember the original Netscape Navigator, it also came with a few tools beyond the browser:

  • The browser, of course
  • An email client, since this was before Gmail and web clients for email weren't really a thing yet
  • An HTML editor (for pre-CSS versions of HTML)

The reason I liked SeaMonkey and Opera is that they both still shipped with an email client. It was pretty nice, actually. I kept Opera around as my email client way past when I stopped using it for general browsing. I'm fuzzy on what I did after Opera dropped their mail client; I may have grumpily transitioned onto Gnome Evolution at that point. Also, Gmail was out and I was quite used to web-based email clients.

So yeah, I've been in Firefox for over a decade at this point.

The trajectory of "AI" features

I've been working for Bay Area-style SaaS providers for a bit over a decade now, which means I've seen some product cycles come and go. Something struck me today that I want to share, and it relates to watching product decisions get made and adjusted to market pressures.

All the AI-branded features being pushed out by everyone that isn't OpenAI, Google, or Microsoft depend on one of the models from those three companies. That, or they're rebranding existing machine-learning features as "AI" to catch the marketing moment. Even so, all the features getting pushed out come from the same base capability of ChatGPT-style large language models:

  • Give it a prompt, it generates text.

Okay, that's the only capability. But this one capability is driving things like:

  • Rewriting your thing to be grammatical for you!
  • Rewriting your thing to be more concise!
  • Suggesting paragraphs for your monthly newsletter!
  • Answering general knowledge questions!

That's about it. We've had about half a year to react to ChatGPT and start generating products on that foundation, and the above is what we're seeing. A lot of these are one to three engineering teams' worth of effort over the course of a few months, with a month or three of internal and external testing. These are the basic features of LLM-based machine learning, also known right now as AI.
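
To make that concrete, here is a minimal sketch of how such features get built, assuming a hypothetical complete() helper standing in for whichever hosted model the product actually calls; the feature names and prompt wording are illustrative, not anyone's real implementation.

def complete(prompt: str) -> str:
    """Placeholder for a call to a hosted large language model (vendor API)."""
    raise NotImplementedError("wire this to your vendor's API")

def fix_grammar(text: str) -> str:
    # "Rewrite your thing to be grammatical" is just a prompt template.
    return complete("Rewrite the following so it is grammatical, changing nothing else:\n\n" + text)

def make_concise(text: str) -> str:
    # So is "rewrite your thing to be more concise".
    return complete("Rewrite the following to be more concise:\n\n" + text)

def suggest_newsletter_paragraph(topic: str) -> str:
    # And "suggest paragraphs for your monthly newsletter".
    return complete("Write one short, upbeat newsletter paragraph about " + topic + ".")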

We don't yet know what features we'll have in two years. "Suggest text from a prompt" went from hot new toy to absolute commodity table-stakes in 8 months; I've never seen a capability commoditize that fast before. Maybe we've fully explored what these models are capable of, but we sure haven't explored the regulatory impacts yet.

The regulatory impacts are the largest risk to these tools. We've already seen news stories of lawyers using these tools to write briefs with fictional citations, and efforts to create deep-fake stories "by" prolific authors. The European Union is already considering regulation to require "cite your sources," which GPT isn't great at. Nor are GPT-based models good at honoring the Right To Be Forgotten enshrined in EU law by the GDPR.

The true innovation we'll see in coming years will be in creating models that can comply with source-citing and enable targeted exclusions. That's going to take years, and training those models is going to keep GPU makers in profit margins.

...and why this is different than blockchain/cryptocurrency/web3.

Unlike the earlier crazes, AI is obviously useful to the layperson. ChatGPT finished what tools like Midjourney started, and made the average person in front of a browser go, "oh, I get it now." That is something blockchain, cryptocurrencies, and Web3 never managed. The older fads were cool to tech nerds and finance people, but not to the average 20-year-old trying to make ends meet through three gig-economy jobs (except as a get-rich-quick scheme).

Disclaimer: This post is all about the emotional journey of AI-tech, and isn't diving into the ethics. We are in late-stage capitalism; ethics is imposed on a technology well after it has been on the market. For a more technical take-down of generative AI, read my post from April titled "Cognitive biases and LLM/AI". ChatGPT-like technologies are exploiting human cognitive biases baked into our very genome.

For those who have avoided it, the art of marketing is all about emotional manipulation. What emotions do your brand colors evoke? What keywords inspire feelings of trust and confidence? The answers to these questions are why every 'security' page on a SaaS product's site has the phrase "bank-like security" on it: banks evoke feelings of safe stewardship and security. This is relevant to the AI gold rush because before Midjourney and ChatGPT, AI was perceived as "fancy recommendation algorithms" like those found on Amazon and the old Twitter "for you" timeline; after Midjourney and ChatGPT, AI became "the thing that can turn my broken English into fluent English," which is much more interesting.

The perception change caused by Midjourney and ChatGPT is why you see every tech company everywhere trying to slather AI on their offerings. People see AI as useful now, and all these tech companies want to be seen as selling the most useful thing on the market. If you don't have AI, you're not useful; companies that are not useful won't grow; and tech companies that aren't growing are bad tech companies. QED, late-stage capitalism strikes again.

It's just a fad

Probably not. This phase of the hype cycle is a fad, but we've reached the point where, if you have a content database 10% the size of the internet, you can algorithmically generate human-seeming text (or audio, or video) without paying a human to do it. That isn't going to change when the hype fades; the tech is here already and will continue to improve so long as it isn't regulated into the grave. This tech is an existential threat to the content-creation business, which includes such fun people as:

  • People who write news articles
  • People who write editorials
  • People who write fiction
  • People who answer questions for others on the internet
  • People who write HOW TO articles
  • People who write blog posts (hello there)
  • People who do voice-over work
  • People who create bed-track music for podcasts
  • People who create image libraries (think Getty Images)
  • People who create cover art for books
  • People who create fan art for commission

The list goes on. The impact here will be similar to how streaming services affected musician and actor income streams: profound.

AI is going to fundamentally change the game for a number of industries. It may be a fad, but for people working in the affected industries this fad is changing the nature of work. I still say AI itself isn't the fad, the fad is all the starry-eyed possibilities people dream of using AI for.

It's a bullshit generator, it's not real

Doesn't matter. AI is right often enough to fit squarely into the human cognitive bias toward trusting it. Not all engines are the same; Google Bard and Microsoft Bing have some famous failures here, but this will change over the next two years. AI answers are right often enough, and helpful often enough, that such answers are worth looking into. Again, I refer you to my post from April titled "Cognitive biases and LLM/AI".

Today (May 1, 2023) ChatGPT is the Apple iPhone to Microsoft and Google's feature-phones. Everyone knows what happened when Apple created the smartphone market, and the money doesn't want to be on the not-Apple side of that event. You're going to see extreme innovation in this space to try to knock ChatGPT off its perch (being the first mover is no guarantee of being the best mover), and the success metric is going to be "doesn't smell like bullshit."

Note: "Doesn't smell like bullshit," not, "is not bullshit". Key, key difference.

Generative AI is based on theft

This sentiment is based on the training sets used for these learning models, and also on a liberal interpretation of copyright fair use. Content creators are beginning to create content under new licenses that specifically exclude use in training-sets. To my knowledge, these licenses have yet to be tested in court.

That said, this complaint about theft is the biggest threat to the AI gold rush. People don't like thieves, and if AI gets a consensus definition of thievery, trust will drop. Companies following an AI-at-all-costs playbook to try to not get left behind will have to pay close attention to user perceptions of thievery. Companies with vast troves of user-generated data that already have a reputation for remixing it, such as Facebook and Google, will have an easier time of this because users already expect such behavior from them (even if they disapprove of it). Companies with high trust for being safe guardians of user-created data will have a much harder time unless they're clear from the outset about the role of user-created data in training models.

The perception of thievery is the thing most likely to halt the fad-period of AI, not being a bullshit generator.

Any company that ships AI features is losing my business

The fad phase of AI means just about everyone will be doing it, so you're going to have some hard choices to make. The people who can stick to this are the kind of people who are already self-hosting a bunch of things and are fine with adding a few more. For the rest of us, there are harm-reduction techniques like using zero-knowledge encryption for whatever service we use for file-sync and email. That said, even the hold-out companies may reach for AI if it looks to have real legs in the marketplace.


Yeah. Like it or not, AI development is going to dominate the next few years of big-tech innovation.

I wrote this because I keep having this conversation with people, and this makes a handy place to point folk at.

Working for face-eating leopards

There is a meme that started somewhere during the post-Brexit period when the reality of what the United Kingdom's departure from the European Union really meant started hitting home. It got picked up by people in the US to talk about voter regret for electing Trump. It goes like this:

"I didn't think the leopard would eat my face," said the woman who voted for the Let Leopards Eat Faces party.

https://knowyourmeme.com/memes/leopards-eating-peoples-faces-party

I'm thinking about this right now with regard to the reductions in force (RIFs) happening in US-based big tech. We don't know for sure why this is happening, but there are several competing (and probably overlapping) theories:

  • A major hedge fund is pushing the majors to cut staffing in a cynical bid to reduce salary inflation in the industry, and make their job of investing in growth companies easier by reducing their payroll expenses.
  • Inflation-related softening in consumer and business consumption is hitting the growth percentages of these companies hard enough that they have to make up the gap in profitability, which means cuts.
  • Seeing the majors cut staffing means you can avoid a public-relations hit by me-too-ing your own reduction in force, in a cynical bid to reduce salary inflation and make hiring less expensive over the next year or so.
  • Seeing several of the peer companies you picked for your Radford salary survey do RIFs, you cut some high-cost staff to further reduce salary inflation and make hiring less expensive over the next year or so.

You might notice a theme here, which is what reminded me of the face eating leopard parable. There is a piece of advice I tell people at work that relates to this:

(DayJob) is a publicly traded SaaS company in the United States, forget this at your peril.

This is a nonspecific warning, but I mean it in the sense that we all are still working for a face-eating leopard; just one that's more domesticated than many of the others. When push comes to shove, it'll still eat faces. Yes, the benefits are pretty good and they haven't done a RIF yet; but do not mistake this for a sign that they will not ever perform a RIF. In case you need it specified, the theme in the above list is "reduce the pace of salary inflation and make hiring less expensive over the next year or so," which can be accomplished several ways, one of which is a RIF.

There are other ways to reduce the pace of salary inflation and make hiring cheaper:

  • Stop focus-hiring in "high cost metros" like San Francisco, New York, and London and instead focus hiring in cheaper metros like Atlanta and Dublin. This is great for two reasons. First, it makes the base-salary of most of your new talent rather lower than the base salary of the high cost metro talent; second, salary inflation is compressed in these lesser metros so your talent stays cheaper. The pandemic made this option more palatable due to the move to remote working styles.
  • Start hiring in friendly foreign markets that are cheaper than US markets. This takes significant investment, but is a move well known to the tech industry: off-shoring. Eastern Europe is +8 hours from San Francisco much of the year, which makes it a great place to start adding talent to work towards "follow the sun" support of your systems. Eastern Europe's cost of living is relatively low, which means the talent comes cheaper. -8 hours from SF is the middle of the Pacific (it's a big ocean), which is part of why you see Australian and Chinese centers instead. Some opt to go the India route and deal with a 10 or 14 hour time difference.
  • Close foreign development centers to bring more of the workforce into the friendlier US labor relations regime. Lots of European companies have mandatory notice periods for layoffs and RIFs, which really slows down how fast you can cut costs when you need to. "At-will" employment laws in the US means you can almost always do a same-day termination. This option is best for companies who did the previous point during past contractions.
  • Move more work to contractors. This moves the benefits problem to another company, and if you need to cut contractors that's rather cheaper. Microsoft famously got in trouble for this one and ended up under a court judgement to offer full time benefits for contractors serving longer than 18 months. Which meant contractors never served more than 18 months, ever.

All four of these moves are things you do when you have some warning that winds are shifting, or you already know you need to add percentages to your profitability line-items in your quarterly/annual reports. A domesticated leopard will do more of the above slow-shifting before biting faces off through a RIF. That leopard will still bite faces off if the above doesn't move the profitability needle fast enough.


What does this mean for those of us working for face-eating leopards?

First and foremost, it means being defensive with your finances. If you are working for this particular variety of face-eating leopard (publicly traded US tech company compensating in part through stocks) then you are probably in the top 5% of US income. Before I got a job with one of these leopards I didn't understand how friends in the industry could say things like:

I left Company X today. I plan to take a few months off before seriously looking for my next thing.

Who has that kind of money lying around? I sure didn't. Then I started working for one of them and got stock compensation, and then I understood. Those Restricted Stock Units meant quarterly infusions of cash-equivalents I had to do something with, so I saved them. It turns out a lot of us do that, since the savings rate of the top 1% of the US income list is over 30%. That gives us a lot of leeway to build up an emergency fund, which we all still need to do. If you're living hand to mouth, which is actually easy when your rent is $4000/mo, then you're at risk of having a really bad time if it's your turn to have your face eaten.

Second, if you're living in a high cost metro and are working for a company with a sizable remote workforce, you are at elevated risk of getting your face eaten and having them repost your job somewhere like New Orleans. Being more at risk means you need to be more diligent about making sure your finances can survive 5 months of unemployment. The US technical job market is getting realigned now that the money has figured out you can have a successful business by hiring outside the major metros. More and more, when companies are facing equally qualified candidates in New York City and Cincinnati, they pick the Cinci candidate because they're cheaper.

Third, and much less helpful: work towards changing the US labor market (or relocating to a labor market where employees are treated better by policy) to make face-eating harder to do. A majority of the European Union has laws on the books requiring a notice period for reductions of any kind; same-day terminations are rare and shocking. By proxy, a lot of the places colonized by EU members have similar protections. Getting three weeks' warning that you will be out of a job means you can say goodbye, work on a transition plan, and otherwise have time to mourn. It still sucks to lose your job, but it'll hurt less.

Allowing 'root cause analysis'

"Root cause analysis" as a term invokes a strong response from the (software) reliability industry. The most common complaint:

There's no such thing as a "root" cause. They're always complex failures. There's this nice book about complex failures and the Three Mile Island incident I recommend you read.

This is the correct take. In any software system, even software+hardware ones, what triggered the incident is almost never the sole and complete cause of it. An effective technique for figuring out the causal chain is the "five whys" method. To take a fictional example:

  • Why did Icarus die?
    • Because he flew too close to the sun.
  • Why did he fly too close to the sun?
    • Because of the hubris of man.
  • Why was Icarus full of hubris?
    • Because his low altitude tests passed, and he made incorrect assumptions.
  • Why did the low altitude tests pass, but not the high altitude ones?
    • Because the material used to attach the feathers became more pliable the closer to the sun he was.
  • What happened after the material became more pliable?
    • It lost feathers, which compromised the flight profile, leading to a fatal incident.

From this exercise we can determine the causal chain:

Structural integrity monitoring was not included in the testing regime, which led to a failure to detect reduced binding efficiency at higher ambient temperatures. The decision to move to high-altitude testing was made in the absence of a data-driven testing framework. The combination of the lack of data-driven testing and the lack of comprehensive materials surveillance allowed the lead investigator to make a fatal decision.

This is a bit more comprehensive than the parable's typical 'hubris of man' moral. There are whole books written about building a post-incident review process for software systems, with the goal of maximizing the learning earned from the incident. There are no root causes, only complex failures; and you reason about complex failures differently than you would when hunting for a lone root cause.

Except.

Except.

The phrase 'root cause analysis' is freaking everywhere, and this is in spite of a decade of SREs pushing against the term. There are a few reasons for this, but to start the explanation here is another example from my history. My current manager knows better than to call incident reviews a "root cause analysis." Yet, when a vendor of ours shits the bed fantastically enough that we get in trouble with our own customers, my manager is the first to press our account managers for an RCA Report. Why?

Because an RCA Report is also a customer-relations tool. My manager is code-switching between our internal, engineer-driven incident review processes, which don't use the term, and the customer-relations concept, which manifestly does use it. Not at all coincidentally, other SREs grind their teeth any time a customer asks for an RCA Report, because what we do isn't Root Cause Analysis.

Aside: For all that we as an SRE community focus on availability and build customer-centered metrics to base our SLOs on, SRE as a job function is often highly disconnected from the actual people-to-people interface with customers. Some companies will allow a senior SRE onto a customer call to better explain a failure chain, but my understanding is this practice is rare; most companies are more concerned that the senior SRE will over-share in some way that will compromise the company's liability stances.

At the end of the day, customers want answers to three questions about the incident they're concerned over, all so they can reassess the risk of continuing to do business with us:

  1. What happened?
  2. What did you do to fix it?
  3. What are you doing to prevent this happening again?

"What happened?" isn't supposed to be a single causative action; customers want to know if we understand the causal chain. 'Root cause' in this context is less a technical term meaning 'single' and more a term of art meaning 'failure'.

The other reason that 'RCA' shows up as often as it does is that the term itself shows up in general safety-engineering literature. DayJob has had a few availability incidents lately; after one of them a customer asked for a type of report I'd never heard of before: a CAPA report. I had to google that one. CAPA means corrective and preventive actions, also known as questions 2 and 3 above. My industry has been building blameless post-mortem processes for a decade plus now, and never used CAPA. The concept was instantly familiar, even if I hadn't heard the acronym before.

I found a blog post from a firm specializing in safety inside the beverage industry that describes how an RCA interacts with a CAPA. The beverage industry operates machine plants with bottle fillers and everything else involved in food handling. The software industry, um, doesn't (usually). But because beverage manufacturing and software manufacturing are both industrial processes, the same concepts apply to both. If you read into what an RCA is for them, it reads a lot like a complex failure report.

This led me to a realization: "Root cause analysis" is a term of art, not a technical term.

Engineers look at that phrase and cringe, because what it says is not what it means, and we find that kind of ambiguity to be a bug. This is probably why we're not allowed near customers unless we have close supervision or experience in customer-facing technical writing.

Nowadays I'm hearing internal folk decry "root cause analysis" as the wrong way to think about problems, and I nod and tell them they're right. While also telling them that we'll continue to use that term with customers because that's what customers are asking for, and we'll write those RCA reports like the complex failure analyses they are. We'll even give them a CAPA report and not call it a CAPA (unless they ask for a CAPA by name).

24/7 availability and oncall

There is another meme going around OpsTwitter the past few days. It's a familiar refrain in discussions about on-call and quality of life, but the essence is:

If you need 24/7 availability, you also need follow-the-sun support. That way any crisis is in someone's day-time, regular-work day.

I agree: this is the standard you need to judge your solution against. However, this solution has some assumptions baked into it. Here are a few:

  • You have three teams, each operating 8 timezones from its neighbors (or two teams spanning 12)
  • No one set of employment laws spans 24 timezones, so these teams will each be under different labor and national holiday laws.
  • Each timezone needs an on-call rotation.
  • The minimum viable on-call rotation per timezone is 3 people, but 6 is far more friendly to the people supporting the site.
  • For staffing reasons, your global on-call team needs 9 to 18 people on it (or 6 to 12 for a 12-timezone spread); see the sketch after this list for the arithmetic.
  • Due to the timezone spread, each team will have minimal coordination with the others. What coordination there is will involve one team being on a video-call at o-dark-thirty.
  • You need enough work to keep 9 to 18 people busy in addition to their fire-watch duties.
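
As promised above, a back-of-the-envelope sketch of that staffing arithmetic, assuming a weekly rotation and the team sizes from the list; the numbers are illustrative, not a staffing model.

# On-call staffing math: 3 regions, weekly rotation.
REGIONS = 3

for per_region in (3, 6):              # minimum viable vs. people-friendly team size
    total = REGIONS * per_region       # size of the global on-call roster
    weeks_per_year = 52 / per_region   # how often each person carries the pager
    print(f"{per_region} per region -> {total} people total, "
          f"~{weeks_per_year:.0f} on-call weeks per person per year")

# 3 per region -> 9 people total, ~17 on-call weeks per person per year
# 6 per region -> 18 people total, ~9 on-call weeks per person per year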

You know who can pull that off? Really big companies.

You know who can't pull that off? Companies employing in a single labor market, such as the US.

I mean, Guam is a US holding (UTC+10). Theoretically if you had a team in Guam and a team in New York City (UTC-4) you would have a 10 hour difference between them. You could sort of make this work while staying inside the US tax and legal domains, but you're reliant on the technical base of people in Guam which has a population a bit smaller than Des Moines, Iowa. Colonialism means people will think about hiring in Ireland or India before Guam. To do this you need to go international.

Most smaller companies won't go international, way too much paperwork involved at a time when you're supposed to be lean and fast.

I have worked with follow-the-sun exactly once in my career. We had Ops teams in the US East Coast, Poland, and China. It wasn't a true 8/8/8 split, but it was enough of a split that "after hours maintenance" always happened in someone's daytime. It was pretty dang nice. Then we had a layoff round and the Poland office went away. And we fired our Chinese Ops folk to save money, which meant we were waking the US staff up at o-dark-thirty to do maintenance.


I'm conflicted on this advice. On the surface, I totally get the sentiment: keep the annoying shit in everyone's daytime and don't force people to work overnights.

As an industry, we have history with split-shifting and incident response. The night operator used to be a common feature of any company with a computer, the person (or team of people) responsible for loading/unloading tapes, swapping paper for the printer, collating and packaging print-jobs, making sure the batch-jobs ran, calling the SYSOP when things smelled off, and a bunch of other now-forgotten tasks. Most organizations have gotten rid of the night operator for a lot of reasons. The two biggest being:

  1. We've mostly automated the job out of existence. Tapes (if tapes are still in use) are handled by robots. Print-jobs now show up as a PDF in your email. Batch-schedulers are really fancy now, so getting those batch-jobs run is highly automated. Monitoring systems keep track of way more things than we could track in the night operator era.
  2. No one wants to work overnights. Like, no one. At least not enough to easily find a replacement when the one person who does like it decides to leave/retire.

(The second point hit WWU while I was there)

As an industry we no longer have a tradition of doing shift-work. The robust expectation is that we'll have a daytime job and go home in the evenings. If you offer me an overnight job at +30% pay, I'll take it for a while, but I'll still be job-hunting for a real daytime job. That's not sustainable, which is why on-call is how we're solving the one night-operator task we couldn't automate out of existence: incident response.

Everyone needs some way to do incident response, even if they're 15 people with a big idea and a website, far too small to be doing follow-the-sun rotations. Are they supposed to make it clear that they only guarantee availability during certain hours? I think that idea has some legs, but the site will be negatively compared with the site next door that offers 24/7 availability (at the cost of little sleep for their few engineers).

Forcing change to the idea that Ops-type work is always done with a pager attached and unknown extra hours will take a shit-ton of work. Sea changes like that don't happen naturally. We cross-faded from night operators to on-call rotations due to the changing nature of the role: there wasn't enough work on the 11pm to 7am shift to keep someone fully occupied, so we tacked those duties onto the 7am-3pm crew (who now work a nicer 9am to 5pm schedule).

The only way to break the need for on-call for Ops-type roles is to stop making availability promises when you're not staffed to support it with people responding as part of their normal working hours. If your support-desk isn't answering the phone, site availability shouldn't be promised.

It's that or unionizing the entire sector.

Staff Engineer

Last August I was promoted to Staff Engineer at HelloSign (Dropbox). That parenthetical is important, which I'll get to in a bit. If you've been closely following my career, you noticed that I was promoted to Staff Engineer at HelloSign back in July 2018. So how am I getting promoted again? To the same title? Well, the bit you missed is that HelloSign got bought by Dropbox in February of 2019. Now that we're two years past the merger, here is what happened to my title in the last three years:

  1. July 2018 - promoted to Staff Engineer at HelloSign! I honestly hadn't heard the term 'Staff engineer' before then. It was a welcome surprise.
  2. January 2019 - word breaks about the merger. Eek.
  3. February 2019 - my merger-packet sees me move from Staff Engineer to IC4 (Lead). Apparently, Staff Engineers and Vice Presidents were the two titles Dropbox wasn't letting in. My pay gets a 5% bump, and I get my first-ever Restricted Stock Unit grant (they really wanted me to stick around).
  4. January 2020 - Dropbox has their performance review cycle. Because all of the HelloSigners have less than 12 months tenure (merger was in February, remember), none of us get promotions. I start actively working towards Dropbox Staff Engineer by clawing my way into cross-organizational meetings where I can.
  5. March 2020 - pandemic, flash crash, market turmoil, suddenly working from home for everyone, mass uncertainty.
  6. July 2020 - mid-cycle performance reviews, with upper management telling everyone that due to budget reasons we'll be promoting half the people we normally do.
  7. Late July 2020 - promotion lists come out, and a lot of HelloSigners are on it! Yay! Including me, to Staff Engineer. This surprised me a lot because I was expecting the January 2021 cycle to be the earliest that could happen. This happened because I was functionally a foundational engineer for the platform/devops side of things, had incredible system intuition (you get that after 5 years working for a place), and knew how to communicate. Promotion comes with another 5% pay bump (geobanding; I'm not in the Bay Area, I'm in the low-cost middle of America) and a sizable RSU grant.

Which brings us to now, half a year after the promotion. I'm asking myself what has changed?


First off, being Staff at a company with 50 engineers (HelloSign 2018) is quite different from being Staff at a company with over 10x that many (Dropbox 2019). Those two engineering organizations operate incredibly differently. When Dropbox said that our Staff engineers wouldn't come over as Staff, this is why: to be Staff at Dropbox you have to have cross-org impact, and by definition freshly merged HelloSign had no cross-org impact. QED.

When our team finally hooked up with Dropbox SRE (they don't do Devops) we learned what 10x scale means. The role our team of less than 10 people played in the HelloSign infrastructure was filled by 10+ teams in the Dropbox infrastructure. For individual Dropbox engineers it meant most were in a tiny, well-constrained box as compared to the wide scope each of my teammates enjoyed.

More importantly, 10x scale means that Dropbox infrastructure-engineering was writing new distributed systems, where HelloSign infrastructure-engineering was wiring together off-the-shelf distributed systems. Very different scopes and job duties, and why this new IC5 probably couldn't pass the IC2 coding pre-screen well enough to get in front of an actual person.


Second, HelloSign hasn't had a Dropbox-style Staff Engineer before, so I'm kind of inventing the role as I go. Functionally, I've taken the title as official sanction to voice my opinions and gather people together on my own authority. Before, I was more likely to work through channels and try to get various managers to assemble a process to solve a problem. It's now my actual job to influence strategy, rather than focus on how to implement strategy.

Doing these things on my own authority has worked solely because the managers involved are letting me. Without that support I'd be doing the things I was doing before the promotion, but with a bit better pay. This applies to managers on the Dropbox side as well; without their invites to process-meetings, IC5 would be a promotion in name only.

I was involved in some strategy work in 2019, along with the other former-Staff engineers. But since August I've leveled up, out of my exclusive focus on the HelloSign org and starting to take on strategy work inside the whole Dropbox context. In key ways, I'm now in the room where it happens. I've wanted to be here for years.


Third, where I fit in the overall multi-year strategy for HelloSign is not well defined. We had a terrible run of luck in October 2020 that resulted in several highly visible availability incidents, which resulted in a tiger-team to fix that shit and also make sure we don't get that bad again. This has been working; our SLOs have been 100% passing for the last three months. The work of rebuilding our users' trust takes far longer, though, so we still need to be sure our future designs focus on availability.

The interaction between Product Management and Engineering Management is still somewhat opaque to me, and I need to fix that. My team only works with Product Management indirectly, when Engineering comes to us for support building something Product is pushing for. My role in software features is a bit iffy, since features rarely involve my domain. But when they do, I should be there. I still don't know the correct attention-split there. Currently our pre-merger CTO is doing all of that strategy work.


Fourth, our org has room for another Staff Engineer on the software side. I recently asked myself if I'd left any marked trails for HelloSign's second Staff Engineer, and found that I hadn't. My HelloSign-specific strategy work is a bit elevated from where it was pre-promotion, but not by much. My Dropbox-side work is rather different, but that's more a reflection of Dropbox SRE, something a software-side Staff Engineer wouldn't be involved in anyway.

I'm a bit uncomfortable with that, but that's part of what comes with doing something no one else has done before. My management now knows about the lack, and says they're going to work on it. Meanwhile, I'll be pushing my nose into more strategy work.

We already have a law for this, a whole bunch of them in fact, and you were there when they were passed. We hated them when they arrived, because we knew the all-seeing-eye was getting new glasses.

Our problem this week was not that the all-seeing-eye lacked standing for intelligence-gathering and charging. Our problem this week was that the all-seeing-eye was looking the other way.

That is our problem. Systemic racism. And systemic racism isn't solved through passing crime-and-punishment legislation.

Use the tools you already have. You have way more than you need already, you just need to use them differently.