
...and why this is different than blockchain/cryptocurrency/web3.

Unlike the earlier crazes, AI is obviously useful to the layperson. ChatGPT finished what tools like Midjourney started, and made the average person in front of a browser go, "oh, I get it now." That is something blockchain, cryptocurrencies, and Web3 never managed. The older fads were cool to tech nerds and finance people, but not to the average 20-year-old trying to make ends meet through three gig-economy jobs (except as a get-rich-quick scheme).

Disclaimer: This post is all about the emotional journey of AI-tech, and isn't diving into the ethics. We are in late-stage capitalism; ethics gets imposed on a technology well after it has been on the market. For a more technical take-down of generative AI, read my post from April titled "Cognitive biases and LLM/AI". ChatGPT-like technologies are exploiting human cognitive biases baked into our very genome.

For those who have avoided it, the art of marketing is all about emotional manipulation. What emotions do your brand colors evoke? What keywords inspire feelings of trust and confidence? The answers to these questions are why every 'security' page on a SaaS product's site has the phrase "bank-like security" on it: banks evoke feelings of safe stewardship and security. This is relevant to the AI gold rush because before Midjourney and ChatGPT, AI was perceived as "fancy recommendation algorithms" such as those found on Amazon and the old Twitter "for you" timeline; after Midjourney and ChatGPT, AI became "the thing that can turn my broken English into fluent English," which is much more interesting.

The perception change caused by Midjourney and ChatGPT is why you see every tech company everywhere trying to slather AI on their offerings. People see AI as useful now, and all these tech companies want to be seen as selling the most useful thing on the market. If you don't have AI, you're not useful; companies that are not useful won't grow, and tech companies that aren't growing are bad tech companies. QED, late-stage capitalism strikes again.

It's just a fad

Probably not. This phase of the hype cycle is a fad, but we've reached the point where, if you have a content database 10% the size of the internet, you can algorithmically generate human-seeming text (or audio, or video) without paying a human to do it. This isn't going to change when the hype fades; the tech is already here and will continue to improve so long as it isn't regulated into the grave. This tech is an existential threat to the content-creation business, which includes such fun people as:

  • People who write news articles
  • People who write editorials
  • People who write fiction
  • People who answer questions for others on the internet
  • People who write HOW TO articles
  • People who write blog posts (hello there)
  • People who do voice-over work
  • People who create bed-track music for podcasts
  • People who create image libraries (think Getty Images)
  • People who create cover art for books
  • People who create fan art for commission

The list goes on. The impact here will be similar to how streaming services affected musician and actor income streams: profound.

AI is going to fundamentally change the game for a number of industries. It may be a fad, but for people working in the affected industries this fad is changing the nature of work. I still say AI itself isn't the fad, the fad is all the starry-eyed possibilities people dream of using AI for.

It's a bullshit generator, it's not real

Doesn't matter. AI is right often enough to fit squarely into the human cognitive biases toward trusting it. Not all engines are the same; Google Bard and Microsoft Bing have some famous failures here, but this will change over the next two years. AI answers are right often enough, and helpful often enough, that such answers are worth looking into. Again, I refer you to my post from April titled "Cognitive biases and LLM/AI".

Today (May 1, 2023) ChatGPT is the Apple iPhone to Microsoft and Google's feature-phones. Everyone knows what happened when Apple created the smartphone market, and the money doesn't want to be on the not-Apple side of that event. You're going to see extreme innovation in this space to try to knock ChatGPT off its perch (first mover is not a guarantee of being the best mover), and the success metric is going to be "doesn't smell like bullshit."

Note: "Doesn't smell like bullshit," not, "is not bullshit". Key, key difference.

Generative AI is based on theft

This sentiment is based on the training sets used for these learning models, and also on a liberal interpretation of copyright fair use. Content creators are beginning to create content under new licenses that specifically exclude use in training-sets. To my knowledge, these licenses have yet to be tested in court.

That said, this complaint about theft is the biggest threat to the AI gold rush. People don't like thieves, and if AI gets a consensus definition of thievery, trust will drop. Companies following an AI-at-all-costs playbook to try and not get left behind will have to pay close attention to user perceptions of thievery. Companies with vast troves of user-generated data that already have a reputation for remixing, such as Facebook and Google, will have an easier time of this because users already expect such behavior from them (even if they disapprove of it). Companies with high trust for being safe guardians of user-created data will have a much harder time unless they're clear from the outset about the role of user-created data in training models.

The perception of thievery is the thing most likely to halt the fad-period of AI, not being a bullshit generator.

Any company that ships AI features is losing my business

The fad phase of AI means just about everyone will be doing it, so you're going to have some hard choices to make. The people who can stick to this are the kind of people who are already self-hosting a bunch of things, and are fine with adding a few more. For the rest of us, there are harm-reduction techniques like using zero-knowledge encryption for whatever service we use for file-sync and email. That said, even the hold-out companies may reach for AI if it looks to have real legs in the marketplace.


Yeah. Like it or not, AI development is going to dominate the next few years of big-tech innovation.

I wrote this because I keep having this conversation with people, and this makes a handy place to point folk at.

Working for face-eating leopards

There is a meme that started somewhere during the post-Brexit period when the reality of what the United Kingdom's departure from the European Union really meant started hitting home. It got picked up by people in the US to talk about voter regret for electing Trump. It goes like this:

"I didn't think the leopard would eat my face," said the woman who voted for the Let Leopards Eat Faces party.

https://knowyourmeme.com/memes/leopards-eating-peoples-faces-party

I'm thinking about this right now with regard to the reductions in force (RIFs) happening in US-based big tech. We don't know for sure why this is happening, but there are several competing (and probably overlapping) theories:

  • A major hedge fund is pushing the majors to cut staffing in a cynical bid to reduce salary inflation in the industry, and make their job of investing in growth companies easier by reducing their payroll expenses.
  • Inflation-related softening in consumer and business consumption hitting the growth percentages of these companies hard enough they have to make up the gap in profitability, which means cuts.
  • Seeing the majors cut staffing means you can avoid a public relations hit by me-too-ing your own reduction in force in a cynical bid to reduce salary inflation and make hiring less expensive over the next year or so.
  • Several of the peer companies you picked for your Radford salary survey have done RIFs, so you cut some high-cost staff to further reduce salary inflation and make hiring less expensive over the next year or so.

You might notice a theme here, which is what reminded me of the face-eating-leopard parable. There is a piece of advice I tell people at work that relates to this:

(DayJob) is a publicly traded SaaS company in the United States; forget this at your peril.

This is a nonspecific warning, but I mean it in the sense that we all are still working for a face-eating leopard; just one that's more domesticated than many of the others. When push comes to shove, it'll still eat faces. Yes, the benefits are pretty good and they haven't done a RIF yet; but do not mistake this for a sign that they will not ever perform a RIF. In case you need it specified, the theme in the above list is "reduce the pace of salary inflation and make hiring less expensive over the next year or so," which can be accomplished several ways, one of which is a RIF.

There are other ways to reduce the pace of salary inflation and make hiring cheaper:

  • Stop focus-hiring in "high cost metros" like San Francisco, New York, and London and instead focus hiring in cheaper metros like Atlanta and Dublin. This is great for two reasons. First, it makes the base-salary of most of your new talent rather lower than the base salary of the high cost metro talent; second, salary inflation is compressed in these lesser metros so your talent stays cheaper. The pandemic made this option more palatable due to the move to remote working styles.
  • Start hiring in friendly foreign markets that are cheaper than US markets. This takes significant investment, but is a move well known to the tech industry: off-shoring. Eastern Europe is eight to ten hours ahead of San Francisco for much of the year, which makes it a great place to start adding talent to work towards "follow the sun" support of your systems. Eastern Europe's cost of living is relatively low, which means the talent comes cheaper. Eight hours in the other direction from SF is the middle of the Pacific; it's a big ocean, which is part of why you see Australian and Chinese centers instead. Some opt to go the India route and deal with a 10 or 14 hour time difference instead.
  • Close foreign development centers to bring more of the workforce into the friendlier US labor relations regime. Lots of European countries have mandatory notice periods for layoffs and RIFs, which really slows down how fast you can cut costs when you need to. "At-will" employment laws in the US mean you can almost always do a same-day termination. This option is best for companies that did the previous point during past contractions.
  • Move more work to contractors. This moves the benefits problem to another company, and if you need to cut contractors that's rather cheaper. Microsoft famously got in trouble for this one and ended up under a court judgment to offer full-time benefits for contractors serving longer than 18 months. Which meant contractors never served more than 18 months, ever.

All four of these moves are things you do when you have some warning that winds are shifting, or you already know you need to add percentages to your profitability line-items in your quarterly/annual reports. A domesticated leopard will do more of the above slow-shifting before biting faces off through a RIF. That leopard will still bite faces off if the above doesn't move the profitability needle fast enough.


What does this mean for those of us working for face-eating leopards?

First and foremost, it means being defensive with your finances. If you are working for this particular variety of face-eating leopard (publicly traded US tech company compensating in part through stocks) then you are probably in the top 5% of US income. Before I got a job with one of these leopards I didn't understand how friends in the industry could say things like:

I left Company X today. I plan to take a few months off before seriously looking for my next thing.

Who has that kind of money lying around? I sure didn't. Then I started working for one of them and got stock compensation, and then I understood. Those Restricted Stock Units meant quarterly infusions of cash-equivalents I had to do something with, so I saved them. It turns out a lot of us do that too, since the savings rate of the top 1% of the US income list is over 30%. That gives us a lot of leeway to build up an emergency fund, which we all still need to do. If you're living hand to mouth, which is actually easy when your rent is $4000/mo, then you're at risk of having a really bad time if it's your turn to have your face eaten.

Second, if you're living in a high cost metro and are working for a company with a sizable remote workforce, you are at elevated risk of getting your face eaten and having them repost your job somewhere like New Orleans. Being more at risk means you need to be more diligent about making sure your finances can survive 5 months of unemployment. The US technical job market is getting realigned now that the money has figured out you can have a successful business by hiring outside the major metros. More and more, when companies are facing equally qualified candidates in New York City and Cincinnati, they pick the Cinci candidate because they're cheaper.

Third, and much less helpful, work towards changing the US labor market (or relocating to a labor market where employees are treated better by policy) to make face-eating harder to do. A majority of the European Union has laws on the books requiring a notice period for reductions of any kind; same-day terminations are rare and shocking. By proxy, a lot of the places colonized by EU members have similar protections. Getting three weeks' warning that you will be out of a job means you can say goodbye, work on a transition plan, and otherwise have time to mourn. It still sucks to lose your job, but it'll hurt less.

Allowing 'root cause analysis'

"Root cause analysis" as a term invokes a strong response from the (software) reliability industry. The most common complaint:

There's no such thing as a "root" cause. They're always complex failures. There's this nice book about complex failures and the Three Mile Island incident I recommend you read.

This is the correct take. In any software system, even software+hardware ones, what triggered the incident is almost never the sole and complete cause of it. An effective technique for figuring out the causal chain is the "five whys" method. To take a fictional example:

  • Why did Icarus die?
    • Because he flew too close to the sun.
  • Why did he fly too close to the sun?
    • Because of the hubris of man.
  • Why was Icarus full of hubris?
    • Because his low altitude tests passed, and he made incorrect assumptions.
  • Why did the low altitude tests pass, but not the high altitude ones?
    • Because the material used to attach the feathers became more pliable the closer to the sun he was.
  • What happened after the material became more pliable?
    • It lost feathers, which compromised the flight profile, leading to a fatal incident.

From this exercise we can determine the causal chain:

Structural integrity monitoring was not included in the testing regime, which led to a failure to detect reduced binding efficiency at higher ambient temperatures. The decision to move to high-altitude testing was made in the absence of a data-driven testing framework. The combination of missing data-driven testing and missing materials surveillance allowed the lead investigator to make a fatal decision.

This is a bit more comprehensive than the parable's typical 'hubris of man' moral. There are whole books written about building a post-incident review process for software systems, with the goal of maximizing the learning earned from each incident. There are no root causes, only complex failures; and you reason about complex failures differently than you would while hunting for a lone root cause.
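If it helps to see the method as a structure, here is a minimal sketch (Python, using the Icarus parable from above; the class and field names are made up, not any particular incident tool):

    # Minimal sketch of a five-whys exercise recorded as data. The questions
    # and answers are the parable from above; the names are illustrative.
    from dataclasses import dataclass

    @dataclass
    class Why:
        question: str
        answer: str

    five_whys = [
        Why("Why did Icarus die?", "He flew too close to the sun."),
        Why("Why did he fly too close to the sun?", "The hubris of man."),
        Why("Why was Icarus full of hubris?",
            "His low-altitude tests passed, and he made incorrect assumptions."),
        Why("Why did the low-altitude tests pass, but not the high-altitude ones?",
            "The feather binding became more pliable the closer to the sun he flew."),
        Why("What happened after the material became more pliable?",
            "Feathers were lost, compromising the flight profile."),
    ]

    # The causal chain is all of the answers read together, not any single 'root'.
    for step in five_whys:
        print(f"{step.question}\n  -> {step.answer}")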

Except.

Except.

The phrase 'root cause analysis' is freaking everywhere, in spite of a decade of SREs pushing back against the term. There are a few reasons for this, but to start the explanation, here is another example from my history. My current manager knows better than to call incident reviews a "root cause analysis." Yet when a vendor of ours shits the bed fantastically enough that we get in trouble with our own customers, they are the first to press our account managers for an RCA Report. Why?

Because an RCA Report is also a customer relations tool. My manager is code-switching between our internal engineer-driven incident review processes, which don't use the term, and the customer relations concept, which manifestly does. Not at all coincidentally, other SREs grind their teeth any time a customer asks for an RCA Report, because what we do isn't Root Cause Analysis.

Aside: For all that we as an SRE community focus on availability and build customer-centered metrics to base our SLOs on, SRE as a job function is often highly disconnected from the actual people-to-people interface with customers. Some companies will allow a senior SRE onto a customer call to better explain a failure chain, but my understanding is this practice is rare; most companies are more concerned that the senior SRE will over-share in some way that will compromise the company's liability stances.

At the end of the day, customers want answers to three questions about the incident they're concerned over, all so they can reassess the risk of continuing to do business with us:

  1. What happened?
  2. What did you do to fix it?
  3. What are you doing to prevent this happening again?

    "What happened?" isn't supposed to be a single causative action, customers want to know if we understand the causal chain. 'Root cause' in this context is less a technical term meaning 'single', and more a term of art meaning 'failure'.

The other reason that 'RCA' shows up as often as it does is that the term itself shows up in general safety engineering literature. DayJob has had a few availability incidents lately; after one of them a customer asked for a type of report I'd never heard of before: a CAPA report. I had to google that one. CAPA means corrective and preventive actions, also known as questions 2 and 3 above. My industry has been building blameless post-mortem processes for a decade plus now, and never used CAPA. The concept was instantly familiar, even if I hadn't heard the acronym before.
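To make the mapping concrete, here is a small hedged sketch (Python; the field names and example strings are hypothetical, not any real reporting tool's schema) of what a customer-facing RCA report actually carries: the causal-chain narrative, plus the corrective and preventive actions that make up a CAPA.

    # Hedged sketch: a customer-facing "RCA" report is really three answers.
    # Field names and example values are hypothetical.
    from dataclasses import dataclass, field

    @dataclass
    class RcaReport:
        what_happened: str                                       # question 1: the causal chain, explained
        corrective_actions: list = field(default_factory=list)   # question 2: the "CA" in CAPA
        preventive_actions: list = field(default_factory=list)   # question 3: the "PA" in CAPA

    report = RcaReport(
        what_happened="A vendor outage cascaded into elevated API error rates.",
        corrective_actions=["Failed traffic over to the secondary vendor region."],
        preventive_actions=["Added synthetic checks against the vendor endpoint."],
    )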

I found a blog post from a firm specializing in safety inside the beverage industry that describes how an RCA interacts with a CAPA. The beverage industry operates machine plants with bottle fillers and everything else involved in food handling. The software industry, um, doesn't (usually). Yet because beverage manufacturing and software manufacturing are both industrial processes, the same concepts apply to both. If you read into what an RCA is for them, it reads a lot like a complex failure report.

This led me to a realization: "Root cause analysis" is a term of art, not a technical term.

Engineers look at that phrase and cringe, because what it says is not what it means, and we find that kind of ambiguity to be a bug. This is probably why we're not allowed near customers unless we have close supervision or experience in customer-facing technical writing.

Nowadays I'm hearing internal folk decry "root cause analysis" as the wrong way to think about problems, and I nod and tell them they're right. While also telling them that we'll continue to use that term with customers because that's what customers are asking for, and we'll write those RCA reports like the complex failure analyses they are. We'll even give them a CAPA report and not call it a CAPA (unless they ask for a CAPA by name).

24/7 availability and oncall

There is another meme going around OpsTwitter the past few days. This is a familiar refrain in discussions about on-call and quality of life. But the essence is:

If you need 24/7 availability, you also need follow-the-sun support. That way any crisis is in someone's day-time, regular-work day.

I agree, this is the standard you need to judge your solution against. However, this solution has some assumptions baked into it. Here are a few:

  • You have three teams operating 8 timezones from their neighbors (or two teams spanning 12).
  • No one set of employment laws spans 24 timezones, so these teams will each be under different labor and national holiday laws.
  • Each timezone needs an on-call rotation.
  • The minimum viable on-call rotation per timezone is 3 people, but 6 is far more friendly to the people supporting the site.
  • Due to staffing reasons, your global on-call team needs 9 to 18 people on it (or 6 to 12 for a 12-timezone spread); the sketch after this list works through the arithmetic.
  • Due to the timezone spread, each team will have minimal coordination with each other. What coordination there is will involve one team being on a video-call at o-dark-thirty.
  • You need enough work to keep 9 to 18 people busy in addition to their fire-watch duties.
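To put numbers on the staffing bullets above, here is a back-of-the-envelope sketch (Python; the rotation sizes are the assumptions from the list, not an industry standard):

    # Back-of-the-envelope on-call staffing, using the rotation sizes assumed
    # above: 3 people as the bare minimum, 6 for a humane rotation.
    def oncall_headcount(teams, minimum=3, friendly=6):
        """Return (bare minimum, humane) global on-call headcount."""
        return teams * minimum, teams * friendly

    print(oncall_headcount(3))  # 8/8/8 split: (9, 18)
    print(oncall_headcount(2))  # 12/12 split: (6, 12)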

You know who can pull that off? Really big companies.

You know who can't pull that off? Companies employing in a single labor market, such as the US.

I mean, Guam is a US holding (UTC+10). Theoretically if you had a team in Guam and a team in New York City (UTC-4) you would have a 10 hour difference between them. You could sort of make this work while staying inside the US tax and legal domains, but you're reliant on the technical base of people in Guam which has a population a bit smaller than Des Moines, Iowa. Colonialism means people will think about hiring in Ireland or India before Guam. To do this you need to go international.

Most smaller companies won't go international, way too much paperwork involved at a time when you're supposed to be lean and fast.

I have worked with follow-the-sun exactly once in my career. We had Ops teams in the US East Coast, Poland, and China. It wasn't a true 8/8/8 split, but it was enough of a split that "after hours maintenance" always happened in someone's daytime. It was pretty dang nice. Then we had a layoff round and the Poland office went away. And we fired our Chinese Ops folk to save money, which meant we were waking the US staff up at o-dark-thirty to do maintenance.


I'm conflicted on this advice. On the surface, I totally get the sentiment: keep the annoying shit in everyone's daytime and don't force people to work overnights.

As an industry, we have history with split-shifting and incident response. The night operator used to be a common feature of any company with a computer, the person (or team of people) responsible for loading/unloading tapes, swapping paper for the printer, collating and packaging print-jobs, making sure the batch-jobs ran, calling the SYSOP when things smelled off, and a bunch of other now-forgotten tasks. Most organizations have gotten rid of the night operator for a lot of reasons. The two biggest being:

  1. We've mostly automated the job out of existence. Tapes (if tapes are still in use) are handled by robots. Print-jobs now show up as a PDF in your email. Batch-schedulers are really fancy now, so getting those batch-jobs run is highly automated. Monitoring systems keep track of way more things than we could track in the night operator era.
  2. No one wants to work overnights. Like, no one. At least not enough to easily find a replacement when the one person who does like it decides to leave/retire.

(The second point hit WWU while I was there)

As an industry we no longer have a tradition of doing shift-work. The robust expectation is that we'll have a day-time job and go home in the evenings. If you offer me an overnight job at +30% pay -- I'll take it for a while, but I'm still job-hunting for a real daytime job. Not sustainable, which is why on-call is how we're solving the one night operator task we couldn't automate out of existence: incident response.

Everyone needs some way to do incident response, even if they're 15 people with a big idea and a website -- far too small to be doing follow-the-sun rotations. Are they supposed to make it clear that they only guarantee availability during certain hours? I think there are some legs in that idea, but the site will be negatively compared with the site next door that offers 24/7 availability (at the cost of little sleep for their few engineers).

Forcing a change to the idea that Ops-type work is always done with a pager attached and unknown extra hours will take a shit-ton of work. Sea changes like that don't happen naturally. We cross-faded from night operators to on-call rotations due to the changing nature of the role: there wasn't enough work to do on the 11pm to 7am shift to keep someone fully occupied, so we tacked those duties onto the 7am-3pm crew (who now work a nicer 9am to 5pm schedule).

The only way to break the need for on-call for Ops-type roles is to stop making availability promises when you're not staffed to support it with people responding as part of their normal working hours. If your support-desk isn't answering the phone, site availability shouldn't be promised.

It's that or unionizing the entire sector.

Staff Engineer

Last August I was promoted to Staff Engineer at HelloSign (Dropbox). That parenthetical is important, which I'll get to in a bit. If you've been closely following my career, you noticed that I was promoted to Staff Engineer at HelloSign back in July 2018. So how am I getting promoted again? To the same title? Well, the bit you missed is that HelloSign got bought by Dropbox in February of 2019. Now that we're two years past the merger, here is what happened to my title in the last three years:

  1. July 2018 - promoted to Staff Engineer at HelloSign! I honestly hadn't heard the term 'Staff engineer' before then. It was a welcome surprise.
  2. January 2019 - word breaks about the merger. Eek.
  3. February 2019 - my merger-packet sees me move from Staff Engineer to IC4 (Lead). Apparently, Staff Engineers and Vice Presidents were the two titles Dropbox wasn't letting in. My pay gets a 5% bump, and I get my first-ever Restricted Stock Unit grant (they really wanted me to stick around).
  4. January 2020 - Dropbox has their performance review cycle. Because all of the HelloSigners have less than 12 months tenure (merger was in February, remember), none of us get promotions. I start actively working towards a Dropbox Staff Engineer, by clawing my way into cross-organizational meetings where I can.
  5. March 2020 - pandemic, flash crash, market turmoil, suddenly working from home for everyone, mass uncertainty.
  6. July 2020 - mid-cycle performance reviews, with upper management telling everyone that due to budget reasons we'll be promoting half the people we normally do.
  7. Late July 2020 - promotion lists come out, and a lot of HelloSigners are on it! Yay! Including me, to Staff Engineer. This surprised me a lot because I was expecting the January 2021 cycle to be the earliest that could happen. This happened because I was functionally a foundational engineer for the platform/devops side of things, had incredible system intuition (you get that after 5 years working for a place), and knew how to communicate. Promotion comes with another 5% pay bump (geobanding; I'm not in the Bay Area, I'm in the low-cost middle of America) and a sizable RSU grant.

Which brings us to now, half a year after the promotion. I'm asking myself what has changed?


First off, being Staff at a company with 50 engineers (HelloSign 2018) is quite different from being Staff at a company with over 10x that many (Dropbox 2019). Those two engineering organizations operate incredibly differently. When Dropbox said that our Staff engineers wouldn't come in as Staff, this is why: to be Staff at Dropbox you have to have cross-org impact, and by definition freshly merged HelloSign had no cross-org impact. QED.

When our team finally hooked up with Dropbox SRE (they don't do Devops) we learned what 10x scale means. The role our team of less than 10 people played in the HelloSign infrastructure was filled by 10+ teams in the Dropbox infrastructure. For individual Dropbox engineers it meant most were in a tiny, well-constrained box as compared to the wide scope each of my teammates enjoyed.

More importantly, 10x scale means that Dropbox infrastructure-engineering was writing new distributed systems, where HelloSign infrastructure-engineering was wiring together off-the-shelf distributed systems. Very different scopes and job duties, and why this new IC5 probably couldn't pass their IC2 coding pre-screen well enough to get in front of an actual person.


Second, HelloSign hasn't had a Dropbox-style Staff Engineer before, so I'm kind of inventing the role as I go. Functionally, I've taken the title as official sanction to voice my opinions and gather people together on my own authority. Before, I was more likely to work through channels and try to get various managers to assemble a process to solve a problem. It's now my actual job to influence strategy, rather than focus on how to implement strategy.

Doing these things on my own authority has worked solely because the managers involved are letting me. Without that support I'd be doing the things I was doing before the promotion, but with a bit better pay. This applies to managers on the Dropbox side as well; without their invites to process-meetings, IC5 would be a promotion in name only.

I was involved in some strategy work in 2019, along with the other former-Staff engineers. But since August I've leveled up, moving out of my exclusive focus on the HelloSign org and starting to take on strategy work inside the whole Dropbox context. In key ways, I'm now in the room where it happens. I've wanted to be here for years.


Third, where I fit in the overall multi-year strategy for HelloSign is not well defined. We had a terrible run of luck in October 2020 that resulted in several highly visible availability incidents, which spawned a tiger-team to fix that shit and also make sure we don't get that bad again. This has been working; our SLOs have been 100% passing for the last three months. The work of rebuilding trust with our users takes far longer though, so we still need to be sure our future design focuses on availability.

The interaction between Product Management and Engineering Management is still somewhat opaque to me, and I need to fix that. My team only works with Product Management indirectly, when Engineering comes to us for support building something Product is pushing for. My role in software features is a bit iffy, since features rarely involve my domain. But when they do, I should be there. I still don't know the correct attention-split there. Currently our pre-merger CTO is doing all of that strategy work.


Fourth, our org has room for another Staff Engineer on the software side. I recently asked myself if I'd left any marked trails for HelloSign's second Staff Engineer, and found that I hadn't. My HelloSign-specific strategy work is a bit elevated from where it was pre-promotion, but not by much. My Dropbox side is rather different, but that's more a reflection of Dropbox SRE - something a software SRE wouldn't be involved in anyway.

I'm a bit uncomfortable with that, but that's part of what comes with doing something no one else has done before. My management now knows about the lack, and says they're going to work on it. Meanwhile, I'll be pushing my nose into more strategy work.

We already have a law for this, a whole bunch of them, and you were there when they were passed. We hated it when it arrived, because we knew the all-seeing-eye was getting new glasses.

Our problem this week was not that the all-seeing-eye lacked standing for intelligence-gathering and charging. Our problem this week was that the all-seeing-eye was looking the other way.

That is our problem. Systemic racism. And systemic racism isn't solved through passing crime-and-punishment legislation.

Use the tools you already have. You have way more than you need already, you just need to use them differently.

Recession is nigh

Back in August of last year I posted an article in the wake of the inverted bond-yield curve sparking recession worries, titled, 'Recession is coming'. My intent was to provide a survival guide for people in the tech industry who have never had to live through one. In there I made a prediction about what the coming recession could look like:

Do we know what kind of recession it'll be?

Not yet. The crystal-ball at this stage is suggesting two different kinds of recession:

  • A stock-market panic. Sharp, deep, but ultimately short since the fundamentals are still OKish. Will topple the most leveraged companies.
  • The Trump trade-wars trigger a global sell-off. Long, but shallow. Will affect the whole industry, but won't be a rerun of the 2008-2010 disaster.

It turns out the first one was the right one, triggered by an epidemic/pandemic virus with a death-rate somewhere between higher and much higher than the flu. The market is sliding hard and everyone is losing lots of money, triggered by fears of what impacts prevention, containment, and remediation will have on the overall economy. From this part of the cycle it's hard to be sure if this will be a short or long one; a lot depends on what the true impacts of COVID-19 are worldwide.

Complicating matters now is the oil price-war being waged by Russia and Saudi Arabia against US shale oil. Also, the US Presidential Election can still spook investors. Once it feels like the worst of COVID-19 is behind us, a recovery is likely. However, any recession of any kind will trigger a financial reckoning for companies holding a lot of corporate debt (Halliburton) and companies that exist on continual large investment (Uber).

Obsolescence

Obsolescence is a strange thing in the tech-industry. A software library like ImageMagick can become obsolete in two weeks. An Operating System can last 10 years, but a patch-level may only last a few days. A given CPU microarchitecture can drive compute for a decade or more.

Apple and iOS have the upgrade-forcing mechanism of dropping older models from iOS updates, paired with app makers publishing minimum iOS versions. Stay on an old iOS, get old apps, and eventually the ones you need stop working as APIs shift. Time to upgrade.

There are times when it feels like my whole job is to keep adding floors to a sky-scraper that is slowly getting flooded a floor at a time. To stay still is to drown. To embrace older tech requires building water-tight compartments. My job involves enough welding to keep the older parts of the tower dry while we work to upgrade ourselves to a new floor.

Right now I'm personally feeling it in my pocket. My phone is running Android 8.0 and last got a patch-release in January of 2018. And yet, my apps are running fine and still get updates. My apps also perform fine. What I use my phone for also performs fine.

As of this blog-post it'll be two years out of patches, which is pathologically out of date. Even though the hardware is just fine. It burns me that there isn't much recyclable in here, and that the carbon costs of getting a new phone are such that you have to use it for two or more years to 'pay back' what it cost to manufacture.

I've long had a problem with tying something with a 10 year replacement cycle like a refrigerator to something with a 2 year replacement cycle like a tablet. Need to update the in-door display surface for your fridge? Get a new fridge. Great for the refrigerator manufacturer, not so great for my capital budget.

The same goes for your car-computer. General Motors doesn't support Android Auto on your dash-unit and Spotify won't let you use that ancient version? Get a new car. Now, I suspect we'll see subscription services to upgrade the in-dash computer every three or so years, but it's early yet. We don't know how the used car market will adapt to this sort of obsolescence. Cars used to last twenty years, but you can guarantee Android Auto or Apple CarPlay won't support twenty-year-old computing hardware. Maybe the car companies will be successful in turning cars into subscription services, and they'll do sensor transplants every 7 years to keep their rolling stock earning. Who knows.

Software is in everything these days, and that means everything is following the software obsolescence schedule. Including former durable goods like automobiles, major kitchen appliances, and in-home security goods. This could be a WastePocalypse until we figure out how to maintain a computing platform over a decade plus.

Supporting 10-year-old 'mobile' hardware feels nigh impossible, but we're doing it on the server-computing side and have for the last 30 years. You can compile the latest OpenSSL libraries for an Intel CPU that was made in 2010, and it will support 2048-bit RSA and elliptic-curve certificates over TLS 1.3. No, the long-term problem isn't supporting the encryption needed to talk to the mother-ship. The problem is supporting the driver for the cellular modem, the driver for the display devices, and whatever artisanal hand-crafted batch of ARM CPU is driving the head unit.

Truth to tell, figuring out how to get 20 years out of a car chock full of ARM/x86 CPUs will be what allows us to support smartphones over similar periods. Unfortunately, we're probably 10 years from that. Until then, just upgrade and hope your ewaste doesn't end up polluting the water of whatever country is too poor to police the illegal dumpers.

Who would win in a fight

Who would win in a fight? An Imperial Star Destroyer or the Enterprise? Show your work.

That was an actual question on a job-application I filled out sometime around 2010. I saw it for the silly it was and gave a completely serious answer. Not too surprisingly, I didn't even get a phone-screen for that one.

Because I keep bringing this oddball question up, and the answer causes some minor debate, I figured I'd write down my answer here to save myself from having to repeat myself a lot.

Immutable infrastructure

This concept has confused me for years, but I'm beginning to get caught up on enough of the nuance to hold opinions on it that I'm willing to defend.

This is why I have a blog.

What is immutable infrastructure?

This is a movement of systems design that holds to a few principles:

  • You shouldn't have SSH/Ansible enabled on your machines.
  • Once a box/image/instance is deployed, you don't touch it again until it dies.
  • Yes, even for troubleshooting purposes.

Pretty simple on the face of it. Don't like how an instance is behaving? Build a new one and replace the bad instance. QED.
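As a sketch of what "replace, don't repair" can look like in practice (assuming AWS with boto3 and instances managed by an auto-scaling group; the instance ID is hypothetical):

    # Hedged sketch: instead of SSHing in to fix a misbehaving instance,
    # terminate it and let the auto-scaling group launch a fresh replacement.
    # Assumes AWS credentials are configured and the instance is in an ASG.
    import boto3

    def replace_instance(instance_id):
        autoscaling = boto3.client("autoscaling")
        autoscaling.terminate_instance_in_auto_scaling_group(
            InstanceId=instance_id,
            ShouldDecrementDesiredCapacity=False,  # the ASG builds a replacement
        )

    # Hypothetical instance ID, for illustration only.
    replace_instance("i-0123456789abcdef0")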

Refining the definition

The yes, even for troubleshooting purposes concept encodes another concept rolling through the industry right now: observability.

You can't do true immutable infrastructure until after you've already gotten robust observability tools in place. Otherwise, the SSHing and monkey-patching will still be happening.

So, Immutable Infrastructure and Observability. That makes a bit more sense to this old-timer.

Example systems

There are two design-patterns that structurally force you into taking observability principles into account, due to how they're built:

  • Kubernetes/Docker-anything
  • Serverless

Both of these make traditional log-file management somewhat more complex, so if engineering wants their Kibana interface into system telemetry, they're going to have to come up with ways to get that telemetry off of the logic and into a central place using something other than log-files. Telemetry is the first step towards observability, and one most companies do instinctively.
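The lowest-effort version of this in a container world is structured logs to stdout, which the container runtime or a log shipper then forwards to the central store. A minimal sketch (Python standard library only; the service name is made up, and the shipping pipeline itself is assumed rather than shown):

    # Minimal sketch: emit structured JSON telemetry to stdout so the container
    # runtime / log shipper can forward it, instead of writing local log files.
    import json
    import logging
    import sys
    import time

    class JsonFormatter(logging.Formatter):
        def format(self, record):
            return json.dumps({
                "ts": time.time(),
                "level": record.levelname,
                "msg": record.getMessage(),
                "service": "checkout",  # hypothetical service name
            })

    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logging.getLogger().addHandler(handler)
    logging.getLogger().setLevel(logging.INFO)

    logging.info("order placed")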

Additionally, the (theoretically) rapid iterability of containers/functions means much less temptation to monkey-patch. Slow iteration means more incentive to SSH or monkey-patch, because that's faster than waiting for an AMI or template-image to bake.

The concept so many seem to miss

This is pretty simple.

Immutable infrastructure only applies to the pieces of your infrastructure that hold no state.

And its corollary:

If you want immutable infrastructure, you have to design your logic layers to not assume local state for any reason.

Which is to say, immutable infrastructure needs to be a DevOps thing, not just an Ops thing. Dev needs to care about it as well. If that means in-progress file-transforms get saved to a memcache/gluster/redis cluster instead of the local filesystem, so be it.
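For instance, a worker that used to spool in-progress work to the local filesystem can park it in Redis instead, which keeps the instance itself disposable. A hedged sketch using redis-py (the host and key names are made up):

    # Hedged sketch: keep in-progress state in Redis rather than on the local
    # filesystem, so any instance can pick the work up and any instance can die.
    import redis

    cache = redis.Redis(host="state-cache.internal", port=6379)  # hypothetical host

    def save_progress(job_id, payload):
        # Expire after an hour so abandoned jobs don't pile up.
        cache.set(f"job:{job_id}:progress", payload, ex=3600)

    def load_progress(job_id):
        return cache.get(f"job:{job_id}:progress")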

This also means that you will have both immutable and bastion infrastructures in the same overall system. Immutable for your logic, bastion for your databases and data-stores. Serverless for your NodeJS code, maintenance-windows and patching-cycles for your Postgres clusters. Applying immutable patterns to components that take literal hours to recover/re-replicate introduces risk in ways that treating them for what they are, mutable, would not.

Yeahbut, public cloud! I don't run any instances!

So, you've gone full Serverless, all of your state is sitting in something like AWS RDS, ElastiCache, and DynamoDB, and you're using Workspaces for your 'inside' operations. No SSHing, to be sure. This is about as automated as you can get. Even so, there are still some state operations you are subject to:

  • RDS DB failovers still yield several to many seconds of "The database told me to bugger off" errors.
  • RDS DB version upgrades still require a carefully choreographed dance to ensure your site continues to function, even if it gets glitchy for short periods.
  • ElastiCache failovers still cause extensive latency as your underlying SDKs catch up to the new read/write replica location.

You're still not purely immutable, but you're as close as you can get in this modern age. Be proud.
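The usual way to ride out those failover blips is retry-with-backoff in the application layer. A minimal hedged sketch (assuming a Postgres-flavored RDS endpoint and psycopg2; the connection string is hypothetical):

    # Hedged sketch: retry the connection with exponential backoff so an RDS
    # failover window turns into a brief slowdown instead of hard errors.
    import time
    import psycopg2

    def connect_with_retry(dsn, attempts=5, base_delay=0.5):
        for attempt in range(attempts):
            try:
                return psycopg2.connect(dsn)
            except psycopg2.OperationalError:
                if attempt == attempts - 1:
                    raise
                # Wait for DNS to flip to the newly promoted primary.
                time.sleep(base_delay * (2 ** attempt))

    # Hypothetical connection string.
    conn = connect_with_retry("host=example-rds-endpoint dbname=app user=app")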