In defense of job titles


I've noticed that the startup-flavored tech industry has certain preferences when it comes to your job-title. They like them flat. A job tree can look like this:

  1. Intern (write software as a student)
  2. Software Engineer (write software as a full time salaried employee)
  3. Lead Software Engineer (does manager things in addition to software things)
  4. Manager (mostly does manager things; if they used to be a Software Engineer, maybe some of that if there is time)

Short and to the point. The argument in favor of this is usually put something like this:

A flat hierarchy keeps us from having to rank everyone against some arbitrary rules. What, really, is the quantifiable difference between a 'junior' and a 'senior' engineer? We are all engineers. If you do manager things, you're a lead. When you put Eclipse/Vim/VisualStudio behind you, then you're a manager.

No need to judge some engineers as better than other engineers. Easy. Simple. Understandable.

Over in the part of the tech industry that isn't dominated by startups, but is dominated by, say, US Federal contracting rules, you have a very different hierarchy.

  1. Associate Systems Engineer
  2. Junior Systems Engineer
  3. Systems Engineer
  4. Senior Systems Engineer
  5. Lead Systems Engineer (may do some managery things, may not)
  6. Principal Systems Engineer (the top title for technical stuff)

Because civil service is like that, each of those has a defined job description, with responsibilities and skill requirements. Such job-reqs read something like:

Diagnoses and troubleshoots problems involving multiple interconnected systems. Proposes complete systems and integrates them. Works highly independently, and is effective in coordinating work with other, separate systems teams. May assume a team-lead role.

Or for a more junior role:

Diagnoses and troubleshoots problems for a single system in an interconnected ecosystem. Proposes changes to specific systems and integrates them. Follows direction when implementing new systems. Works somewhat independently, guided by senior engineers.

Because the incentives differ (winning US government contracting agreements, versus not having to judge engineers as better or worse than each other), having multiple classes of 'systems engineer' makes sense for the non-startup case.


I'm arguing that the startup-stance (flat) is more unfair. Yes, you don't have to judge people as 'better-than'.

On the job-title, at least.

Salaries are another story. Those work very much like Enterprise Pricing Agreements, where no two agreements look the same. List-price is only the opening bid of a protracted negotiation, after all. This makes sense, as hiring a tech-person is a 6-figure annual recurring cost in most large US job-markets (after you factor in fringe benefits, employer-side taxes, etc.). That's an Enterprise contract right there; no wonder each one is a unique snowflake of specialness.

I guarantee that the person deciding what a potential hire's salary is going to be will consider time in the field, experience with the company's technologies, ability to operate in a fast-paced and changing environment, and ability to drive change as the factors in the initial offer. All things that were involved in the job-req examples I posted above. Unconscious biases, such as those around race and gender, also factor in.

By the time a new Software Engineer walks in the door for their first day they've already been judged better/worse than their peers. Just, no one knows it because it isn't in the job title.

If the company is one that bases annual compensation improvements on the previous year's performance, this judgment happens every year and compounds. Which is how you can get a hypothetical 7 person team that looks like this:

  1. Lead Software Engineer, $185,000/yr
  2. Software Engineer, $122,000/yr
  3. Software Engineer, $105,000/yr
  4. Software Engineer, $170,000/yr
  5. Software Engineer, $150,000/yr
  6. Software Engineer, $135,000/yr
  7. Software Engineer, $130,000/yr

Why is Engineer 4 paid so much more? Probably because they were the second hire after the Lead, meaning they have more years of raises under their belt, plus possibly a guilt-raise from when the third hire happened, the team suddenly needed a Lead, and Engineer 1 was picked for it when they weren't.

One job-title, $65,000 spread in annual compensation. Obviously, no one has been judged better or worse than each other.

Riiiiight.

Then something like #TalkPay happens. Engineer number 4 says in Slack, "I'm making 170K. #TalkPay". Engineer number 3 chokes on her coffee. Suddenly, five engineers are now hammering to get raises because they had no idea the company was willing to pay that much for a non-Lead.

Now, if that same series were done but with a Fed-style job series?

  1. Lead Software Engineer, $185,000/yr
  2. Junior Software Engineer, $122,000/yr
  3. Associate Software Engineer, $105,000/yr
  4. Senior Software Engineer, $170,000/yr
  5. Senior Software Engineer, $150,000/yr
  6. Software Engineer, $135,000/yr
  7. Software Engineer, $130,000/yr

Only one person will be banging on doors, Engineer number 5. Having a job-series allows you to have overt pay disparity without having to pretend everyone is equal to everyone else. It makes overt the judgment that is already being made, which makes the system more fair.


Is this the best of all possible worlds?

Heck no. Balancing unconscious-bias mitigation (rigid salary schedules and titles) against compensating your high performers (individualized salary negotiations) is a fundamentally hard problem with unhappy people no matter what you pick. But not pretending we're all the same helps keep things somewhat more transparent. It also makes certain kinds of people not getting promotions somewhat more obvious than certain kinds of people getting half the annual raises of everyone else.

Digital Doorknobs


Doorknobs are entering the Internet of (unsecured) Things.

However, they've been there for quite some time already. As anyone who has been in a modern hotel any time in the last 30 years knows, metal keys are very much a thing of the past. The hotel industry made this move for a lot of reasons, a big one being that a plastic card is a lot easier to replace than an actual key.

They've also been there for office access for probably longer, as anyone who has ever had to wave their butt or purse at a scan-pad beside a door knows. Modern versions are beginning to get smartphone hookups, allowing an expensive (but employee-owned) smartphone with an app on it and Bluetooth enabled to replace that cheap company-owned prox-pass.

They're now moving into residences, and I'm not a fan of this trend. Most of my objection comes from being in Operations for as long as I have. The convenience argument for internet-enabling your doorknob is easy to make:

  • Need emergency maintenance when you're on vacation? Allow the maintenance crew in from your phone!
  • Assign digital keys to family members you can revoke when they piss you off!
  • Kid get their phone stolen? Revoke the stolen key and don't bother with a locksmith to change the locks!
  • Want the door to unlock just by walking up to it? Enable Bluetooth on your phone, and the door will unlock itself when you get close!

This is why these systems are selling.

Security

I'm actually mostly OK with the security model on these things. The internals I've looked at involved PKI and client-certificates. When a device like a phone gets a key, that signed client-cert is allowed to access a thingy. If that phone gets stolen, revoke the cert at the CA and the entire thing is toast. The conversation between the device and the mothership is done over a TLS connection using client-certificate authentication, which is actually more secure than most banks' website logins.
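
If you want to poke at that model from the command line, the moving parts look roughly like this. A minimal sketch, assuming a hypothetical lock-vendor endpoint at lock.example.com and locally issued device certificates; this is not any particular vendor's real setup:

  # Present the device's client certificate during the TLS handshake
  # (client-certificate authentication, as described above).
  openssl s_client -connect lock.example.com:443 \
      -cert device.crt -key device.key -CAfile vendor-ca.pem

  # Revocation: a stolen phone's cert lands on the CA's CRL, so verification
  # against the CA plus its CRL fails and that key is dead.
  cat vendor-ca.pem vendor-ca.crl > ca-and-crl.pem
  openssl verify -crl_check -CAfile ca-and-crl.pem device.crt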

The handshake over Bluetooth is similarly cryptoed, making it less vulnerable to replay attacks.

Where we run into problems is the intersection of life-safety and the flaky nature of most residential internet connections. These things need to be able to let people in the door even when CenturyLink is doing that thing it does. If you err on the side of getting in the door, you end up caching valid certs on the lock-devices themselves, opening them up to offline attacks if you can jam their ability to phone home. If you err on the side of security, an internet outage is a denial-of-access attack.

The Real Objection

It comes down to the differences in the hardware and software replacement cycles, as well as certain rare but significant events like a change of ownership. The unpowered deadbolt in your front door could be 20 years old. It may be vulnerable to things like bump-keys, but you can give the pointy bits of metal (keys) to the next residents on your way to your new place and never have to worry about it. The replacement cycle on the whole deadbolt is probably the same as the replacement cycle of the owners, which is to say 'many years'. The pin settings inside the deadbolt may get changed more often, but the whole thing doesn't get changed much at all.

Contrast this with the modern software ecosystem, where if your security product hasn't had an update in 6 months it's considered horribly out of date. At the same time, due to the iterative nature of most SaaS providers and the APIs they maintain, an API version may get 5 years of support before getting shut down. Build a hardware fleet based on that API, and you have a hardware fleet that ages at the rate of software. Suddenly, that deadbolt needs a complete replacement every 5 years, and costs about 4x what the unpowered one did.

Most folks aren't used to that. In fact, they'll complain about it. A lot.

There is another argument to make about embedded systems (that smart deadbolt) and their ability to handle ever more computationally expensive cryptography. Not to mention changing radio specs like Bluetooth and WiFi that will render old doorknobs unable to speak to the newest iPhone. Which is to say, definitely expect Google and Apple to put out doorknobs in the not too distant future. Amazon is already trying.

All of this makes doorknob makers salivate, since it means more doorknobs will be sold per year. Also, the analytics over how people use their doors? Priceless. Capitalism!

It also means that doorknob operators, like homeowners, are going to be in for a lot more maintenance work to keep them running. Work that simply wasn't there before. Losing a phone is pretty clear, but what happens when you sell your house?

You can't exactly 'turn over the keys' if they're 100% digital and locked into your Google or Apple identities. Doorknob makers are going to have to have voluntary ownership-transfer protocols.

Involuntary transfer protocols are going to be a big thing. If the old owners didn't transfer, you could be locked out of the house. That could mean a locksmith coming in to break into your house, and having to replace every deadbolt in the place with brand new ones. Or it could mean arguing with Google over who owns your home and how to prove it.

Doing it wrong has nasty side-effects. If you've pissed off the wrong people on the internet, you could have griefers coming after your doorknob provider, and you could find yourself completely locked out of your house. The more paranoid will have to get Enterprise contracts and manage their doorknobs themselves so they have full control over the authentication and auth-bypass routes.

Personally, I don't like that added risk exposure. I don't want my front door able to be socially engineered out of my control. I'll be sticking with direct-interaction, physical-token authentication methods instead of digitally mediated token auth methods.

Terraforming in prod


Terraform from HashiCorp is something we've been using in prod for a while now. Simply ages in terraform-years, which means we have some experience with it.

It also means we have some seriously embedded legacy problems, even though it's less than two years old. That's the problem with rapidly iterating infrastructure projects that don't build in production usecases from the very start. You see, a screwdriver is useful in production! It turns screws and occasionally opens paint cans. You'd think that would be enough. But production screwdrivers conform to external standards of screwdrivers, are checked in and checked out of central tooling because quality control saves everyone from screwdriver related injuries, and have support for either outright replacement or refacing of the toolface.

Charity.wtf has a great screed on this you should read. But I wanted to share how we use it, and our pains.

In the beginning

I did our initial Terraform work in the Terraform 0.6 days. The move to production happened about when 0.7 came out. You can see how painfully long ago that was (or wasn't). It was a different product back then.

Terraform Modules were pretty new back when I did the initial build. I tried them, but couldn't get them to work right. At the time I told my coworkers:

They seem to work like puppet includes, not puppet defines. I need them to be defines, so I'm not using them.

I don't know if I had a fundamental misunderstanding back then or if that's how they really worked. But they work like defines now, and all correctly formatted TF infrastructures use them, on pain of being seen as terribly unstylish. Not using them means there is a lot of repeating-myself in our infrastructure.

Because we already had an AMI baking pipeline that worked pretty well, we never bothered with Terraform Provisioners. We build ours entirely on making the AWS assets versionable. We tried with CloudFormation, but gave that up due to the terrible horrible no good very bad edge cases that break iterations. Really, if you have to write Support to unfuck an infrastructure because CF can't figure out the backout plan (and that backout is obvious to you), then you have a broken product. When Terraform gets stuck, it just throws up its hands and says HALP! Which is fine by us.

Charity asked a question in that blog-post:

Lots of people seem to eventually end up wrapping terraform with a script.  Why?

I wrote it for two big, big reasons.

  1. In TF 0.6, there was zero support for sharing a Terraform statefile between multiple people (without paying for Atlas), and that critically needed to be done. So my wrapper implemented the sharing layer. Terraform now supports several remote-state methods for this out of the box; it didn't back then.
  2. Terraform is a fucking foot-gun with a flimsy safety on the commit-career-suicide button. It's called 'terraform destroy' and has no business being enabled for any reason in a production environment ever. My wrapper makes getting at this deadly command require a minute or two of intentionally circumventing the safety mechanism (a sketch of that guard follows). Which is a damned sight better than the routine "Are you sure? y/n" prompt we're all conditioned to just click past. Of course I'm in the right directory! Yes!
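
For the curious, the safety mechanism in a wrapper like this doesn't have to be complicated. Here is a minimal sketch (not our actual script; the sentinel-file name is made up) of a guard that refuses 'terraform destroy' unless someone has deliberately armed it in the last few minutes:

  #!/bin/bash
  # tf-wrap: pass commands through to terraform, but keep 'destroy' behind a safety.
  set -euo pipefail
  if [[ "${1:-}" == "destroy" ]]; then
    arm_file=".destroy-armed"   # hypothetical sentinel file
    # Refuse unless the sentinel exists and is less than five minutes old.
    if [[ ! -f "$arm_file" ]] || (( $(date +%s) - $(stat -c %Y "$arm_file") > 300 )); then
      echo "destroy is disarmed. 'touch $arm_file' and re-run within 5 minutes." >&2
      exit 1
    fi
    rm -f "$arm_file"
  fi
  exec terraform "$@"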

And then there was legacy.

We're still using that wrapper-script. Partly because reimplementing it for the built-in statefile sharing is, like, work and what we have is working. But also because I need those welded on fire-stops on the foot-gun.

But, we're not using modules, and really should. However, integrating modules is a long and laborious process, and we haven't seen enough benefit to outweigh the risk. To explain why, I need to explain a bit about how terraform works.

You define resources in Terraform, like a security group with rules. When you do an 'apply', Terraform checks the statefile to see if the resource has been created yet, and what state it was last seen in. It then compares that last known state with the current state of the infrastructure to determine what changes need to be made. Pretty simple. The name of a resource in the statefile follows a clear format. For non-module resources the string is "resource_type.resource_name", so our security group example would be "aws_security_group.prod_gitlab". For module resources it gets more complicated, something like "module_name.resource_type.resource_name". That's not the exact format -- which is definitely not bulk-sed friendly -- but it works for the example I'm about to share.

If you change the name of a resource, Terraform's diff shows the old resource disappearing and a brand new one appearing, and treats it as such. Sometimes this is what you want. Other times, like when they're your production load-balancers and delete-and-recreate will cause a multi-minute outage, you don't.
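
For reference, here is roughly what those two address styles look like in 'terraform state list' output. The 'gitlab' module name is hypothetical; the module form just puts a 'module.<module_name>.' prefix in front of the usual <type>.<name> address:

  terraform state list
  #   aws_security_group.prod_gitlab                 <- plain resource
  #   module.gitlab.aws_security_group.prod_gitlab   <- the same resource, owned by a 'gitlab' module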

To do a module conversion, this is the general workflow (a command-level sketch follows the list).

  1. Import the module and make your changes, but don't apply them yet.
  2. Use 'terraform state list' to get a list of the resources in your statefile, and note the names of the resources to be moved into modules.
  3. Use 'terraform state rm' to remove the old resources from the statefile.
  4. Use 'terraform import' to import the existing resources into the statefile under their now module-based names.
  5. Use 'terraform plan' to make sure there are zero changes.
  6. Commit your changes to the terraform repo and apply.
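
Concretely, steps 2 through 5 look something like this for the security-group example from earlier. This is only a sketch: the module name 'gitlab' and the security-group ID are hypothetical, and newer Terraform releases can collapse the rm-then-import dance into a single 'terraform state mv'.

  terraform state list | grep aws_security_group
  terraform state rm aws_security_group.prod_gitlab
  terraform import module.gitlab.aws_security_group.prod_gitlab sg-0123abcd
  terraform plan    # should report zero changes if the surgery was clean

  # The same rename in one step, where 'terraform state mv' is available:
  terraform state mv aws_security_group.prod_gitlab module.gitlab.aws_security_group.prod_gitlab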

Seems easy, except.

  • You need to lock the statefile so no one else can make changes when you're doing this. Critically important if you have automation that does terraform actions.
  • This lock could last a couple of hours depending on how many resources need to be modified.
  • This assumes you know how terraform statefiles work with resource naming, so you need someone experienced with terraform to do this work.
  • Your modules may do some settings subtly different than you did before, so it may not be a complete null-change.
  • Some resources, like Application Load Balancers, require a heartbreaking number of resources to define, which makes for a lot of import work.
  • Not all resources even have an import developed. Those resources will have to be deleted and recreated.
  • Step 1 is much larger than you think, due to dependencies from other resources that will need updating for the new names. Which means you may visit step 1 a few times by the time you get a passing step 5.
  • This requires a local working Terraform setup, outside of your wrapper scripts. If your wrapper is a chatbot and no one has a local TF setup, this will need to be done on the chatbot instance. The fact of the matter is that you'll have to point the foot-gun at your feet for a while when you do this.
  • This is not a change that can be done through Terraform's new change-management-friendly way of packaging changes, so it will be a 'comprehensive' change-request when it comes.

Try coding that into a change-request that will pass auditorial muster. In theory it is possible to code up a bash-script that will perform the needed statefile changes automatically, but it will be incredibly fragile in the face of other changes to the statefile as the CR works its way through the process. This is why we haven't converted to a more stylish infrastructure; the intellectual purity of being stylish isn't yet outweighing the need to not break prod.

What it's good for

Charity's opinion is close to my own:

Terraform is fantastic for defining the bones of your infrastructure.  Your networking, your NAT, autoscaling groups, the bits that are robust and rarely change.  Or spinning up replicas of production on every changeset via Travis-CI or Jenkins -- yay!  Do that!

But I would not feel safe making TF changes to production every day.  And you should delegate any kind of reactive scaling to ASGs or containers+scheduler or whatever.  I would never want terraform to interfere with those decisions on some arbitrary future run.

Yes. Terraform is best used in cases where doing an apply won't cause immediate outages or instabilities. Even using it the way we are, without provisioners, means following some rules:

  • Only define 'aws_instance' resources if we're fine with those suddenly disappearing and not coming back for a couple of minutes. Because if you change the AMI, or the userdata, or any number of other details, Terraform will terminate the existing one and make a new one.
    • Instead, use autoscaling-groups and a process outside of Terraform to manage the instance rotations.
  • It's fine to encode scheduled-scaling events on autoscaling groups, and even dynamic-scaling triggers on them.
  • Rotating instances in an autoscaling-group is best done in automation outside of terraform (a sketch of that follows this list).
  • Playing pass-the-IP for Elastic-IP addresses is buggy and may require a few 'applies' before they fully move to the new instances.
  • Cache-invalidation on the global Internet's DNS caches is still buggy as fuck, though getting better. Plan around that.
  • Making some changes may require multiple phases. That's fine, plan for that.
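
Here is the promised sketch of rotating instances in an autoscaling-group outside of Terraform, using the AWS CLI. The group name is made up, and a real rotation script would wait on ASG/ELB health checks rather than sleeping:

  for id in $(aws autoscaling describe-auto-scaling-groups \
        --auto-scaling-group-names prod-web-asg \
        --query 'AutoScalingGroups[0].Instances[].InstanceId' --output text); do
    # Terminate one instance; the ASG launches a replacement from the current launch config.
    aws autoscaling terminate-instance-in-auto-scaling-group \
        --instance-id "$id" --no-should-decrement-desired-capacity
    sleep 300   # crude stand-in for 'wait until the replacement is in service'
  done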

The biggest strength of Terraform is that it looks a lot like Puppet, but for your AWS config. Our auditors immediately grasped that concept and embraced it like they've known about Terraform since forever. Because if some engineer cowboys in a change outside of the CR process, Terraform will back that out the next time someone does an apply, much the way puppet will back out a change to a file it manages. That's incredibly powerful, and something CloudFormation only sort of does.

The next biggest strength is that it is being very actively maintained and tracks AWS API changes pretty closely. When Amazon announces a new service, Terraform will generally have support for it within a month (not always, but most of the time). If the aws-cli can do it, Terraform will also be able to do it; if not now, then very soon.

While there are some patterns it won't let you do, like having two security-groups point to each other on ingress/egress lists (because that is a dependency loop), there is huge scope in what it will let you do.

This is a good tool and I plan to keep using it. Eventually we'll do a module conversion somewhere, but that may wait until they have a better workflow for it. Which may be in a month, or half a year. This project is moving fast.

Or, The Problem of Twitter, Hatemobs, and Denial of Service

The topic of shared blocklists is hot right now. For those who avoid the blue bird, a shared blocklist is much like a shared killfile from Ye Olde Usenet, or an RBL for spam. Subscribe to one, and get a curated list of people you never want to hear from again. It's an idea that's been around for decades, applied to a new platform.

However, internet-scale has caught up with the technique.

Usenet

A Usenet killfile was a feature of NNTP clients where posts meeting a regex would not even get displayed. If you've ever wondered what the vengeful:

*Plonk!*

...was about? This is what it was referring to. It was a public way of saying:

I have put you into my killfile, and I am now telling you I have done so, you asshole.

This worked because in the Usenet days, the internet was a much smaller place. Once in a while you'd get waves of griefers swarming a newsgroup, but that was pretty rare. You legitimately could remove most content you didn't want to see from your view. The *Plonk!* usage still exists today, and I'm seeing some twitter users use it to indicate a block is being deployed. I presume these are veterans of many a Usenet flame-war.

RBLs

Realtime Blackhole Lists (RBLs) were pioneered as an anti-spam technique. Mail administrators could subscribe to these lists, and all incoming mail connections could be checked against them. If the connecting host was listed, the SMTP connection could be outright rejected. The assumption here was that spam comes from unsecured or outright evil hosts, and that blocking them outright is better for everyone.
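
Mechanically, an RBL check is just a DNS lookup: reverse the octets of the connecting IP and query that name under the list's zone. A listed address comes back with a 127.0.0.x answer; an unlisted one returns nothing. A minimal sketch, using a documentation IP and an illustrative zone name rather than any real list:

  # Is 192.0.2.99 listed in the (hypothetical) rbl.example.org zone?
  dig +short 99.2.0.192.rbl.example.org
  # Any 127.0.0.x answer means 'listed'; NXDOMAIN (no output) means it is not.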

This was a true democratic solution in the spirit of free software: Anyone could run one.

That same spirit meant that each RBL had different criteria for listing. Some were zero tolerance, and even one Unsolicited Commercial Email was enough to get listed. Others simply listed whole netblocks, so you could block all cable ISPs, or entire countries.

Unlike killfiles, RBLs were designed to be a distributed system from the outset.

Like killfiles, RBLs are in effect a Book of Grudges. Subscribing to one means subscribing to someone else's grudges. If you shared the maintainer's view of what was grudge-worthy, this was incredibly labor-saving. If you didn't, sometimes things got blocked that shouldn't have been.

As a solution to the problem of spam, RBLs were not the silver bullet. That came with the advent of commercial providers deploying surveillance networks and offering IP-reputation services as part of their paid offerings. The commercial providers were typically able to deploy far wider surveillance than the all-volunteer RBLs did, and as such saw a wider sample of overall email traffic. A wider sample meant that they were less likely to ban a legitimate site for a single offense.

This is still the case today, though email-as-a-service providers like Google and Microsoft are now hosting the entire email stack themselves. Since Google handles a majority of all email on the planet, their surveillance is pretty good.

Compounding the problem for the volunteer-led RBL efforts is IPv6. IPv4 was small enough that you could legitimately tag the entire internet with spam/not-spam without undue resources. IPv6 is vastly larger, and very hard to cover comprehensively without resorting to netblock blocking. And even then, there are enough possible netblocks that scale is a real issue.

Twitter Blocklists

Which brings us to today, and twitter. Shared blocklists are in this tradition of killfiles and RBLs. However, there are a few structural barriers to this being the solution it was with Usenet:

  • No Netblocks. Each user has to be blocked individually; you can't block all of a network, or a country of origin.
  • The number of accounts. Active-users is not the same as total-users. In 2013, the estimated registered-account number was around 810 million. Four years later, this is likely over a billion, which is the same order of magnitude as the IPv4 address space.
  • Ease of setting up a new account. The Twitter version of changing your IP address, changing your username, is very, very easy.

The lack of a summarization technique, the size of the problem space, and the ease of bypassing a block by changing your identifier mean that a shared-blocklist is a poor tool to fight off a determined hatemob. It's still good for casual griefing, where the parties aren't invested enough to break a blocklist.

The idea of the commercialized RBL, where a large company sells account-reputation services, is less feasible here. First of all, such an offering would likely be against the Twitter terms of service. Second, the target market is not mail-server admins, but individual account-holders. A far harder market to monetize.

The true solution will have to come from Twitter itself. Either by liberalizing their ToS to allow those commercial services to develop, or developing their own reputation markets and content tools.

The two sides of this story are:

  • Software requirements are often... mutable over time.
  • Developer work-estimates are kind of iffy.

Here is a true story about the latter. Many have similar tales to tell, but this one is mine. It's about metrics, agile, and large organizations. If you're expecting this to turn into a frAgile post, you're not wrong. But it's about failure-modes.

The setup

My employer at the time decided to go all-in on agile/scrum. It took several months, but every department in the company was moved to it. As they were an analytics company, their first reflex was to try to capture as much workflow data as possible so they could do magic data-analysis things on it. Which meant purchasing seats in an Agile SaaS product so everyone could track their work in it.

The details

By fiat from on-high, story points were effectively 'days'.

Due to the size of the development organization, there were three levels of Product Managers over the Individual Contributors, of which I counted myself one.

The apex Product Managers, of which we had two, were for the two products we sold.

Marketing was also in Agile, as was Sales.

The killer feature

Because we were an analytics company, the CEO wanted a "single pane of glass" to give a snapshot of how close we were to achieving our product goals. Gathering metrics on all of our sprint-velocities, story/epic completion percentages, and story estimates, allowed us to give him that pane of glass. It had:

  • Progress bars for how close our products were to their next major milestones.
  • How many sprints it would take to get there.
  • How many sprints it would take to get to the milestone beyond it.

Awesome!

The failure

That pane of glass was a lying piece of shit.

The dashboard we had to build was based on so many fuzzy measurements that it was a thumb-in-the-wind approximation for how fast we were going, and in what direction. The human bias to trust numbers derived using Science! is a strong one, and they were inappropriately trusted. Which led to pressure from On High for highly accurate estimates, as the various Product Managers realized what was going on and attempted to compensate (remove uncertainty from one of the biggest sources of it).

Anyone who has built software knows that problems come in three types:

  1. Stuff that was a lot easier than you thought
  2. Stuff that was pretty much as bad as you thought.
  3. Hell-projects of tar-pits, quicksand, ambush-yaks, and misery.

In the absence of outside pressures, story estimates are usually pitched at type 2 efforts: the honest estimate. Workplace culture introduces biases to this, urging devs to skew one way or the other. Skew 'easier', and you'll end up overshooting your estimates a lot. Skew 'harder', and your velocity will look great, but capacity planning will suffer.

This leads to an interesting back and forth! The dev-team skews harder for estimates. The PM sees that the team typically exceeds its capacity in productivity, so adds more capacity in later sprints. In theory, equilibrium is reached between work-estimation and work-completion rate. In reality, it means that the trustability of the numbers is kind of low and always will be.

The irreducible complexity

See, the thing is, marketing and sales both need to know when a product will be released so they can kick off marketing campaigns and start warming up the sales funnel. Some kinds of ad-buys are done weeks or more in advance, so slipping product-ship at the last minute can throw off the whole marketing cadence. Trusting in (faulty) numbers means it may look like release will be in 8 weeks, so it's safe to start baking that in.

Except those numbers aren't etched in stone. They're graven in the finest of morning dew.

As that 8 week number turns into 6, then 4, then 2, pressure to hit the mark increases. For a company selling on-prem software, you can afford to miss your delivery deadline so long as you have a hotfix/service-pack process in place to deliver stability updates quickly. You see this a lot with game-dev: the shipping installer is 8GB, but there are 2GB of day-1 patches to download before you can play. SaaS products need to work immediately on release, so all-nighters may become the norm for major features tied to marketing campaigns.

Better estimates would make this process a lot more trustable. But, there is little you can do to actually improve estimate quality.

Bad speaker advice


Last year, while I was developing my talks, I saw a bit of bad advice. I didn't recognize it at the time. Instead, I saw it as a goal to reach. The forum was a private one, and I've long forgotten who the players were. But here is a reconstructed, summarized view of what spurred me to try:

elph1120: You know what I love? A speaker who can do an entire talk from one slide.
612kenny: OMG yes. I saw someguy do that at someconference. It was amazeballs.
elph1120: Yeah, more speakers should do that.
gryphon: Totally.

This is bad advice. Don't do this.

Now to explain what happened...

I saw this, and decided to try and do that for my DevOpsDays Minneapolis talk last year. I got close, I needed 4 slides. Which is enough to fit into a tweet.

See? No link to SlideShare needed! Should be amazing!

It wasn't.

The number one critique I got, by a large, large margin was this:

Wean yourself from the speaker-podium.

In order to do a 4-slide talk, I had to lean pretty hard on speaker-notes. If you're leaning on speaker-notes, you're either tied to the podium or have cue-cards in your hands. Both of these are violations of the modern TED-talk style-guide tech-conferences are following these days. I should have noticed that the people rhapsodizing over one-slide talks were habitues of one of the holdouts of podium-driven talks in the industry.

That said, there is another way to do a speaker-note free talk: the 60-slide deck for a 30 minute talk. Your slides are the notes. So long as you can remember some points to talk about above and beyond what's written on the slides, you're providing value above and beyond the deck you built. The meme-slide laugh inducers provide levity and urge positive feedback. If you're new to speaking this is the style you should be aiming for.

A one-slide talk is PhD-level speaking skills. It means memorizing a 3K-word essay paragraph by paragraph, and reading it back while on stage and on camera. You should not be trying to reach this bar until you're already whatever about public speaking, and have delivered that talk a bunch of times already.

It's not as widely known as I'd hope, but there are a host of workplace protections that apply to non-union, salaried, overtime-exempt workers. Not all of them are written into the legal code; some are, in fact, work-arounds. To explain what I'm talking about, read this:

This is a small sample-set, but it works to illustrate the point I'm about to make.

If you find yourself in the position of reporting a coworker to HR for harassing behavior, then suddenly find your performance reviews solidly in needs-improvement territory, and get fired later, there are workplace protections that will help you get through this, and make the life of your harasser less good.

To get there, here are a few facts and common practices that contribute to the firing, and what could happen afterwards:

  • Performance reviews are as much subjective as objective.
  • Tattling on a co-worker can make the perception of your work move from team player to troublemaker.
  • When the perception shifts like that, top-marks reviews suddenly become remediation-required reviews.
  • Due to US labor law, as amended by State laws, creating a hostile work environment due to sexism, racism, etc., is an unlawful act.
  • In spite of that law, very few cases are seen in court, and even fewer reach a verdict.
  • At-will laws mean you can be fired without stated cause.
  • Everyone has a price for their silence.
  • Pathologic workplace cultures have no room for empathy.

Performance Reviews, and Career Improvement Plans

These are often used as the basis for a firing decision. Not all workplaces do them, but many do. It may be hidden in the OKR process, in 360-degree reviews, or in another company-goal tracking system, but it's still there. Sometimes they're simple exceeds/meets/needs-improvement metrics, or 1-to-5 ranked metrics, but they always have manager input on them.

All of them have some component of plays well with others as one of the tracked metrics. No one likes working with assholes, and this is how they track that. Unfortunately, tattling to mommy that Kenny was mean to her is often seen as not playing well with others.

Buying Your Silence

The severance process you go through after termination is there to buy your silence. Employers know full well there is a marketplace of opinion on good places to work, and if they can keep you from bagging on them on Glassdoor or social media, that's a win for them. You also get a month or two of paid healthcare as you look for someplace new. The method of doing this is called a non-disparagement clause in the severance agreement.

Laws are there to incentivise people to not get caught

If you have a good enough papertrail to plausibly bring suit against the company for one of the legally protected things like racism or sexism, there are strong incentives for them to settle this out of court. Everyone has a price, and most people have a price that doesn't include a written admission of guilt as a requirement. This is why there are so few actions brought against companies in court.

Pathological Empathy

Of the three Westrum Typology types of corporate communication styles (Pathological, Bureaucratic, Generative), it's the pathologic that fundamentally treats non-managers as objects. When you're an object, it doesn't matter if your fee-fees get hurt; what matters is that you're on-side and loyal. If you are seen to be disloyal, you will need to find a new master to swear your fealty to or you will be disposed of through the usual at-will / severance means.

Not all companies are pathologic. The studies I've seen say it's less than a quarter. That said, if the company is big enough you can quite definitely have portions of it that are pathologic while the rest are generative.


That's a lot of framing.

There are certain legal nightmares that companies have with regards to labor laws:

  • Having a now-ex employee bring a discrimination suit against them.
  • Having a class-action suit brought against a specific manager.
  • Having the Department of Labor bring suit against the company for systemic discrimination.

All of these actions are massively public and can't be silenced later. The fact of their filing is damnation enough.

This works for you, even though none of these are likely to come about for your specific case. You see, they're trying to avoid any of that ever happening. To avoid that happening, they need to buy you off. Don't think of this as purely their way of protecting the bad man from any consequence. It is that, in part; it's up to you to make that protection actually painful.

Once the third person has been fired and levered themselves into a $200K hush-money severance package, you can guarantee that The Powers That Be are going to sit the bad man down and explain to him that if he doesn't stop with the hands, they're going to have to Do Something; you're costing us a lot of money. One person doing that is just a whiner trying to extort money. Two people getting that is an abundance of whiners. Three people getting that begins to look like a pattern of behavior that is costing the company a lot of money.

This only works because the consequences of simply ignoring your whiny ass are now dire. Thanks, New Deal!

What my CompSci degree got me


The "what use is a CompSci degree" meme has been going around again, so I thought I'd interrogate what mine got me.

First, a few notes on my career journey:

  1. Elected not to go to grad-school. Didn't have the math for a masters or doctorate.
  2. Got a job in helpdesk, intending to get into Operations.
  3. Got promoted into sysadmin work.
  4. Did some major scripting as part of Y2K remediation, first big coding project after school.
  5. Got a new job, at WWU.
  6. Microsoft released PowerShell.
  7. Performed a few more acts of scripting. Knew I so totally wasn't a software engineer.
  8. Manage to change career tracks into Linux. Started learning Ruby as a survival mechanism.
  9. Today: I write code every day. Still don't consider myself a 'software engineer'.

Elapsed time: 20ish years.

As it happens, even though my career has been ops-focused I still got a lot out of that degree. Here are the big points.

Sysadmins and risk-management


This crossed my timeline today:

This is a risk-management statement that contains all of a sysadmin's cynical-bastard outlook on IT infrastructure.

Disappointed because all of their suggestions for making the system more resilient to failure are shot down by management. Or only some of them are, which might as well be all of them, since disasters are still left uncovered. Commence drinking heavily to compensate.

Frantically busy because they're trying to mitigate all the failure-modes their own damned self using not enough resources, all the while dealing with continual change as the mission of the infrastructure shifts over time.

A good #sysadmin always expects the worst.

Yes, we do. Because all too often, we're the only risk-management professionals a system has. We understand the risks to the system better than anyone else. A sysadmin who plans for failure is one who isn't first on the block when a beheading is called for by the outage-enraged user-base.

However, there are a few failure-modes in this setup that many, many sysadmins fall foul of.

Perfection is the standard.

And no system is perfect.

Humans are shit at gut-level risk-assessment, part 1: If you've had friends eaten by a lion, you see lions everywhere.

This abstract threat has been made all too real, and now lions. Lions everywhere. For sysadmins it's things like multi-disk RAID failures, UPS batteries blowing up, and restoration failures because an application changed its behavior and the existing backup solution was no longer adequate to restore state.

Sysadmins become sensitized to failure. Those once-in-ten-years failures, like datacenter transfer-switch failures or Amazon region outages, seem immediate and real. I knew a sysadmin who was paralyzed in fear over a multi-disk RAID failure in their infrastructure. They used big disks, which weren't 100% 'enterprise' grade. Recoveries from a single-disk failure were long as a result. Too long. A disk going bad during the recovery was a near certainty in their point of view, never mind that the disks in question were less than 3 years old, and the RAID system they were using had bad-block detection as a background process. That window of outage was too damned long.

Humans are shit at gut-level risk-assessment, part 2: Leeroy Jenkins sometimes gets the jackpot, so maybe you'll get that lucky...

This is why people think they can win mega-million lotteries and beat casinos at roulette. Because sometimes, you have to take a risk for a big payoff.

To sysadmins who have had friends eaten by lions, this way of thinking is completely alien. This is the developer who suggests swapping out the quite functional MySQL databases for Postgres. Or the peer sysadmin who really wants central IT to move away from forklift SAN-based disk-arrays for a bunch of commodity hardware, FreeBSD, and ZFS.

Mm hm. No.

Leeroy Jenkins management and lion-eaten sysadmins make for really unhappy sysadmins.

When it isn't a dev or a peer sysadmin asking, but a manager...

Sysadmin team: It may be a better solution. But do you know how many lions are lurking in the transition process??

Management team: It's a better platform. Do it anyway.

Cue heavy drinking as everyone prepares to lose a friend to lions.


This is why I suggest rewording that statement:

A good #sysadmin always expects the worst.
A great #sysadmin doesn't let that rule their whole outlook.

A great sysadmin has awareness of business risk, not just IT risks. A sysadmin who has been scarred by lions and sees large felines lurking everywhere will be completely miserable in an early or mid-stage startup. In an early-stage startup, the big risk on everyone's mind is running out of money and losing their jobs; so that once-in-three-years disaster we feel so acutely is not the big problem it seems. Yeah, it can happen and it could shutter the company if it does happen; but the money spent remediating that problem would be better spent expanding marketshare enough that we can assume we'll still be in business 2 years from now. A failure-obsessed sysadmin will not have job satisfaction in such a workplace.

One who has awareness of business risk will wait until the funding runway is long enough that pitching redundancy improvements will actually defend the business. This is a hard skill to learn, especially for people who've been pigeon-holed as worker-units their entire career. I find that asking myself one question helps:

How likely is it that this company will still be here in 2 years? 5? 7? 10?

If the answer to that is anything less than 'definitely', then there are failures that you can accept into your infrastructure.

The origins of on-call work


On September 6th, Susan Fowler posted an article titled, "Who's on-call?", talking about evolving on-call duties between development teams and SRE teams. She has this quote at the top:

I'm not sure when in the history of software engineering separate operations organizations were built and run to take on the so-called "operational" duties associated with running software applications and systems, but they've been around for quite some time now (by my research, at least the past twenty years - and that's a long time in the software world).

My first job was with a city government, and many of the people I was working with started at that city when they decided to computerize in 1978. Most of them have retired or died off by now. In 1996, when I started there, the original dot-com boom was very much on the upswing, and that city was still doing things the way they'd been done for years.

I got into the market in time to see the tail end of that era. One of the things I learned there was the origins of many of the patterns we see today. To understand the origins of on-call in IT systems, you have to go back to the era of serial networking, when the term 'minicomputer' was distinct from 'microcomputer', which were marketing terms to differentiate from 'mainframe'.

IT systems of the era employed people to do things we wouldn't even consider today, or would work our damnedest to automate out of existence. There were people who had, as their main job, duties such as:

  • Entering data into the computer from paper forms.
    • Really. All you did all day was punch in codes. Computer terminals were not on every desk, so specialists were hired to do it.
    • The worst part is: there are people still doing this today.
  • Kick off backups.
  • Change backup tapes when the computer told them to.
  • Load data-tapes when the computer told them to.
    • Tape stored more than spinning rust, so it was used as a primary storage medium. Disk was for temp-space.
    • I spent a summer being a Tape Librarian. My job was roboticized away.
  • Kick off the overnight print-runs.
  • Collate printer output into reports, for delivery to the mailroom.
  • Execute the overnight batch processes.
    • Your crontab was named 'Stephen,' and you saw him once a quarter at the office parties. Usually very tired-looking.
  • Monitor system usage indicators by hand, and log them in a paper logbook.
  • Keep an Operations Log of events that happened overnight, for review by the Systems Programmers in the morning.
  • Follow runbooks given to them by Systems Programming for performing updates overnight.
  • Be familiar with emergency procedures, and follow them when required.

Many of these things were only done by people working third shift. Which meant computer-rooms had a human on-staff 24/7. Sometimes many of them.

There was a side-effect to all of this, though. What if the overnight Operator had an emergency they couldn't handle? They had to call a Systems Programmer to advise a fix, or to come in and fix it. In the 80's, when telephone modems came into their own, they might even be able to dial in and fix it from home.

On-Call was born.

There was another side-effect to all of this: it happened before the great CompSci shift in the colleges, so most Operators were women. And many Systems Programmers were too. This was why my first job had mostly women in IT management and senior technical roles. This was awesome.

A Systems Programmer, as they were called at the time, was less a Software Engineering role as we would define it today, and more DevOps, if not outright SysAdmin. They had coding chops, because much of systems management at the time required that. Their goal was more wiring together purchased software packages to work coherently, or modifying purchased software to work appropriately.


Time passed, and more and more of the overnight Operator's job was automated away. Eventually, keeping an overnight Operator on staff was no longer justified. Or you simply couldn't hire one to replace the Operator who just quit. However, the systems were still running 24/7, and you needed someone ready to respond to disasters. On-call got more intense, since you no longer had an experienced hand in the room at all times.

The Systems Programmers earned new job-titles. Software Engineering started to be a distinct skill-path and career, so was firewalled off in a department called Development. In those days, Development and Systems people spoke often; something you'll hear old hands grumble about with DevOps not being anything actually new. Systems was on-call, and sometimes Development was if there was a big thing rolling out.

Time passed again. Management culture changed, realizing that development people needed to be treated and managed differently than systems people. Development became known as Software Engineering, and became its own career-track. The new kids getting into the game never knew the close coordination with Systems that the old hands had, and assumed this separation was the way it had always been. Systems became known as Operations; to the chagrin of some old Systems hands, who resented being called an 'Operator', typically a very junior title. Operations remained on-call, and kept informal lists of developers who could be relied on to answer the phone at o-dark-thirty in case things went deeply wrong.

More time, and the separation between Operations and Software Engineering became deeply entrenched. Some bright sparks realized that there were an awful lot of synergies to be had with close coordination between Ops and SE. And thus, DevOps was (re)born in the modern context.

Operations was still on-call, but now it was open for debate about how much of Software Engineering needed to be put on the Wake At 3AM In Case Of Emergency list.

And that is how on-call evolved from the minicomputer era, to the modern era of cloud computing.

You're welcome.
