24/7 availability and oncall

Another meme has been going around OpsTwitter the past few days. It's a familiar refrain in discussions about on-call and quality of life, and the essence is:

If you need 24/7 availability, you also need follow-the-sun support. That way any crisis is in someone's day-time, regular-work day.

I agree, this is the standard you need to judge your solution against. However, this solution has some assumptions baked into it. Here are a few:

  • You have three teams, each operating 8 timezones from its neighbors (or two teams 12 timezones apart).
  • No one set of employment laws spans 24 timezones, so these teams will each be under different labor and national holiday laws.
  • Each timezone needs an on-call rotation.
  • The minimum viable on-call rotation per timezone is 3 people, but 6 is far more friendly to the people supporting the site.
  • Due to staffing reasons, your global on-call team needs 9 to 18 people on it (or 6 to 12 for a 12 timezone spread).
  • Due to the timezone spread, the teams will have minimal coordination with each other. What coordination there is will involve one team being on a video-call at o-dark-thirty.
  • You need enough work to keep 9 to 18 people busy in addition to their fire-watch duties.
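
The headcount numbers in that list are just multiplication, and a rough sketch makes it easy to plug in your own figures. The function names here are mine and purely illustrative; the 3 and 6 people per site are the minimum and friendlier rotation sizes from the list above.

```python
# Illustrative sketch only: the function names are hypothetical, and the
# per-site rotation sizes (3 minimum, 6 friendlier) come from the list above.

def global_oncall_headcount(sites: int, people_per_site: int) -> int:
    """Total on-call staff when every site runs its own rotation."""
    return sites * people_per_site


def staffing_range(sites: int, minimum_per_site: int = 3,
                   friendly_per_site: int = 6) -> tuple[int, int]:
    """Return (bare minimum, friendlier) global headcount for a site count."""
    return (
        global_oncall_headcount(sites, minimum_per_site),
        global_oncall_headcount(sites, friendly_per_site),
    )


if __name__ == "__main__":
    for sites in (3, 2):
        low, high = staffing_range(sites)
        spread = 24 // sites  # hours of the day each site covers
        print(f"{sites} sites, {spread}h each: {low} to {high} people on-call")
```

Three sites gives 9 to 18 people and two sites gives 6 to 12, which is where the ranges above come from.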

You know who can pull that off? Really big companies.

You know who can't pull that off? Companies employing in a single labor market, such as the US.

I mean, Guam is a US holding (UTC+10). Theoretically, if you had a team in Guam and a team in New York City (UTC-4), you would have a 10-hour gap between them (Guam runs 14 hours ahead, which is 10 hours going the other way around the clock). You could sort of make this work while staying inside the US tax and legal domains, but you're reliant on the technical base of people in Guam, which has a population a bit smaller than Des Moines, Iowa. Colonialism means people will think about hiring in Ireland or India before Guam. To do this you need to go international.
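
For anyone checking the Guam math: the gap that matters for coverage is the shorter way around the 24-hour clock, not the raw difference in UTC offsets. A small sketch, with a function name I made up for illustration and offsets hard-coded rather than pulled from a timezone database (New York is taken at its daylight-saving offset, as above):

```python
# Illustrative sketch: clock_gap_hours is a hypothetical helper, and the
# UTC offsets are hard-coded assumptions rather than looked up from tzdata.

def clock_gap_hours(utc_offset_a: float, utc_offset_b: float) -> float:
    """Shortest distance between two UTC offsets around a 24-hour clock."""
    raw = abs(utc_offset_a - utc_offset_b) % 24
    return min(raw, 24 - raw)


GUAM = 10          # UTC+10, no daylight saving
NEW_YORK_DST = -4  # UTC-4 during daylight saving time

if __name__ == "__main__":
    print(clock_gap_hours(GUAM, NEW_YORK_DST))
```

The raw difference is 14 hours, so the shorter gap is 24 minus 14, or 10. That falls a little short of the 12 an even two-site split would want, hence "sort of".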

Most smaller companies won't go international. There is way too much paperwork involved at a time when you're supposed to be lean and fast.

I have worked with follow-the-sun exactly once in my career. We had Ops teams in the US East Coast, Poland, and China. It wasn't a true 8/8/8 split, but it was enough of a split that "after hours maintenance" always happened in someone's daytime. It was pretty dang nice. Then we had a layoff round and the Poland office went away. And we fired our Chinese Ops folk to save money, which meant we were waking the US staff up at o-dark-thirty to do maintenance.

I'm conflicted on this advice. On the surface, I totally get the sentiment: keep the annoying shit in everyone's daytime and don't force people to work overnights.

As an industry, we have history with split-shifting and incident response. The night operator used to be a common feature of any company with a computer, the person (or team of people) responsible for loading/unloading tapes, swapping paper for the printer, collating and packaging print-jobs, making sure the batch-jobs ran, calling the SYSOP when things smelled off, and a bunch of other now-forgotten tasks. Most organizations have gotten rid of the night operator for a lot of reasons. The two biggest being:

  1. We've mostly automated the job out of existence. Tapes (if tapes are still in use) are handled by robots. Print-jobs now show up as a PDF in your email. Batch-schedulers are really fancy now, so getting those batch-jobs run is highly automated. Monitoring systems keep track of way more things than we could track in the night operator era.
  2. No one wants to work overnights. Like, no one. At least not enough to easily find a replacement when the one person who does like it decides to leave/retire.

(The second point hit WWU while I was there.)

As an industry, we no longer have a tradition of doing shift-work. The robust expectation is we'll have a day-time job and go home in the evenings. If you offer me an overnight job at +30% pay -- I'll take it for a while, but I'm still job-hunting for a real daytime job. That's not sustainable, which is why on-call is how we're solving the one night operator task we couldn't automate out of existence: incident response.

Everyone needs some way to do incident response, even if they're 15 people with a big idea and a website -- far too small to be doing follow-the-sun rotations. Are they supposed to make it clear that they only guarantee availability during certain hours? I think that idea has some legs, but the site will be negatively compared with the site next door that offers 24/7 availability (at the cost of little sleep for their few engineers).

Changing the idea that Ops-type work always comes with a pager and unknown extra hours attached will take a shit-ton of work. Sea changes like that don't happen naturally. We cross-faded from night operators to on-call rotations due to the changing nature of the role: there wasn't enough work to do on the 11pm to 7am shift to keep someone fully occupied, so we tacked those duties onto the 7am-3pm crew (who now work a nicer 9am to 5pm schedule).

The only way to break the need for on-call in Ops-type roles is to stop making availability promises when you're not staffed to support them with people responding as part of their normal working hours. If your support-desk isn't answering the phone, site availability shouldn't be promised.

It's that or unionizing the entire sector.