March 2015 Archives

There is more to an on-call rotation than a shared calendar with names on it and an agreement to call whoever is on the calendar if something goes wrong.

People are people, and you have to take that into consideration when setting up a rotation. And that means compromise, setting expectations, and consequences for not meeting them. Here are a few policies every rotation should have somewhere. Preferably easy to get to.

The rotation should be published well in advance, and easy to find.

This seems like an obvious thing, but it needs to be said. People need to know in advance when they're going to be obligated to pay attention to work in their usual time off. This allows them to schedule their lives around the on-call schedule, and you, as the on-call manager, will have to deal with fewer shift-swaps as a result. You're looking to avoid...

Um, I forgot I was on-call next week, and I'm going to be in Peru to hike the Andes. *sheepish look.*

This is less likely to happen if the shift schedule is advertised well in advance. For bonus points, add the shift schedules to their work calendars.

(US) Monday Holiday Law is a thing. Don't do shift swaps on Monday.

If you're doing weekly shifts, it's a good idea to not do your shift swap on Monday. Due to the US Monday Holiday Law there are five weeks in a year (10% of the total!) where your shift change will happen on an official holiday. Two of those are days that almost everyone gets off: Labor Day and Memorial Day.
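If you want to check which Mondays are affected in a given year, here is a minimal sketch. The function names are mine, and the holiday list assumes the five US federal holidays that always land on a Monday; swap in your own organization's holiday calendar.

```python
import calendar
from datetime import date


def nth_monday(year, month, n):
    """Return the n-th Monday of a month; n=-1 means the last Monday."""
    mondays = [
        d for d in calendar.Calendar().itermonthdates(year, month)
        if d.month == month and d.weekday() == calendar.MONDAY
    ]
    return mondays[n - 1] if n > 0 else mondays[n]


def monday_holidays(year):
    """US federal holidays that always fall on a Monday (assumed list)."""
    return {
        "Martin Luther King Jr. Day": nth_monday(year, 1, 3),
        "Washington's Birthday": nth_monday(year, 2, 3),
        "Memorial Day": nth_monday(year, 5, -1),
        "Labor Day": nth_monday(year, 9, 1),
        "Columbus Day": nth_monday(year, 10, 2),
    }


if __name__ == "__main__":
    for name, day in monday_holidays(2015).items():
        print(day.isoformat(), name)
```

Five Mondays out of 52 is where the roughly 10% figure comes from.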

Whether or not you need to avoid shift swaps on a non-work day depends a lot on how hand-offs work for your organization.

Set shift-handoff expectations.

When one watch-stander is relieved by the next, there needs to be a handoff. For some organizations it could be as simple as making sure the other person is there and responsive before stepping down. For others, it can be complicated as they have more state to transfer. State such as:

  • Ongoing issues being tracked.
  • Hardware replacements due during the next shift period.
  • Maintenance tasks not completed by the outgoing watch-stander.
  • Escalation engineers that won't be available.

And so on. If your organization has state to transfer, be sure you have a policy in place to ensure it is transferred.

Acknowledge time must be defined for alarms.

The maximum time a watch-stander is allowed to wait before ACKing an alarm must be defined by policy, and failure to meet that must be noticed. If the ACK time expires, the alarm should escalate to the next tier of on-call.

This is a very critical policy to define, as it allows watch-standers to predict how much life they can have while on-call. If the ACK time is 10 minutes, it means that if the watch-stander is driving they have to pull over to ACK the alarm. A 10-minute ACK time likely means they can't do anything long, like go to the movies or their kid's band recital.

This also goes for people who are on the escalation schedule. It may be different times, but it should still be defined.
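To make the escalation rule concrete, here is a small sketch of an ACK policy written down as data. The tier names and the minute values are made up for illustration; the only idea taken from the policy above is that each tier gets a defined ACK window, and an unacknowledged alarm moves to the next tier when that window expires.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str
    ack_minutes: int  # maximum time this tier has to ACK before escalation


def who_is_paged(tiers, minutes_since_alarm):
    """Return the tier currently holding an un-ACKed alarm.

    Returns None once every tier's ACK window has expired.
    """
    deadline = 0
    for tier in tiers:
        deadline += tier.ack_minutes
        if minutes_since_alarm < deadline:
            return tier.name
    return None


# Hypothetical policy: values are examples, not recommendations.
policy = [Tier("primary", 10), Tier("secondary", 15), Tier("manager", 30)]

if __name__ == "__main__":
    for minute in (0, 12, 30, 60):
        print(minute, who_is_paged(policy, minute))
```

Writing the windows down this way also makes it obvious to the people on the escalation schedule what their own ACK time is.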

Response time must be defined for alarms.

The time to first response, actually working on the reported problem, must be defined by policy, and failure to meet that must be noticed. This may vary based on the type of alarm received, but it should still be defined for each alarm.

This is another very critical policy to define, and it has an even bigger impact on a watch-stander's ability to do anything other than be at work. The watch-stander pretty much has to stay within N minutes of a reliable Internet connection for their entire watch. If their commute to and from work is longer than N minutes, they can't be on-watch during their commute. And other things.

  • If 10 minutes or less, it is nearly impossible to do anything out of the house. Such as picking up prescriptions, going to the kid's band recital, soccer games, and theatre rehearsals.
  • If 20 minutes or less, quick errands out of the house may be doable, but that's about it.

As with ACK time, this should also be defined for the people on the escalation schedule.

Consequences for escalations must be defined.

People are people, and sometimes we can't get to the phone for some reason. Escalations will happen for both good reasons (ER visits) and bad (slept through it). The severity of a missed alarm varies based on the organization and what got missed, so this will be a highly localized policy. If an alarm is missed, or a pattern of missed alarms develops, there should be defined consequences.

This is the kind of thing that can be brought up in quarterly/annual reviews.

This gets its own section because it's very important:

The right shift length

How long your shifts should be is a function of several variables:

  • ACK time.
  • Response time.
  • Frequency of alarms.
  • Average time-to-resolution (TTR) for the alarms.

First and foremost: if your alarm frequency and TTR are high enough that alarms take up 30% or more of a watch-stander's attention, you don't have an on-call rotation; you have a distributed NOC, and being on watch is a full-time job. Don't expect them to do anything else, and pay them like they're at work.
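One way to put a number on "share of attention" is mean TTR divided by the average time between alarms. That formula is my assumption about what the 30% figure means, and the function names are hypothetical, but it gives a quick sanity check:

```python
def attention_fraction(minutes_between_alarms, mean_ttr_minutes):
    """Rough share of on-call time spent working alarms, capped at 1.0."""
    if minutes_between_alarms <= 0:
        raise ValueError("minutes_between_alarms must be positive")
    return min(1.0, mean_ttr_minutes / minutes_between_alarms)


def is_distributed_noc(minutes_between_alarms, mean_ttr_minutes, threshold=0.30):
    """True when alarms eat at least `threshold` of the watch-stander's time."""
    return attention_fraction(minutes_between_alarms, mean_ttr_minutes) >= threshold


if __name__ == "__main__":
    # An alarm every hour with a 20 minute mean TTR is a third of the shift.
    print(is_distributed_noc(60, 20))
    # An alarm every four hours with the same TTR is well under the line.
    print(is_distributed_noc(240, 20))
```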

People need to sleep, and watch-standers are more effective if they're not sleep-deprived. What's more, being sleep-deprived is incredibly fatiguing, so schedules that promote such fatigue are more likely to have people quit. Here is a table I made showing the combinations of alarm frequency and mean TTR, showing on average how many minutes a watch-stander will have between periods of on-demand attention:

Minutes between periods of on-demand attention, by alarm frequency (rows) and Time To Resolution (TTR, columns):

Alarm Freq    < 5 min    5 to 10 min    10 to 20 min    up to 30 min    up to 60 min
15 min        10         5              0               0               0
30 min        25         20             10              0               0
60 min        55         50             40              30              0
90 min        85         80             70              60              30
2 hour        115        110            100             90              60
4 hour        235        230            220             210             180
6 hour        355        350            340             330             300
8 hour        475        470            460             450             420

Legend:

  • No sleep possible (dark orange)
  • Quick naps possible (light orange)
  • Restful sleep possible
  • Uninterrupted sleep possible
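The figures in the table appear to follow a simple rule: take the time between alarms, subtract the upper bound of the TTR bucket, and floor the result at zero. Here is a minimal sketch under that assumption, so you can plug in your own numbers. The constant and function names are mine.

```python
# Alarm intervals (rows) and TTR bucket upper bounds (columns), in minutes.
ALARM_INTERVALS = [15, 30, 60, 90, 120, 240, 360, 480]
TTR_UPPER_BOUNDS = [5, 10, 20, 30, 60]


def slack_minutes(alarm_interval, ttr):
    """Minutes of quiet time left between alarms, never below zero."""
    return max(0, alarm_interval - ttr)


def slack_table():
    """Map each alarm interval to its row of slack values."""
    return {
        interval: [slack_minutes(interval, ttr) for ttr in TTR_UPPER_BOUNDS]
        for interval in ALARM_INTERVALS
    }


if __name__ == "__main__":
    for interval, row in slack_table().items():
        print(f"{interval:>4} min", row)
```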

Given that the average sleep cycle is 45 minutes, any combination with less slack than that is a shift where the watch-stander will not be able to sleep restfully. This has been colored dark orange. If you allow people time to get to sleep, say 20 minutes, and add that to the 45 minute sleep cycle, that locks out anything at or under 70 minutes (colored light orange). For people who don't go to sleep easily (such as me), even the 80 and 85 minute slack times would be too little. The next tier is where it's probable you could get one or two sleep cycles in before having to wake up again; this is restful sleep. Finally, we have the tier where you may get sleep uninterrupted by work, the most restful sort.
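The two orange thresholds can be written down as a tiny classifier. This is only a sketch: the 45 and 70 minute cut-offs come from the paragraph above, but I haven't given a number for where restful sleep turns into uninterrupted sleep, so the code lumps those two tiers together rather than invent one.

```python
SLEEP_CYCLE_MINUTES = 45   # below this, no full sleep cycle fits
NAP_CEILING_MINUTES = 70   # at or under this, only quick naps fit


def sleep_tier(slack_minutes):
    """Classify a slack time (in minutes) into a sleep tier."""
    if slack_minutes < SLEEP_CYCLE_MINUTES:
        return "no sleep possible"
    if slack_minutes <= NAP_CEILING_MINUTES:
        return "quick naps possible"
    return "restful or uninterrupted sleep possible"


if __name__ == "__main__":
    for slack in (0, 40, 60, 70, 85, 210):
        print(slack, sleep_tier(slack))
```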

Keep in mind the only-sort-of-random nature of alarms. Sometimes they arrive in clusters. Sometimes they are random, with quite a bit of variation which makes 'average' something you only sometimes see. If your average is 3 hours between alarms (180 minutes), alarms may commonly show up 20 minutes after a previous one, and 5 hours after. There will be nights where no sleep can be found, even though your average is long enough to theoretically support it.
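To get a feel for how far real gaps stray from the average, here is a small simulation. It assumes alarms arrive independently at a constant average rate (exponentially distributed gaps), which is a simplification: it does not model clustering at all, so real nights can be worse. The function name and seed are arbitrary.

```python
import random


def simulate_gaps(mean_minutes, count, seed=0):
    """Draw `count` gaps between alarms, exponentially distributed."""
    rng = random.Random(seed)
    return [rng.expovariate(1 / mean_minutes) for _ in range(count)]


def share(gaps, predicate):
    """Fraction of gaps for which predicate(gap) is true."""
    return sum(1 for g in gaps if predicate(g)) / len(gaps)


if __name__ == "__main__":
    gaps = simulate_gaps(mean_minutes=180, count=10_000)
    print("mean gap:", round(sum(gaps) / len(gaps)))
    print("share of gaps of 20 minutes or less:", share(gaps, lambda g: g <= 20))
    print("share of gaps of 5 hours or more:", share(gaps, lambda g: g >= 300))
```

Under this model with a 180 minute average, roughly one gap in ten is 20 minutes or shorter and nearly one in five is 5 hours or longer, which matches the intuition that "average" is something you only sometimes see.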

If you're in the orange areas, shifts shouldn't be longer than a day. And probably should be half-days.

If you're near the orange areas, you probably should not have a week of that kind of thing, so shift lengths should be less than 7 days.

Shift lengths longer than these guidelines risk burnout and create rage-demons. We have too many rage-demons as it is, so please have a heart.

I believe this policy set provides the groundwork for a well-defined on-call rotation. The on-call engineers know what is expected of them, and know the or-else if they don't live up to it.

This showed up today.

I get that. The little white lie that it's all right, I wasn't offended. The lying silence where the, "check that bullshit," should have been. The desire to belong to the in-group (or an in-group, even if it's an in-group of one) is probably baked into our genetics. Those that arbitrate membership in the in-group set the standards by which membership is granted. So long as there is power there, the little internal betrayals needed to achieve membership, or if that isn't possible, satellite membership, can be justified.

For a while. Until the price starts getting too high.

If the in-group is in all of the positions of both power and employee redress? That's a lot of incentive to shut the fuck up and laugh like you mean it.

And if you keep poking at it, because shutting the fuck up and laughing already is becoming very hard, you lose in-group status.

This is a very human progression; we've been doing it since pre-history. The modern workplace is supposed to be set up to deal with toxic managers and hostile work environments, but cronyism is incredibly corrosive. It takes active push-back to fend off, and if the corruption is deep enough, that just costs you your job.

Most corporate severance agreements include something called a non-disparagement clause, which means, in effect:

The severed employee agrees to not say bad things about the Company, or cause material harm to the Company's business through their actions.

And accusing a manager of being a harassing asshole is the kind of thing that could trigger that clause. By telling the world about her experience with this manager, naming names, and calling out the toxic culture of that particular work-unit, she can be considered to be causing 'material harm' and could face serious legal consequences. If Google wants to be assholes about it, of course. But the language is there in the agreement specifically to scare ex-employees out of doing things like this.

The internal system was stacked against her, and the court of public opinion was also stacked against her by the very company that had the bad culture.

I'm guilty of making the same kind of calculations. I didn't seek in-group status as firmly as Kelly did, and it got me fired in the end. It turned out well for me, but was pretty traumatic at the time.

While I was there I did consciously choose to not call out jokes, behavior, or other things that offended me, specifically because I needed to stay on good terms with the in-group. I never got to crying, but the little niggling things did add up. It meant I didn't stay long at company events, didn't follow on after-work outings to bars, and generally stayed quiet a lot of the time. It was noticed.

Encryption is hard

I've run into this workflow problem before, but it happened again so I'm sharing.

We have a standard.

No passwords in plain-text. If passwords need to be emailed, the email will be encrypted with S/MIME.

Awesome. I have certificates, and so do my coworkers. Should be awesome!

To: coworker
From: me
Subject: Anti-spam appliance password

[The content can't be displayed because the S/MIME control isn't available]

Standard followed, mischief managed.

To: me
From: coworker
Subject: RE: Anti-spam appliance password
Thanks! Worked great.

To: coworker
From: me
uid: admin1792
pw: 92*$&diq38yljq3


Encryption is hard. It would be awesome if a certain mail-client defaulted to replying-in-kind to encrypted emails. But it doesn't, and users have to remember to click the button. Which they never do.
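For what it's worth, the "reply in kind" check is not complicated. Here is a rough sketch using Python's standard `email` module, not anything a particular mail client actually does. It assumes an S/MIME encrypted message can be recognized by its `application/pkcs7-mime` content type with an `smime-type` of `enveloped-data` (or `authEnveloped-data`); the function names are mine.

```python
from email import message_from_string

SMIME_CONTENT_TYPES = {"application/pkcs7-mime", "application/x-pkcs7-mime"}
ENCRYPTED_SMIME_TYPES = {"enveloped-data", "authenveloped-data"}


def is_smime_encrypted(msg):
    """True if the message looks like an S/MIME encrypted (enveloped) message."""
    if msg.get_content_type() not in SMIME_CONTENT_TYPES:
        return False
    smime_type = msg.get_param("smime-type", "")
    return smime_type.lower() in ENCRYPTED_SMIME_TYPES


def reply_leaks_plaintext(original, reply):
    """True if an encrypted message is about to be answered in the clear."""
    return is_smime_encrypted(original) and not is_smime_encrypted(reply)


if __name__ == "__main__":
    original = message_from_string(
        'Content-Type: application/pkcs7-mime; smime-type=enveloped-data; name="smime.p7m"\n'
        "\n"
        "AAAA"
    )
    reply = message_from_string("Content-Type: text/plain\n\nThanks! Worked great.")
    if reply_leaks_plaintext(original, reply):
        print("Warning: replying in plain text to an encrypted message.")
```

A client that ran a check like this before sending could at least nag the user, instead of relying on them to remember the button.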
