There is more to an on-call rotation than a shared calendar with names on it and an agreement to call whoever is on the calendar if something goes wrong.
People are people, and you have to take that into consideration when setting up a rotation. That means compromise, setting expectations, and consequences for not meeting them. Here are a few policies every rotation should have written down somewhere, preferably somewhere easy to get to.
The rotation should be published well in advance, and easy to find.
This seems like an obvious thing, but it needs to be said. People need to know in advance when they're going to be obligated to pay attention to work in their usual time off. This allows them to schedule their lives around the on-call schedule, and you, as the on-call manager, will have to deal with fewer shift-swaps as a result. You're looking to avoid...
Um, I forgot I was on-call next week, and I'm going to be in Peru to hike the Andes. *sheepish look.*
This is less likely to happen if the shift schedule is advertised well in advance. For bonus points, add the shift schedules to their work calendars.
(US) Monday Holiday Law is a thing. Don't do shift swaps on Monday.
If you're doing weekly shifts, it's a good idea not to change shifts on Monday. Due to the US Monday Holiday Law there are five weeks in a year (about 10% of the total!) where your shift change will happen on an official holiday. Two of those are days that almost everyone gets off: Labor Day and Memorial Day.
Whether or not you need to avoid shift changes on a non-work day depends a lot on how hand-offs work for your organization.
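If you want to see exactly which Mondays are affected in a given year, here is a small Python sketch that computes the five US federal holidays that always land on a Monday. The function names are my own, made up for illustration; check the result against your organization's actual holiday calendar.

```python
from datetime import date, timedelta


def nth_monday(year: int, month: int, n: int) -> date:
    """Return the n-th Monday (1-based) of the given month."""
    first = date(year, month, 1)
    # weekday(): Monday == 0, so this is the offset to the first Monday.
    offset = (0 - first.weekday()) % 7
    return first + timedelta(days=offset + 7 * (n - 1))


def last_monday(year: int, month: int) -> date:
    """Return the last Monday of the given month."""
    if month == 12:
        next_month = date(year + 1, 1, 1)
    else:
        next_month = date(year, month + 1, 1)
    last_day = next_month - timedelta(days=1)
    # Walk back from the last day of the month to the nearest Monday.
    return last_day - timedelta(days=last_day.weekday())


def monday_holidays(year: int) -> dict[str, date]:
    """US federal holidays that are always observed on a Monday."""
    return {
        "Martin Luther King Jr. Day": nth_monday(year, 1, 3),
        "Washington's Birthday": nth_monday(year, 2, 3),
        "Memorial Day": last_monday(year, 5),
        "Labor Day": nth_monday(year, 9, 1),
        "Columbus Day": nth_monday(year, 10, 2),
    }


if __name__ == "__main__":
    for name, day in monday_holidays(2024).items():
        print(f"{day.isoformat()}  {name}")
```

Any weekly rotation that changes hands on one of these dates is asking someone to do a handoff on a holiday.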
Set shift-handoff expectations.
When one watch-stander is relieved by the next, there needs to be a handoff. For some organizations it could be as simple as making sure the other person is there and responsive before stepping down. For others it can be more complicated, because there is more state to transfer. State such as:
- Ongoing issues being tracked.
- Hardware replacements due during the next shift period.
- Maintenance tasks not completed by the outgoing watch-stander.
- Escalation engineers that won't be available.
And so on. If your organization has state to transfer, be sure you have a policy in place to ensure it is transferred.
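One way to make sure the state actually gets transferred is to give it a fixed shape. The sketch below is only an illustration: the `HandoffNote` class and its field names are hypothetical, chosen to mirror the list above, and the rule that a handoff is not done until the incoming person confirms is my assumption about what such a policy might say.

```python
from dataclasses import dataclass, field


@dataclass
class HandoffNote:
    """State passed from the outgoing watch-stander to the incoming one."""
    outgoing: str
    incoming: str
    ongoing_issues: list[str] = field(default_factory=list)
    hardware_replacements_due: list[str] = field(default_factory=list)
    unfinished_maintenance: list[str] = field(default_factory=list)
    unavailable_escalation_engineers: list[str] = field(default_factory=list)
    acknowledged_by_incoming: bool = False

    def is_complete(self) -> bool:
        # Assumed policy: the outgoing person stays on watch until
        # the incoming person has confirmed they read the note.
        return self.acknowledged_by_incoming

    def summary(self) -> str:
        """Render the note as plain text for email or chat."""
        sections = [
            ("Ongoing issues", self.ongoing_issues),
            ("Hardware replacements due", self.hardware_replacements_due),
            ("Unfinished maintenance", self.unfinished_maintenance),
            ("Unavailable escalation engineers",
             self.unavailable_escalation_engineers),
        ]
        lines = [f"Handoff: {self.outgoing} -> {self.incoming}"]
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"  - {item}" for item in (items or ["none"]))
        return "\n".join(lines)


if __name__ == "__main__":
    note = HandoffNote("alex", "sam", ongoing_issues=["db replica lag"])
    print(note.summary())
```

Printing "none" for empty sections is deliberate: it shows the outgoing person considered the category rather than forgot it.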
Acknowledge time must be defined for alarms.
The maximum time a watch-stander is allowed to wait before ACKing an alarm must be defined by policy, and failure to meet that must be noticed. If the ACK time expires, the alarm should escalate to the next tier of on-call.
This is a critical policy to define, as it allows watch-standers to predict how much life they can have while on-call. If the ACK time is 10 minutes, a watch-stander who is driving has to pull over to ACK the alarm. A 10 minute ACK time likely means they can't do anything long, like go to the movies or their kid's band recital.
This also goes for people who are on the escalation schedule. Their times may be different, but they should still be defined.
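To make the escalation rule concrete, here is a minimal Python sketch of an ACK-timeout chain. The tier names and minute values are hypothetical, and I'm assuming each tier's clock starts when the previous tier's ACK time expires; your paging system may count differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    ack_minutes: int  # maximum time this tier has to ACK the alarm


def tier_to_page(tiers: list[Tier], minutes_unacked: float) -> Optional[Tier]:
    """Return the tier that currently holds an unacknowledged alarm.

    Returns None once every tier's ACK time has expired, which is
    the point where a human needs to notice that nobody answered.
    """
    deadline = 0.0
    for tier in tiers:
        deadline += tier.ack_minutes
        if minutes_unacked < deadline:
            return tier
    return None


if __name__ == "__main__":
    # Hypothetical escalation schedule.
    chain = [Tier("primary", 10), Tier("secondary", 15), Tier("manager", 30)]
    for minutes in (0, 10, 25, 55):
        holder = tier_to_page(chain, minutes)
        print(minutes, holder.name if holder else "nobody left to page")
```

Writing the chain down this way also forces the question of what happens when the last tier doesn't answer either.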
Response time must be defined for alarms.
The time to first response, actually working on the reported problem, must be defined by policy, and failure to meet that must be noticed. This may vary based on the type of alarm received, but it should still be defined for each alarm.
This is another critical policy to define, and it has even more impact on the watch-stander's ability to do anything other than be at work. If the response time is N minutes, the watch-stander pretty much has to stay within N minutes of a reliable Internet connection for their entire watch. If their commute to and from work is longer than N minutes, they can't be on-watch during their commute. The same goes for everything else in their life:
- If 10 minutes or less, it is nearly impossible to do anything out of the house, such as picking up prescriptions or going to the kid's band recital, soccer games, and theatre rehearsals.
- If 20 minutes or less, quick errands out of the house may be doable, but that's about it.
As with ACK time, this should also be defined for the people on the escalation schedule.
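A response-time policy can be as simple as a lookup table. The sketch below assumes one limit per alarm type; the alarm names and minute values are invented for illustration and are not a recommendation.

```python
# Hypothetical alarm types and limits; substitute your own policy.
RESPONSE_MINUTES = {
    "site_down": 10,
    "disk_filling": 20,
    "cert_expiring": 60,
}


def response_breached(alarm_type: str, minutes_to_first_response: float) -> bool:
    """True if work on the alarm started later than policy allows.

    A KeyError here means the alarm type has no defined response
    time, which is itself a policy gap worth fixing.
    """
    return minutes_to_first_response > RESPONSE_MINUTES[alarm_type]


def max_minutes_from_keyboard(policy: dict[str, int]) -> int:
    """How far the watch-stander can stray from a reliable connection.

    The tightest limit wins, because any alarm type could fire.
    """
    return min(policy.values())


if __name__ == "__main__":
    print(response_breached("site_down", 12))           # True
    print(max_minutes_from_keyboard(RESPONSE_MINUTES))  # 10
```

The second function is the one watch-standers care about: it is the N that decides whether the commute, the pharmacy run, or the recital is possible.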
Consequences for escalations must be defined.
People are people, and sometimes we can't get to the phone for some reason. Escalations will happen for both good reasons (ER visits) and bad (slept through it). The severity of a missed alarm varies based on the organization and what got missed, so this will be a highly localized policy. If an alarm is missed, or a pattern of missed alarms develops, there should be defined consequences.
This is the kind of thing that can be brought up in quarterly/annual reviews.
This gets its own section because it's very important:
The right shift length
How long your shifts should be is a function of several variables:
- ACK time.
- Response time.
- Frequency of alarms.
- Average time-to-resolution (TTR) for the alarms.
First and foremost: if your alarm frequency and TTR are high enough to take up 30% or more of a watch-stander's attention, you don't have an on-call rotation; you have a distributed NOC, and being on watch is a full-time job. Don't expect watch-standers to do anything else, and pay them like they're at work.
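As a rough check, the share of attention can be estimated as mean TTR divided by the mean time between alarms. That formula is my simplification (it ignores ACK time and the cost of context-switching), and the function names are hypothetical.

```python
def attention_fraction(minutes_between_alarms: float, ttr_minutes: float) -> float:
    """Estimate the fraction of a watch spent actively working alarms."""
    if minutes_between_alarms <= 0:
        raise ValueError("minutes_between_alarms must be positive")
    # Cap at 1.0: you can't spend more than all of your time on alarms.
    return min(1.0, ttr_minutes / minutes_between_alarms)


def is_distributed_noc(minutes_between_alarms: float, ttr_minutes: float,
                       threshold: float = 0.30) -> bool:
    """True when alarms eat enough attention that on-call is a full-time job."""
    return attention_fraction(minutes_between_alarms, ttr_minutes) >= threshold


if __name__ == "__main__":
    # An alarm every hour that takes 20 minutes to resolve is a third of the watch.
    print(is_distributed_noc(60, 20))   # True
    # An alarm every four hours that takes 10 minutes is about 4%.
    print(is_distributed_noc(240, 10))  # False
```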
People need to sleep, and watch-standers are more effective if they're not sleep-deprived. What's more, being sleep-deprived is incredibly fatiguing, so schedules that promote such fatigue are more likely to have people quit. Here is a table I made of the combinations of alarm frequency and mean TTR, showing how many minutes, on average, a watch-stander will have between periods of on-demand attention:
| | Time To Resolution (TTR) |
| Alarm Freq | < 5 min | 5 to 10 min | 10 to 20 min | up to 30 min | up to 60 min |

Legend:
- No sleep possible
- Quick naps possible
- Restful sleep possible
- Uninterrupted sleep possible
Given that the average sleep cycle is 45 minutes, any combination with less slack than that is a shift in which the watch-stander will not be able to sleep restfully. This has been colored dark orange. If you allow people time to get to sleep, say 20 minutes, and add that to the 45 minute sleep cycle, that locks out anything at or under 70 minutes (colored light orange). For people who don't go to sleep easily (such as me), even the 80 and 85 minute slack times would be too little. The next tier is where it's probable you could get one or two sleep cycles in before having to wake up again; this is restful sleep. Finally, we have the tier where you may get uninterrupted (by work) sleep, the most restful sort.
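The tiers can be sketched as a small Python function. Two things here are my assumptions rather than anything stated above: that slack is the time between alarms minus the TTR, and that "uninterrupted" starts at 8 hours of slack. The 45 and 70 minute cut-offs come straight from the paragraph above.

```python
SLEEP_CYCLE_MINUTES = 45        # from the text
QUICK_NAP_CEILING_MINUTES = 70  # from the text: at or under 70 is locked out
UNINTERRUPTED_FLOOR_MINUTES = 8 * 60  # assumption: a full night between alarms


def slack_minutes(minutes_between_alarms: float, ttr_minutes: float) -> float:
    """Assumed definition: quiet time left after resolving each alarm."""
    return max(0.0, minutes_between_alarms - ttr_minutes)


def sleep_tier(slack: float) -> str:
    """Classify average slack into the four tiers from the legend."""
    if slack < SLEEP_CYCLE_MINUTES:
        return "No sleep possible"
    if slack <= QUICK_NAP_CEILING_MINUTES:
        return "Quick naps possible"
    if slack < UNINTERRUPTED_FLOOR_MINUTES:
        return "Restful sleep possible"
    return "Uninterrupted sleep possible"


if __name__ == "__main__":
    for interval, ttr in [(60, 20), (90, 20), (90, 5), (720, 10)]:
        slack = slack_minutes(interval, ttr)
        print(f"every {interval} min, TTR {ttr} min -> "
              f"{slack:.0f} min slack: {sleep_tier(slack)}")
```

If you fall asleep slowly, raise the 70 minute ceiling to whatever matches your own experience.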
Keep in mind the only-sort-of-random nature of alarms. Sometimes they arrive in clusters. Sometimes they are random, with quite a bit of variation which makes 'average' something you only sometimes see. If your average is 3 hours between alarms (180 minutes), alarms may commonly show up 20 minutes after a previous one, and 5 hours after. There will be nights where no sleep can be found, even though your average is long enough to theoretically support it.
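To put rough numbers on that, here is a sketch that assumes alarms arrive independently at a constant average rate (a Poisson process, so the gaps between alarms are exponentially distributed). That is a modelling assumption of mine; real alarms cluster, so reality is likely to be worse than this.

```python
import math


def prob_gap_at_most(minutes: float, mean_gap_minutes: float) -> float:
    """P(next alarm arrives within `minutes`), for exponential gaps."""
    return 1.0 - math.exp(-minutes / mean_gap_minutes)


def prob_gap_at_least(minutes: float, mean_gap_minutes: float) -> float:
    """P(at least `minutes` of quiet before the next alarm)."""
    return math.exp(-minutes / mean_gap_minutes)


if __name__ == "__main__":
    mean_gap = 180  # the 3 hour average from the text
    print(f"gap of 20 min or less: {prob_gap_at_most(20, mean_gap):.3f}")
    print(f"gap of 5 hours or more: {prob_gap_at_least(300, mean_gap):.3f}")
```

With a 180 minute average, this model puts roughly one gap in ten at 20 minutes or less, and a bit under one in five at 5 hours or more. Both extremes are routine, which is why the average alone tells you little about any particular night.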
If you're in the orange areas, shifts shouldn't be longer than a day, and probably should be half-days.
If you're near the orange areas, you probably should not have a week of that kind of thing, so shift lengths should be less than 7 days.
Shift lengths longer than these guidelines risk burnout and create rage-demons. We have too many rage-demons as it is, so please have a heart.
I believe this policy set provides the groundwork for a well-defined on-call rotation. The on-call engineers know what is expected of them, and know the or-else if they don't live up to it.