There is more to an on-call rotation than a shared calendar with names on it and an agreement to call whoever is on the calendar if something goes wrong.

People are people, and you have to take that into consideration when setting up a rotation. And that means compromise, setting expectations, and consequences for not meeting them. Here are a few policies every rotation should have somewhere. Preferably easy to get to.

The rotation should be published well in advance, and easy to find.

This seems like an obvious thing, but it needs to be said. People need to know in advance when they're going to be obligated to pay attention to work in their usual time off. This allows them to schedule their lives around the on-call schedule, and you, as the on-call manager, will have to deal with fewer shift-swaps as a result. You're looking to avoid...

Um, I forgot I was on-call next week, and I'm going to be in Peru to hike the Andes. *sheepish look.*

This is less likely to happen if the shift schedule is advertised well in advance. For bonus points, add the shift schedules to their work calendars.

(US) Monday Holiday Law is a thing. Don't do shift swaps on Monday.

If you're doing weekly shifts, it's a good idea to not do your shift swap on Monday. Due to the US Monday Holiday Law there are five weeks in a year (10% of the total!) where your shift change will happen on an official holiday. Two of those are days that almost everyone gets off: Labor Day and Memorial Day.
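To see which Mondays collide with a weekly hand-off in a given year, the five holidays can be computed from their "nth Monday" rules. This is a minimal sketch; the function names are made up for illustration, and it only covers the five federal holidays that always fall on a Monday.

```python
import calendar
from datetime import date


def nth_monday(year, month, n):
    """Return the n-th Monday of a month; n=-1 means the last Monday."""
    mondays = [
        d for d in calendar.Calendar().itermonthdates(year, month)
        if d.month == month and d.weekday() == calendar.MONDAY
    ]
    return mondays[n - 1] if n > 0 else mondays[n]


def monday_holidays(year):
    """US federal holidays that always land on a Monday."""
    return {
        "Martin Luther King Jr. Day": nth_monday(year, 1, 3),
        "Washington's Birthday": nth_monday(year, 2, 3),
        "Memorial Day": nth_monday(year, 5, -1),
        "Labor Day": nth_monday(year, 9, 1),
        "Columbus Day": nth_monday(year, 10, 2),
    }


if __name__ == "__main__":
    # List the Mondays to keep clear of hand-offs for one year.
    for name, day in sorted(monday_holidays(2024).items(), key=lambda kv: kv[1]):
        print(day.isoformat(), name)
```

Feeding the result into whatever generates your rotation calendar lets you flag hand-offs that land on one of these dates before the schedule is published.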

Whether or not you need to avoid shift swaps on a non-work day depends a lot on how hand-offs work for your organization.

Set shift-handoff expectations.

When one watch-stander is relieved by the next, there needs to be a handoff. For some organizations it could be as simple as making sure the other person is there and responsive before stepping down. For others, it can be complicated as they have more state to transfer. State such as:

  • Ongoing issues being tracked.
  • Hardware replacements due during the next shift period.
  • Maintenance tasks not completed by the outgoing watch-stander.
  • Escalation engineers that won't be available.

And so on. If your organization has state to transfer, be sure you have a policy in place to ensure it is transferred.

Acknowledge time must be defined for alarms.

The maximum time a watch-stander is allowed to wait before ACKing an alarm must be defined by policy, and failure to meet that must be noticed. If the ACK time expires, the alarm should escalate to the next tier of on-call.

This is a very critical policy to define, as it allows watch-standers to predict how much life they can have while on-call. If the ACK time is 10 minutes, a watch-stander who is driving has to pull over to ACK the alarm. A 10-minute ACK time likely means they can't do anything long, like go to the movies or their kid's band recital.

This also goes for people who are on the escalation schedule. It may be different times, but it should still be defined.
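One way to picture the ACK policy is as a list of tiers, each with its own ACK window, where an un-ACKed alarm rolls to the next tier when the window runs out. The sketch below is an illustration only: the `Tier` class, the tier names, and the timeouts are hypothetical, and real paging systems have their own configuration formats.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str
    ack_timeout_min: int  # maximum minutes this tier has to ACK


def who_is_paged(tiers, minutes_since_alarm):
    """Return the tier holding an un-ACKed alarm, or None if every window expired."""
    deadline = 0
    for tier in tiers:
        deadline += tier.ack_timeout_min
        if minutes_since_alarm < deadline:
            return tier
    return None  # nobody ACKed in time; this is the case policy must notice


# Hypothetical policy: primary gets 10 minutes, the escalation tier gets 15.
POLICY = [Tier("primary", 10), Tier("escalation", 15)]
```

With this policy, an alarm that has gone 12 minutes without an ACK belongs to the escalation tier, and the primary's miss is on record.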

Response time must be defined for alarms.

The time to first response, actually working on the reported problem, must be defined by policy, and failure to meet that must be noticed. This may vary based on the type of alarm received, but it should still be defined for each alarm.

This is another very critical policy to define, and it has even more impact on a watch-stander's ability to do anything other than be at work. The watch-stander pretty much has to stay within N minutes of a reliable Internet connection for their entire watch. If their commute to and from work is longer than N minutes, they can't be on-watch during their commute. The same goes for anything else that takes them away from a keyboard:

  • If 10 minutes or less, it is nearly impossible to do anything out of the house, such as picking up prescriptions or going to the kid's band recital, soccer games, or theatre rehearsals.
  • If 20 minutes or less, quick errands out of the house may be doable, but that's about it.

As with ACK time, this should also be defined for the people on the escalation schedule.

Consequences for escalations must be defined.

People are people, and sometimes we can't get to the phone for some reason. Escalations will happen for both good reasons (ER visits) and bad (slept through it). The severity of a missed alarm varies based on the organization and what got missed, so this will be a highly localized policy. If an alarm is missed, or a pattern of missed alarms develops, there should be defined consequences.

This is the kind of thing that can be brought up in quarterly/annual reviews.


This gets its own section because it's very important:

The right shift length

How long your shifts should be is a function of several variables:

  • ACK time.
  • Response time.
  • Frequency of alarms.
  • Average time-to-resolution (TTR) for the alarms.

People need to sleep, and watch-standers are more effective if they're not sleep-deprived. Chronically sleep-deprived people have poor job satisfaction and are more likely to leave for greener pastures. Here is a table I made showing the combinations of alarm frequency and TTR, and showing on average how many minutes a watch-stander will have between periods of on-demand attention:

                                        TTR
Alarm Freq   < 5 min   5 to 10 min   10 to 20 min   up to 30 min   up to 60 min
15 min            10             5              0              0              0
30 min            25            20             10              0              0
60 min            55            50             40             30              0
90 min            85            80             70             60             30
2 hour           115           110            100             90             60
4 hour           235           230            220            210            180
6 hour           355           350            340            330            300

Given that the average sleep cycle is 45 minutes, any combination that has less than that will be a shift during which the watch-stander will not be able to sleep. This has been colored dark orange. If you allow people time to get to sleep, say 20 minutes, that locks out anything at or under 70 minutes (colored light orange). For people who don't go to sleep easily (such as me) even the 80 and 85 minute slack times would be too little. The white cells are shift combinations where sleeping while on-call is something that could happen.
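The table values appear to follow a simple rule: the alarm interval minus the top of the TTR bucket, floored at zero. Assuming that rule, and using the 45-minute and 70-minute cutoffs from the paragraph above, a small sketch can rate any combination. The function names and rating labels are made up for illustration.

```python
ALARM_INTERVALS_MIN = [15, 30, 60, 90, 120, 240, 360]
TTR_UPPER_BOUNDS_MIN = [5, 10, 20, 30, 60]


def slack_minutes(alarm_interval, ttr):
    """Minutes left between alarms once the average alarm is resolved."""
    return max(0, alarm_interval - ttr)


def sleep_rating(slack, sleep_cycle=45, lockout=70):
    """Rate a slack time using the cutoffs described in the text."""
    if slack < sleep_cycle:
        return "no sleep"        # dark orange
    if slack <= lockout:
        return "marginal"        # light orange
    return "sleep possible"      # white


if __name__ == "__main__":
    # Print one rated row per alarm interval.
    for interval in ALARM_INTERVALS_MIN:
        row = [slack_minutes(interval, ttr) for ttr in TTR_UPPER_BOUNDS_MIN]
        print(interval, [(s, sleep_rating(s)) for s in row])
```

Plugging in your own alarm interval and TTR gives a quick read on which color your rotation lives in.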

If your alarms are frequent enough and your TTR long enough, say 30% or more of a watch-stander's attention, you don't have an on-call rotation; you have a distributed NOC, and being on watch is a full-time job. Don't expect them to do anything else.

If you're in the orange areas, shifts shouldn't be longer than a day. And probably should be shorter.

If you're near the orange areas, you probably should not have a week of that kind of thing, so shift lengths should be less than 7 days.

Shift lengths longer than these guidelines risk burnout and creating rage-demons. We have too many rage-demons as it is, so please have a heart.


I believe this policy set provides the groundwork for a well defined on-call rotation. The on-call engineers know what is expected of them, and know the or-else if they don't live up to it.

This showed up today.

I get that. The little white lie that it's all right, I wasn't offended. The lying silence where the, "check that bullshit," should have been. The desire to belong to the in-group (or an in-group, even if it's an in-group of one) is probably baked into our genetics. Those that arbitrate membership in the in-group set the standards by which membership is granted. So long as there is power there, the little internal betrayals needed to achieve membership, or if that isn't possible, satellite membership, can be justified.

For a while. Until the price starts getting too high.

If the in-group is in all of the positions of both power and employee redress? That's a lot of incentive to shut the fuck up and laugh like you mean it.

And if you keep poking at it, because shutting the fuck up and laughing already is becoming very hard, you lose in-group status.


This is a very human progression; we've been doing it since pre-history. The modern workplace is supposed to be set up to deal with toxic managers and hostile work environments, but cronyism is incredibly corrosive. It takes active push-back to fend off, and if the corruption is deep enough, that just costs you your job.

Most corporate severance agreements include something called a non-disparagement clause, which means, in effect:

The severed employee agrees to not say bad things about the Company, or cause material harm to the Company's business through their actions.

And accusing a manager of being a harassing asshole is the kind of thing that could trigger that clause. By telling the world about her experience with this manager, naming names, and calling out the toxic culture of that particular work-unit, she can be considered to be causing 'material harm' and could face serious legal consequences. If Google wants to be assholes about it, of course. But the language is there in the agreement specifically to scare ex-employees out of doing things like this.

The internal system was stacked against her, and the court of public opinion was also stacked against her by the very company that had the bad culture.


I'm guilty of making the same kind of calculations. I didn't seek in-group status as firmly as Kelly did, and it got me fired in the end. It turned out well for me, but was pretty traumatic at the time.

While I was there I did consciously choose to not call out jokes, behavior, or other things that offended me, specifically because I needed to stay on good terms with the in-group. I never got to crying, but the little niggling things did add up. It meant I didn't stay long at company events, didn't follow on after-work outings to bars, and generally stayed quiet a lot of the time. It was noticed.

Encryption is hard

I've run into this workflow problem before, but it happened again so I'm sharing.


We have a standard.

No passwords in plain-text. If passwords need to be emailed, the email will be encrypted with S/MIME.

Awesome. I have certificates, and so do my coworkers. Should be awesome!

To: coworker
From: me
Subject: Anti-spam appliance password

[The content can't be displayed because the S/MIME control isn't available]

Standard followed, mischief managed.

To: me
From: coworker
Subject: RE: Anti-spam appliance password
Thanks! Worked great.

To: coworker
From: me
uid: admin1792
pw: 92*$&diq38yljq3
https://172.2.245.11/login.cgi

Sigh.

Encryption is hard. It would be awesome if a certain mail-client defaulted to replying-in-kind to encrypted emails. But it doesn't, and users have to remember to click the button. Which they never do.

Ratios

In an effort to better understand the challenges facing the ops team of a particular project here at $DayJob, a project manager asked this question:

How many users per [sysadmin] can our system support?

The poor lead sysadmin on that side of the house swiveled her chair over and said to me, "there is no answer to this question!" And we had a short but spirited discussion about the various ratios to admin staff at the places we've been. Per-user is useless, we agreed. Machine/Instance count per admin? Slightly better. But even then. Between us we compiled a short list of places we've been and places we've read about.

  • Company A: 1000:1 And most of that 1 FTE was parts-monkey to keep the install-base running. The engineer to system ratio was closer to 10K:1. User count: global internet
  • Company B: 200:1 Which was desperately understaffed, as the ops team was frantically trying to keep up with a runaway application and a physical plant that was rotting under the load. User count: most of the US.
  • Company C: 150:1 Which was just right! User count: none, the product was still in development.
  • Company D: 60:1 And the admin was part-time because there wasn't enough work. User count: 200
  • Company E: 40:1 Largely because 25-30 of those 40 systems were one-offs. It was a busy team. Monocultures are for wimps. User count: 20K.

This chart was used to explain to the project manager in question the "it depends" nature of admin staffing levels, and that you can't rely on industry norms to determine the target we should be hitting. Everyone wants to be like Company A. Almost no one gets there.

What are the ratios you've worked with? Let me know @sysadm1138

The sysadmin skills-path.

Tom Limoncelli posted a question today.

What is the modern rite of passage for sysadmins? I want to know.

That's a hard one, but it got me thinking about career-paths and skills development, and how it has changed since I did it. Back when I started, the Internet was just becoming a big source of information. If it wasn't on Usenet, the vendor's web-site might have a posted knowledge-base. You could learn a lot from those. I also learned a lot from other admins I was working with.

One of the big lamentations I hear on ServerFault is that kids these days expect a HOWTO for everything.

Well, they're right. I believe that's because of how friendly bloggers like myself have trained others into finding out how to do stuff. So I posit this progression of skill-set for a budding sysadmin deploying a NewThing.

  1. There is always a checklist if you google hard enough. If that one doesn't work, look for another one.
    • And if that doesn't work, ask a patch of likely experts (or bother the expert in the office) to make one for you. It works sometimes.
    • And if that doesn't work, give up in disgust.
  2. Google for checklists. Find one. Hit a snag. Look for another one. Hit another snag. Titrate between the two to get a good install/config.
    • If that doesn't work, follow the step-1 progression to get a good config. You'll have better luck with the experts this time.
  3. Google for checklists. Find a couple. Analyze them for failure points and look for gotcha-docs. Build a refined procedure to get a good install/config.
  4. Google for checklists. Find a few. Generalize a good install/config procedure out of them and write your own checklist.
    • If it works, blog about it.
  5. Google for checklists. Find a few, and some actual documentation. Make decisions about what settings you need to change and why, based on documentation evidence and other people's experience. Install it.
    • If it works, write it up for the internal wiki.
  6. [Graduation] Plunder support-forums for problem-reports to see where installs have gone wrong. Revise your checklist accordingly.
    • If it works, go to local Meetups to give talks about your deploy experience.

That seems about right. When you get to the point where your first thought about deploying a new thing is, "what can go wrong that I need to know about," you've arrived.

Smartphone ecosystems have definitely reached the level of complexity where we have to worry about hostile apps. And they're following the pattern shown by the Internet over the years in that there are classes of hostile actions:

  • Known/Allowed, also known as ad/revenue streams. App owners have to pay the bills somehow, and purchase fees only go so far.
  • Known/Disallowed, also known as malware following known exploits. For this we have scanners.
  • Unknown, apps doing things they shouldn't, by ways that aren't in the scanners yet. Evil, evil little beasties.

If there is one lesson about information security that has been true since the beginning, it is that it's the victim's fault for getting owned. Really, look at the press following hacks: hacks are entirely the fault of the defending entity for not being good enough. If you just followed accepted security standards, this would never happen. Never mind that transitive trust models in very complex IT infrastructures are nearly impossible to fully secure, especially ones that involve humans; it's still the victim's fault.

Those 'accepted security standards' are somehow lacking in the app-stores, especially Android. It's like the app-owners don't really want you to secure yourself.

What would be very nice in these phone OS security systems would be selectable permission filters. Don't want to allow bluetooth-access to any applications except those you whitelist? Don't want to share your contacts with an app that seemingly has no need for it? A limited version of this is in iOS, but as I'll get to in a moment it only goes so far.

There are two methods of denying access to capabilities, and we already have a good example of this two-tier model in the firewall world:

  • Notifies connections of no-connection.
  • Pretends there is nothing there.

The first method is nice for applications since they learn quickly to stop trying. The second is nice for defenders because it means potential attackers have to wait for timeouts before marking an IP:Port tuple as up/down. When it comes to phones, there are two ways to deal with selectable permissions:

  • Notify the app that they don't have rights to that thing. Apps know they're being banned.
  • Lie to the app and provide a stub service that returns nothing or a simple carrier-signal. Apps will have to do tests to see if they're banned.
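The difference between the two models is easiest to see from the app's side of the call. The sketch below is a toy illustration in Python, not any phone OS's actual API: the function names, the exception, and the contact data are all made up.

```python
class PermissionDenied(Exception):
    """Raised when the OS tells the app outright that access is refused."""


REAL_CONTACTS = ["alice@example.com", "bob@example.com"]  # made-up data


def contacts_notify_model(app_allowed):
    """Model 1: the app is told it is banned."""
    if not app_allowed:
        raise PermissionDenied("contacts access is disabled for this app")
    return list(REAL_CONTACTS)


def contacts_stub_model(app_allowed):
    """Model 2: the app gets a stub answer and can't easily tell it is banned."""
    if not app_allowed:
        return []  # looks the same as a user with an empty address book
    return list(REAL_CONTACTS)
```

Under the first model a banned app can react to the refusal, including by nagging the user. Under the second it has to run its own tests to work out whether the empty answer is real.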

iOS uses the first model. If you've ever seen a, "turn on bluetooth for an enhanced user experience," modal, that's what happened. I believe that Apple standards say that applications have to honor those settings in that they still run and don't quit in a huff over not getting your identity goodies. You may not be able to do much, but they'll still run.

Android currently doesn't have selectable permissions (out of the box; there are some apps that try to provide it). You decide whether or not an app can be allowed to do its full list at the time you install it. This can be problematic, especially if circumstances require that you install certain apps, but you want to disable certain capabilities. Such as having only one phone with both work and email on it, and you'd rather they didn't wipe it when they fire you.

That's where things like XPrivacy can come in handy. This only runs on a rooted device, but it provides the stub-services needed to prevent apps from quitting in a huff over not getting the ability to remove accounts on the device, lie about Bluetooth/NFC/Wifi access and state, or falsify 'network' location data. Things like XPrivacy allow us to provide those very 'accepted security standards' that reduce victim-blaming after incidents. It would be awesome if this came stock, but we can't have everything.

Way back when I first got into Group Policies, which was just after Group Policies were released, one of the things we mooted about the BoF den was a simple thing we could do to tell users that they were on a managed station. What we came up with was pretty simple: manage the desktop background.

No, we didn't put an all-seeing-eye on it. That would be creepy, don't be silly. We used a logo of the company.

It made sense! A simple cue, and we'd save RAM (back in those days the desktop background took more than trivial RAM). We were happy.


It turns out, that's not how you build a happy user-base. By doing so, we told people explicitly, "everything you do can and will be used against you in an HR action." People don't like to be told they're being monitored.

You know who likes to be told they're being monitored like that? No one.

You know who we want to be monitored that way? Prisoners and people likely to become prisoners.

No one wants to be thought of as a prisoner, or likely to be one.

In fact, later GPO guides specifically discouraged doing things like managing the desktop background or theme. It could be done, but... why would you want to? Desktop theme is one very low impact thing on the system and the single biggest thing the user can customize to their preferences. It's a very low challenge to the system to increase user experience by a great amount. Let them customize and don't worry about it.

But still manage their IE zones, certificate enrollment policies, software distribution methods, and event-log reporting.

They can make their jail-cell a pink polka-dot wonder, far better than bare cinder-block! It's still a cell, but without that camera in their face, they're happier about living in it.


It looks like consumer-focused big-data stuff is suffering the same faults as early GPOs did: they're being too obvious about the surveillance.

"Hello, Mister ${mispronounced last name}," said the sales-clerk I'd never met before. I sighed in resignation, vowing to factory reset my cell-phone. Again. One of these days I'm just going to go cash only.

Or another one I almost guarantee will happen:

TSA Customer Service
@sysadm1138 We noticed you were in the DFW security line for 49 minutes. We would like some feedback about that, https://t.co/...

Er, wait. That's Big Brother. Sorry, dial slipped. Let's try again.

Victoria's Secret
@sysadm1138 We noticed you spent time in our DES MOINES, IA store. If you have time, please take a short survey about your visit. https://t.co/...

You've probably run into this one: you hit a random website, and then that site haunts your web-ads (for those of you who don't run on AdBlock-Strict) for weeks.

They haven't figured out that a large percentage of us don't like being reminded we live in a panopticon. Give me my false illusion of anonymity and I'm happy!

It's all about the user-factors. What's good for the retailer is not always good for their consumers. Obviously. But the best of these are the ones that aren't obviously not-good for the consumer.

User-factors, people!

A minimum vacation policy

A, "dude, that's a cool idea," wave has passed through the technology sector in the wake of an article about a minimum vacation policy. This was billed as an evolution of the Unlimited Vacation Policy that is standard at startups these days. The article correctly points out some of the social features of unlimited-vacation-policies that aren't commonly voiced:

  • No one wants to be the person who takes the most vacation.
  • No one wants to take more vacation than others do.
  • Devaluing vacation means people don't actually take them. Instead opting for low-work working days in which they only do 2 hours remotely instead of a normal 10 in the office.

These points mean that people with an unlimited policy end up taking less actual vacation than people at workplaces with an explicit 15 days a year policy. Some of the social side-effects of a discrete max-vacation policy are not often spelled out, but are:

  • By counting it, you are owed it. If you have a balance when you leave, you're owed the pay for those earned days.
  • By counting it, it has more meaning. When you take a vacation day, you're using a valuable resource and are less likely to cheapen it by checking in at work.
  • There is never any doubt that you can use those days, just on what days you can use them (maintain coverage during the holidays/summer, that kind of thing).

Less stress all around, so long as a reasonable amount is given. To me, this looks like a better policy than unlimited.

But what about minimum-vacation? What's that all about?

The idea seems to be a melding of the best parts of unlimited and max. Employees are required to take a certain number of days off a year, and those days have to be full-disconnect days in which no checking in on work is done. Instead of using scarcity to urge people to take real vacations, it explicitly states you will take these days and you will not do any work on them. For the employer it means you do have to track vacation again, but they're required days, don't create the vacation-cash-out liability that max-vacation policies create, and you only have to track up to the defined amount. If an employee takes 21 days in a year, you don't care since you stopped tracking once they hit 15.

The social factors here are much healthier than unlimited:

  • Explicit policy is in place saying that vacations are no-work days. People get actual down-time.
  • Explicit policy is in place that N vacation days shall be used, so everyone expects to use at least those days. Which is probably more than they'd use with an unlimited policy.
  • Creates the expectation that when people are on vacation, they're unreachable. Which improves cross-training and disaster resilience.

I still maintain that a max-vacation policy working in-hand with a liberal comp-time policy is best for workers, but I can't have everything. I like min-vacation a lot better than unlimited-vacation. I'm glad to see it begin to take hold.

Categories

Humans are curious critters. We keep trying to pick apart reality to figure out how it works. Part of that is to break reality up into smaller chunks so it makes sense. Abstractions improve understanding and allow further refinements to the model. It's what science is based on.

Biology is a continually vexing problem, though. In many ways, it's a continuous function we keep attempting to turn into a discrete maths problem. Taxonomy, the naming of species, is a great example of this. Early classification methods relied on similar morphology to determine relatedness, and that gave us a nice family tree. Then we figured out how to sequence genomes and we learned how wrong we were; taxonomists are now moving whole species/phylum branches around. It turns out nature sometimes solves the same problem the same way through completely unrelated species.

What sparked the rearrangement? A new way to classify. A new method was picked to be more accurate, and changes had to be made.

Topically, take a look at OS classification of non-mobile consumer computing devices (what used to be called desktop-OS). You can see this on any web-visitor analytics platform. Some break it down like:

  • Windows
  • OSX
  • Linux

Others get more specific, breaking it down to versions within OS:

  • Windows XP
  • Windows 7
  • Windows 8
  • OSX 10.6
  • OSX 10.7
  • OSX 10.8
  • OSX 10.9
  • OSX 10.10
  • Linux

For some reason they don't break apart the Linux versions. Perhaps because it's such a small segment of the market and highly fragmented at that. Still more detailed charts go down to Windows service-pack levels. OS version is a discrete space, but in order to provide a brief chart some simplifications are made. Each analytics application makes its own classification decisions.

Less topically, let's take a look at a made-up species, the Variegated Civet. Take the physical sex of this critter. The original population study was done in 1906 and an odd sex ratio was observed, 1.3 females to every male. As with all studies of the time, external morphology studies were used to determine sex with a few dissections as a cross-check.

Fast forward a bunch of years and genetic studies become financially doable for an appropriate sample-size of the population. It reveals a funny thing. Some of the females are genetically male. This raises eyebrows and further studies reveal the cause. A significant percentage of males undergo gonadogenesis at puberty, not in-utero, which skewed the original study's sex ratio.

A new classification technique, genetics, reveals an interesting feature in a specific population. It also raises the question of how to differentiate pre-puberty males with fully formed gonads from those who will form them later. A third sex may need to be created to explain this species.

We're undergoing an attempt to change the cultural classification method for gender in humans. For ages it has been based on physical morphology and came in two types. Nature being nature, there are plenty of ambiguous presentations to make the classifier problem harder (intersex); but not enough to prompt the creation of a third gender. Those weird-cases were assigned into one or the other, whichever was closer, in the opinion of the classifier (sometimes nudging things along with a bit of surgery). For ages gender was a synonym of sex.

That's beginning to change, and it's not been an easy thing to bring about. For one, gender is becoming more widely seen as discrete from sex. For another, gender is at the early stages of redefining its classifier away from external morphology and/or chromosomes and into self description. Self-description brings it away from a discrete function (binary, trinary) and into more of a multi-axis graph.

It takes a long time for a change like this to take hold, and there are fights being had. Vital Records only record one of these, and it's still legally entangling both sex and gender. Maybe in some future time driver's licenses will have both sex and gender fields on them. Or maybe those fields will be left off altogether (the better option, in my opinion). Chromosomes are not truth, as nature's continuous function ensures there will always be an exception (complete Androgen Insensitivity Syndrome is the big exception to the XY = Male 'truth').

The work continues.

Over a month, and nothing

This is what busy looks like. I had an interesting puppet thing happen that I wanted to write about, but I couldn't grab the needed log-file in time. Dammit. It was about an odd message that shows up in user resources when it only goes part way, and was an interaction with VMware OS Customization scripts. Sorry!

In the last month I've:

  • Survived a Great Rearranging in our cubicles. They kept me with my friends which is ♥.
  • Been given the project to deploy our product on [cloud provider] since we actually have a client who will pay us to do it.
  • Written a LOT of puppet-code to refactor our setup into something that can deal with our product on multiple infrastructures.

BusyBusy.
