Paternity leave and on-call


It all started with a tweet pointing to an article you need to read (on Medium.com). Some pull-quotes of interest:

My manager probably didn't realize that "How was your vacation" was the worst thing to ask me after I came back from paternity leave.

Patriarchy would have us believe that parenting is primarily the concern of the mother. Therefore paternity leave is a few extra days off for dad to chillax with his family and help mom out.

Beyond a recovery time from pregnancy, much of parental leave is learning to be a parent and adjusting to your new family and bonding with the baby. I can and did bond with the baby, but not as much as my female coworkers bonded with their babies.

I should also state that I don't just want equality, I want a long time to bond with my child. Three months or more sounds nice. Not only can I learn to soothe him when he's upset and put him to sleep without worrying about being paged, but I can be around when he does the amazing things babies do in their first year: learning to sit, crawl, eat, stand, and even walk.

At my current employer, I was shocked to learn that new dads get two weeks off.

Two.

At my previous startup, paternity leave fell under the jurisdiction of the 'unlimited vacation' policy. Well...

Vacations are important. My friends would joke that the one way to actually be able to take vacations was to keep having children. The conflation of leave and vacation was in jest, and also a caricature of the reality of vacations at startups.

We had a bit of a baby-boom while I was there. Dads who showed up less than two weeks in were glared at and told to go home. After that, most of them worked part-time for a few weeks and slowly ramped back up to full time.

This article caused me to tweet...

The idea here is that IT managers at a company like mine, with very little parental leave, do have a bit of power to give Dad more time with the new kid: take him off the call rota for a while. A better corporate policy is ideal, but this is a local fix that just might help. Dad doesn't have to split his attention between the pager and the new kid.

Interesting idea, but not a great one.

Which is a critique of the disaster-resilience of 3-person teams. I was on one, and we had to coordinate Summer Vacation Season to ensure we had two-person coverage for most of it; if 1-person coverage was unavoidable, we kept it to a couple of days at most. None of us had kids while I was there (the other two had teenagers, and I wasn't about to start), so we never had to live through a paternity-leave-sized hole in coverage.

Which is the kind of team I'm on right now, and why I thought of the idea. We have enough people that a person-sized hole, even a Sr. Engineer-sized hole, can be filled in the rotation for several to many weeks.

That's the ideal route, though, and it touches on a very human point: if you're at a company where you always check mail or can expect pages off-hours, it doesn't matter whether you're in the official call-rotation. That's a company-culture problem independent of the on-call rotation.

My idea can work, but it takes the right culture to pull off. Extended leave would be much better, and is the kind of thing we should be advocating for.

You should still read the article.

The project is done, and you have a monitoring system you like!

Now, how do you keep liking it?

Like all good things, it takes maintenance. There are a few processes you should have in place to provide the right feedback loops to keep liking your shiny new monitoring environment.

  • Ask questions about monitoring in your incident retrospective process.
  • Periodically review active alarms to be sure you still really want them.

Implementing these will provide both upward pressure to expand monitoring into the areas it needs to cover, and downward pressure to get rid of needless noise.

There is more to a monitoring system than alarms and reports. Behind all of those cries for action is a lot of data. A lot of data. So much data that you face scaling problems as you grow, because every system you run generates monitoring data.

Monitor everything!
-- Boss

A great idea in principle, but it falls apart in one key way...

"Define 'everything', please"

'Everything' means different things to different people. It just isn't feasible to track every monitorable on every system with one-second granularity. Just about everyone will want to back away from that level of monitor-all-the-things. But what is the right fit?

It depends on what you want to do with it. Data being tracked supports four broad categories of monitoring.

  1. Performance
  2. Operational
  3. Capacity
  4. SLA

Performance Monitoring

This kind of monitoring tends to have short intervals between polls. It could be five minutes, but may be as little as every second. That kind of monitoring creates a deluge of data, and may only be done when diagnosing a problem that has surfaced or doing in-depth research on a system. It's not run all the time, unless you really do care about per-second changes in the state of something.

This kind of monitoring is defined by a few attributes:

  • High granularity. You poll a lot.
  • Low urgency. You're doing this because you're looking into something, not because it's down.
  • Occasional need. You don't run it all the time, and not on a schedule.

Everything: 1 second granularity for CPU, IOPS, and pagefaults for a given cluster.


Operational Monitoring

The kind we're all familiar with. This is the kind of monitoring that tends to emit alarms for on-call rotations.

  • Medium granularity. Every 5 minutes, that kind of thing.
  • High urgency. Fast responses are needed.
  • Constant need. You run it all the time.

Everything: Every disk-event, everywhere.


Capacity Monitoring

Some of the alarms you have defined already may be capacity alarms, but capacity monitoring just tracks how much you are using of what you have. Some of this stuff doesn't change very fast.

  • Low granularity. It may only get checked once a day.
  • Low urgency. Responding in a couple of days may be fast enough, if not slower.
  • Periodic need. Reviewed once in a while, probably on a schedule.

Everything: Anything that has a "Max" size value greater than the "Current" value.


SLA Monitoring

I've already gone on at length about SLAs, but this is the monitoring that directly supports the SLA pass/fail metrics. I break it apart from the other types because of how it's accessed.

  • Low granularity. Some metrics may be medium, but in general SLA trackers aggregate over time.
  • Medium urgency. If a failing grade is determined, a response needs to happen. How fast depends on which commitment is about to be missed.
  • Continual and periodic need. Some things will be monitored continually; others will only be checked on long schedules, possibly once a week, if not once a month.

Everything: Everything it takes to build those reports.


Be aware that 'everything' is context-sensitive when you're talking with people, and don't freak out when a grand high executive says 'everything' at you. They're probably thinking about the SLA Monitoring version of everything, which is entirely manageable.
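To make those four categories concrete, here's a minimal sketch of what a per-category polling plan could look like. The metric names, intervals, and the shape of the config are all made up for illustration; the point is that 'everything' gets a different polling interval depending on which kind of monitoring it feeds.

    # Hypothetical polling plan, one entry per monitoring category.
    # Metric names and intervals are illustrative, not recommendations.
    POLLING_PLAN = {
        "performance": {   # run on demand, only while investigating something
            "interval_seconds": 1,
            "metrics": ["cpu", "iops", "pagefaults"],
            "always_on": False,
        },
        "operational": {   # the alarm-generating tier
            "interval_seconds": 300,
            "metrics": ["service_up", "disk_errors"],
            "always_on": True,
        },
        "capacity": {      # checked daily, reviewed on a schedule
            "interval_seconds": 86400,
            "metrics": ["disk_used_pct", "license_seats_used"],
            "always_on": True,
        },
        "sla": {           # whatever feeds the pass/fail reports
            "interval_seconds": 3600,
            "metrics": ["request_latency_p99", "availability"],
            "always_on": True,
        },
    }

    def data_points_per_day(plan):
        """Rough estimate of how much data each always-on category generates per host."""
        return {
            name: (86400 // cfg["interval_seconds"]) * len(cfg["metrics"])
            for name, cfg in plan.items()
            if cfg["always_on"]
        }

    print(data_points_per_day(POLLING_PLAN))

Running a quick count like that before agreeing to 'monitor everything' gives you a number to hand back when the question comes up.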

Don't panic, and keep improving your monitoring system.

In the last article we created a list of monitorables and things that look like the kind of alarms we want to see.

Now what?

First off, go back to the list of alarms you already have. Go through those and see which of those existing alarms directly support the list you just created. It may be depressing how few of them do, but rejoice! Fewer alarms mean fewer emails!

What does 'directly support' mean?

Let's look at one monitorable and see what kind of alarms might directly or indirectly support it.

Main-page cluster status.

There are a number of alarms that could already be defined for this one.

  • Main-page availability as polled directly on the load-balancer.
  • Pingability of each cluster member.
  • Main-page reachability on each cluster member.
  • CPU/Disk/Ram/Swap on each cluster member.
  • Switch-port status for the load-balancer and each cluster-member.
  • Webserver process existence on each cluster member.
  • Webserver process CPU/RAM usage on each cluster member.

And more, I'm sure. That's a lot of data, and we don't need to define alarms for all of it. The question to ask is, "How do I determine the status of the cluster?"

The answer could be, "All healthy nodes behind the load-balancer return the main-page, with at least three nodes behind the load-balancer for fault tolerance." This leads to a few alarms we'd like to see:

  • Cluster has dropped below minimum quorum.
  • Node ${X} is behind the load-balancer but serving up errors.
  • The load-balancer is not serving any pages.

We can certainly track all of those other things, but we don't need alarms on them; they will come in handy when the below-quorum alarm is responded to. This list is what I'd call directly supporting. The rest are indirect indicators, and we don't need PagerDuty to tell us about them; we'll find them ourselves once we start troubleshooting the actual problem.
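As a rough sketch, here's what checking for just those three directly-supporting alarms might look like. The URLs, the quorum number, and the bare HTTP check are stand-ins for whatever your load-balancer and monitoring system actually expose.

    import urllib.request

    # Hypothetical endpoints and threshold; the real ones come from your
    # load-balancer's API and your monitoring system's node inventory.
    LOAD_BALANCER_URL = "https://lb.example.com/"
    MIN_HEALTHY_NODES = 3

    def page_ok(url, timeout=5):
        """True if the main page comes back as an HTTP 200."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except OSError:
            return False

    def cluster_alarms(node_urls):
        """Return only the alarms that directly support 'main-page cluster status'."""
        alarms = []
        healthy = [u for u in node_urls if page_ok(u)]
        failing = [u for u in node_urls if u not in healthy]

        if not page_ok(LOAD_BALANCER_URL):
            alarms.append("The load-balancer is not serving any pages.")
        if len(healthy) < MIN_HEALTHY_NODES:
            alarms.append("Cluster has dropped below minimum quorum.")
        for node in failing:
            alarms.append(f"Node {node} is behind the load-balancer but serving up errors.")
        return alarms

Everything else on the longer list is data you can pull up while working whichever of these three fires.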


Now that we have a list of existing alarms we want to keep and a list of alarms we'd like to have, the next step is determining when we want to be alarmed.

You are in the weeds.

You're getting a thousand or more alarms over the course of a weekend. You can't look at them all; that would require not sleeping. And you actually do sleep. How do you cope with it?

Lots of email rules, probably. Send the select few you actually care about to the email-to-sms gateway for your phone. File others into the special folders that make your phone go bingle. And mark-all-read the folder with 1821 messages in it when you get to the office on Monday (2692 on Tuesday after a holiday Monday).

Your monitoring system is OK. It does catch stuff, but the trick is noticing the alarm in all the noise.

You want to get to the point where alarms come rarely, and get acted upon when they show up. 1821 messages a weekend is not that system. Over a 60-hour weekend, 1821 messages is one message every two minutes. Or, if it's like most monitoring systems, it's a few messages an hour with a couple of bursts of hundreds over a few polling cycles, as something big flaps and everything behind it goes 'down'. That alarm load is only sustainable with a fully staffed, round-the-clock NOC.

Very few of us have those.

Paring down the load requires asking a few questions:

LISA 2013 was very good to me. I saw a lot of sessions about monitoring and the theory behind it, and I've spent most of 2014 trying to revise the monitoring system at work to be less sucky and more awesome. It's mostly worked, and it's an awesome-thing that's definitely going on my resume.

Credit goes primarily to two sources:

  • A Working Theory of Monitoring, by Caskey L Dickson of Google, from LISA 2013.
  • SRE University: Non-abstract large systems design for sysadmins, by John Looney and company, of Google, also at LISA 2013.

It was these sessions that inspired me to refine a slide in A Working Theory of Monitoring and put my own spin on it:

I'll be explaining this further, but this is what the components of a monitoring system look like. Well, a monitoring ecosystem, since there is rarely a single system that does everything. There are plenty of companies that will sell you a product that does everything, but even those get supplemented by home-brew reporting engines and custom-built automation scripts.

There is more to an on-call rotation than a shared calendar with names on it and an agreement to call whoever is on the calendar if something goes wrong.

People are people, and you have to take that into consideration when setting up a rotation. That means compromise, setting expectations, and consequences for not meeting them. Here are a few policies every rotation should have written down somewhere, preferably somewhere easy to get to.

The rotation should be published well in advance, and easy to find.

This seems like an obvious thing, but it needs to be said. People need to know in advance when they're going to be obligated to pay attention to work in their usual time off. This allows them to schedule their lives around the on-call schedule, and you, as the on-call manager, will have to deal with fewer shift-swaps as a result. You're looking to avoid...

Um, I forgot I was on-call next week, and I'm going to be in Peru to hike the Andes. *sheepish look.*

This is less likely to happen if the shift schedule is advertised well in advance. For bonus points, add the shift schedules to their work calendars.

(US) Monday Holiday Law is a thing. Don't do shift swaps on Monday.

If you're doing weekly shifts, it's a good idea not to do your shift swap on Monday. Thanks to the US Monday Holiday Law, there are five weeks in the year (10% of the total!) where your shift change would land on an official holiday. Two of those are days almost everyone gets off: Labor Day and Memorial Day.

Whether or not you need to avoid shift swaps on a non-work day depends a lot on how hand-offs work for your organization.
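If you want to go the bonus-point route, generating the schedule is simple enough to script. This is a toy sketch with a made-up roster that pins weekly handoffs to Tuesdays, so a shift change never lands on a US Monday holiday.

    from datetime import date, timedelta
    from itertools import cycle

    # Illustrative roster; swap in your own team.
    ROSTER = ["alice", "bob", "carol", "dave"]

    def weekly_rotation(start, weeks, roster=ROSTER):
        """Yield (shift_start, engineer) pairs for a weekly round-robin rotation."""
        # Advance to the next Tuesday (weekday() == 1 means Tuesday).
        start = start + timedelta(days=(1 - start.weekday()) % 7)
        people = cycle(roster)
        for week in range(weeks):
            yield start + timedelta(weeks=week), next(people)

    # Publish the next quarter's schedule well in advance.
    for shift_start, engineer in weekly_rotation(date.today(), weeks=13):
        print(f"{shift_start.isoformat()}  {engineer}")

Getting the output into everyone's work calendar is left to whatever calendar system you're stuck with.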

Set shift-handoff expectations.

When one watch-stander is relieved by the next, there needs to be a handoff. For some organizations it could be as simple as making sure the other person is there and responsive before stepping down. For others, it can be more complicated, as there is more state to transfer. State such as:

  • Ongoing issues being tracked.
  • Hardware replacements due during the next shift period.
  • Maintenance tasks not completed by the outgoing watch-stander.
  • Escalation engineers that won't be available.

And so on. If your organization has state to transfer, be sure you have a policy in place to ensure it is transferred.
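One way to make sure the state actually gets transferred is to give it a shape. Here's a hypothetical handoff template mirroring the list above; the field names are mine, not a standard.

    from dataclasses import dataclass, field

    @dataclass
    class ShiftHandoff:
        """What the outgoing watch-stander hands to the incoming one."""
        ongoing_issues: list[str] = field(default_factory=list)
        hardware_due_next_shift: list[str] = field(default_factory=list)
        unfinished_maintenance: list[str] = field(default_factory=list)
        unavailable_escalation_engineers: list[str] = field(default_factory=list)

        def is_clean(self) -> bool:
            """True when there's nothing to pass along beyond 'you have the watch'."""
            return not any((self.ongoing_issues, self.hardware_due_next_shift,
                            self.unfinished_maintenance,
                            self.unavailable_escalation_engineers))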

Acknowledge time must be defined for alarms.

The maximum time a watch-stander is allowed to wait before ACKing an alarm must be defined by policy, and failure to meet that must be noticed. If the ACK time expires, the alarm should escalate to the next tier of on-call.

This is a critical policy to define, as it allows watch-standers to predict how much life they can have while on-call. If the ACK time is 10 minutes, a watch-stander who is driving has to pull over to ack the alarm. A 10-minute ACK also likely means they can't do anything long, like go to a movie or their kid's band recital.

This also goes for people who are on the escalation schedule. It may be different times, but it should still be defined.

Response time must be defined for alarms.

The time to first response, actually working on the reported problem, must be defined by policy, and failure to meet that must be noticed. This may vary based on the type of alarm received, but it should still be defined for each alarm.

This is another critical policy to define, and it has an even bigger impact on a watch-stander's ability to do anything other than be at work. The watch-stander pretty much has to stay within N minutes of a reliable Internet connection for their entire watch. If their commute to and from work is longer than N minutes, they can't be on watch during their commute. Other constraints follow:

  • If 10 minutes or less, it is nearly impossible to do anything out of the house, such as picking up prescriptions or going to the kid's band recital, soccer game, or theatre rehearsal.
  • If 20 minutes or less, quick errands out of the house may be doable, but that's about it.

As with ACK time, this should also be defined for the people on the escalation schedule.
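To make the two timers concrete, here's a small sketch of the escalation decision they imply. The 10- and 20-minute numbers are examples only, and the logic (escalate one tier when either timer blows) is my reading of the policy, not a feature of any particular paging product.

    from dataclasses import dataclass

    # Example numbers only; the real values belong in your written policy.
    ACK_TIMEOUT_MIN = 10        # maximum time before an un-acked alarm escalates
    RESPONSE_TIMEOUT_MIN = 20   # maximum time before someone must be working the problem

    @dataclass
    class Alarm:
        age_min: float                # minutes since the alarm fired
        acked: bool = False
        response_started: bool = False

    def tier_to_page(alarm: Alarm, current_tier: int = 0) -> int:
        """Decide which on-call tier should be paged for this alarm right now."""
        if not alarm.acked and alarm.age_min > ACK_TIMEOUT_MIN:
            return current_tier + 1   # nobody acked in time: escalate
        if (alarm.acked and not alarm.response_started
                and alarm.age_min > RESPONSE_TIMEOUT_MIN):
            return current_tier + 1   # acked, but nobody is working it: escalate
        return current_tier           # the current watch-stander still owns it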

Consequences for escalations must be defined.

People are people, and sometimes we can't get to the phone for some reason. Escalations will happen for both good reasons (ER visits) and bad (slept through it). The severity of a missed alarm varies based on the organization and what got missed, so this will be a highly localized policy. If an alarm is missed, or a pattern of missed alarms develops, there should be defined consequences.

This is the kind of thing that can be brought up in quarterly/annual reviews.


This gets its own section because it's very important:

The right shift length

How long your shifts should be is a function of several variables:

  • ACK time.
  • Response time.
  • Frequency of alarms.
  • Average time-to-resolution (TTR) for the alarms.

People need to sleep, and watch-standers are more effective when they're not sleep-deprived. Chronically sleep-deprived people have poor job satisfaction and are more likely to leave for greener pastures. Here is a table I made showing combinations of alarm frequency and TTR, and on average how many minutes a watch-stander will have between periods of on-demand attention:

                         Time-to-resolution (TTR)
Alarm Freq    < 5 min   5 to 10 min   10 to 20 min   up to 30 min   up to 60 min
15 min           10          5              0              0              0
30 min           25         20             10              0              0
60 min           55         50             40             30              0
90 min           85         80             70             60             30
2 hour          115        110            100             90             60
4 hour          235        230            220            210            180
6 hour          355        350            340            330            300

Given that the average sleep cycle is 45 minutes, any combination that leaves less slack than that is a shift the watch-stander will not be able to sleep through. If you also allow people time to actually fall asleep, say 20 minutes, that rules out anything at or under 70 minutes of slack. For people who don't go to sleep easily (such as me), even the 80 and 85 minute slack times would be too little. The remaining combinations are the ones where sleeping while on-call could actually happen.

If your alarms are frequent enough and take long enough to resolve, say 30% or more of the watch-stander's attention, you don't have an on-call rotation, you have a distributed NOC, and being on watch is a full-time job. Don't expect them to do anything else.
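The table above works out to a simple formula: slack is the alarm interval minus the TTR, floored at zero, and the attention figure is TTR divided by the interval. Here's a quick sketch, with my own variable names, that recomputes it:

    # Recompute the slack table: slack is the average gap between finishing
    # one alarm and the next one firing.
    ALARM_FREQS_MIN = [15, 30, 60, 90, 120, 240, 360]
    TTRS_MIN = [5, 10, 20, 30, 60]

    def slack(freq, ttr):
        """Minutes of uninterrupted time between alarms (never negative)."""
        return max(freq - ttr, 0)

    def attention_fraction(freq, ttr):
        """Share of the watch spent actively working alarms."""
        return min(ttr / freq, 1.0)

    for freq in ALARM_FREQS_MIN:
        row = [slack(freq, ttr) for ttr in TTRS_MIN]
        busy = attention_fraction(freq, TTRS_MIN[-1])
        print(f"alarms every {freq:>3} min: slack {row}, worst-case attention {busy:.0%}")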

If you're in the no-sleep zone, shifts shouldn't be longer than a day, and probably should be shorter.

If you're near it, you probably shouldn't have a week of that kind of thing, so shift lengths should be less than 7 days.

Shift lengths longer than these guidelines risk burnout and creating rage-demons. We have too many rage-demons as it is, so please have a heart.


I believe this policy set provides the groundwork for a well-defined on-call rotation. The on-call engineers know what is expected of them, and they know the or-else if they don't live up to it.

This showed up today.

I get that. The little white lie that it's all right, that I wasn't offended. The lying silence where the "check that bullshit" should have been. The desire to belong to the in-group (or an in-group, even if it's an in-group of one) is probably baked into our genetics. Those who arbitrate membership in the in-group set the standards by which membership is granted. So long as there is power there, the little internal betrayals needed to achieve membership, or, if that isn't possible, satellite membership, can be justified.

For a while. Until the price starts getting too high.

If the in-group is in all of the positions of both power and employee redress? That's a lot of incentive to shut the fuck up and laugh like you mean it.

And if you keep poking at it, because shutting the fuck up and laughing already is becoming very hard, you lose in-group status.


This is a very human progression; we've been doing it since pre-history. The modern workplace is supposed to be set up to deal with toxic managers and hostile work environments, but cronyism is incredibly corrosive. It takes active push-back to fend off, and if the corruption is deep enough, that push-back just costs you your job.

Most corporate severance agreements include something called a non-disparagement clause, which means, in effect:

The severed employee agrees to not say bad things about the Company, or cause material harm to the Company's business through their actions.

And accusing a manager of being a harassing asshole is the kind of thing that could trigger that clause. By telling the world about her experience with this manager, naming names, and calling out the toxic culture of that particular work-unit, she can be considered to be causing 'material harm' and could face serious legal consequences. If Google wants to be assholes about it, of course. But the language is there in the agreement specifically to scare ex-employees out of doing things like this.

The internal system was stacked against her, and the court of public opinion was also stacked against her by the very company that had the bad culture.


I'm guilty of making the same kind of calculations. I didn't seek in-group status as firmly as Kelly did, and it got me fired in the end. It turned out well for me, but was pretty traumatic at the time.

While I was there I did consciously choose to not call out jokes, behavior, or other things that offended me, specifically because I needed to stay on good terms with the in-group. I never got to crying, but the little niggling things did add up. It meant I didn't stay long at company events, didn't follow on after-work outings to bars, and generally stayed quiet a lot of the time. It was noticed.

Encryption is hard


I've run into this workflow problem before, but it happened again so I'm sharing.


We have a standard.

No passwords in plain-text. If passwords need to be emailed, the email will be encrypted with S/MIME.

Awesome. I have certificates, and so do my coworkers. Should be awesome!

To: coworker
From: me
Subject: Anti-spam appliance password

[The content can't be displayed because the S/MIME control isn't available]

Standard followed, mischief managed.

To: me
From: coworker
Subject: RE: Anti-spam appliance password

Thanks! Worked great.

> To: coworker
> From: me
>
> uid: admin1792
> pw: 92*$&diq38yljq3
> https://172.2.245.11/login.cgi

Sigh.

Encryption is hard. It would be awesome if a certain mail-client defaulted to replying-in-kind to encrypted emails. But it doesn't, and users have to remember to click the button. Which they never do.

Ratios


In an effort to better understand the challenges facing the ops team of a particular project here at $DayJob, a project manager asked this question:

How many users per [sysadmin] can our system support?

The poor lead sysadmin on that side of the house swiveled her chair over and said to me, "There is no answer to this question!" And we had a short but spirited discussion about the various system-to-admin-staff ratios at the places we've been. Per-user is useless, we agreed. Machine/instance count per admin? Slightly better. But even then. Between us we compiled a short list of places we've been and places we've read about.

  • Company A: 1000:1. And most of that 1 FTE went to parts-monkey work keeping the install-base running; the system-to-engineer ratio was closer to 10K:1. User count: the global Internet.
  • Company B: 200:1. Desperately understaffed, as the ops team was frantically trying to keep up with a runaway application and a physical plant that was rotting under the load. User count: most of the US.
  • Company C: 150:1. Which was just right! User count: none, it was a product still in development.
  • Company D: 60:1. And the admin was part-time because there wasn't enough work. User count: 200.
  • Company E: 40:1. Largely because 25-30 of those 40 systems were one-offs. It was a busy team. Monocultures are for wimps. User count: 20K.

This chart was used to explain the "it depends" nature of admin staffing levels to the project manager in question, and why we can't rely on industry norms to determine the target we should be hitting. Everyone wants to be like Company A. Almost no one gets there.

What are the ratios you've worked with? Let me know @sysadm1138
