Recently in sysadmin Category

I've seen this dynamic happen a couple of times now. It goes kind of like this.

October: We're going all in on AWS! It's the future. Embrace it.
November: IT is working very hard on moving us there, thank you for your patience.
December: We're in! Enjoy the future.
January: This AWS bill is intolerable. Turn off everything we don't need.
February: Stop migrating things to AWS, we'll keep these specific systems on-prem for now.
March: Move these systems out of AWS.
April: Nothing gets moved to AWS unless it produces more revenue than it costs to run.

What's prompting this is a shock that is entirely predictable, but manages to penetrate the reality distortion field of upper management because the shock is to the pocketbook. They notice that kind of thing. To illustrate what I'm talking about, here is a made-up graph showing technology spend over the course of several years.

[Graph: BudgetType-AWS.png]

The AWS line actually results in more money over time, as AWS does a good job of capturing costs that the traditional method generally ignores or assumes are lost in general overhead. But the screaming doesn't happen at the end of four years when they run the numbers, it happens in month four when the ongoing operational spend after build-out is done is w-a-y over what it used to be.

The spikes for traditional on-prem work are for forklifts of machinery showing up. Money is spent, new things show up, and they impact the monthly spend only infrequently. In this case, the base-charge increased only twice over the time-span. Some of those spikes are for things like maintenance-contract renewals, which don't impact base-spend one whit.

The AWS line is much less spiky, as new capabilities are absorbed into the base-budget on an ongoing basis. You're no longer dropping $125K in a single go, you're dribbling it out over the course of a year or more. AWS price-drops mean that monthly spend actually goes down a few times.

Pay only for what you use!

Amazon is great at pointing that out, and highlighting the convenience of it. But what they don't mention is that by doing so, you will learn the hard way what it is you really use. The AWS Calculator is an awesome tool, but if you don't know how your current environment behaves, using it to predict what you'll end up spending is like throwing darts at a wall. You end up obsessing over small line-item charges you've never had to worry about before (how many IOPS do we do? Crap! I don't know! How many thousands will that cost us?), and missing the big items that nail you (Whoa! They meter bandwidth between AZs? Maybe we shouldn't be running our Hadoop cluster in multi-AZ mode).
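
To make that concrete, here's a rough sketch of how two of those easily-missed line items add up. The dollar rates and traffic numbers are placeholders I made up for illustration, not current AWS pricing; swap in the figures from your own bill or the calculator.

    # Rough sketch of two easily-missed AWS line items. All rates and volumes
    # below are illustrative assumptions, NOT current AWS pricing.

    ASSUMED_INTER_AZ_RATE = 0.01    # $/GB, charged in each direction (assumption)
    ASSUMED_PIOPS_RATE = 0.065      # $/provisioned IOPS per month (assumption)

    def inter_az_cost(gb_per_day: float, days: int = 30) -> float:
        """Monthly cost of traffic crossing AZ boundaries, billed both ways."""
        return gb_per_day * days * ASSUMED_INTER_AZ_RATE * 2

    def piops_cost(provisioned_iops: int) -> float:
        """Monthly cost of provisioned IOPS on an EBS volume."""
        return provisioned_iops * ASSUMED_PIOPS_RATE

    # A hypothetical Hadoop cluster shipping 500 GB/day between AZs,
    # plus 4,000 provisioned IOPS of storage:
    print(f"Inter-AZ transfer: ${inter_az_cost(500):,.2f}/month")   # $300.00
    print(f"Provisioned IOPS:  ${piops_cost(4000):,.2f}/month")     # $260.00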

There is a reason that third party AWS integrators are a thriving market.

Also, this 'what you use' is not subject to Oops exceptions without a lot of wrangling with Account Management. Had something that downloaded the entire EPEL repo twice a day for a month, and only learned about it when your bandwidth charge was 9x what it should be? Too bad, pay up or we'll turn the account off.

Unlike the forklift model, you pay for it every month without fail. If you have a bad quarter, you can't just not pay the bill for a few months and true-up later. You're spending it, or they're turning your account off. This takes away some of the cost-shifting flexibility the old style had.

Unlike the forklift model, AWS prices its stuff assuming a three-year turnover rate. Many companies have a 5-to-7-year lifetime for IT assets: three to four years in production, with an afterlife of two to five years in various pre-prod, mirror, staging, and development roles. The cost of those assets therefore amortizes over 5-9 years, not 3.
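
As a back-of-the-envelope illustration (with a made-up purchase price), here's what that amortization difference does to the effective monthly cost of one forklift delivery:

    # Made-up numbers: one $125K forklift purchase, amortized two different ways.

    capex = 125_000               # dollars, one hardware purchase
    aws_assumed_life_years = 3    # the turnover rate AWS pricing roughly assumes
    onprem_life_years = 7         # 3-4 years production plus an afterlife

    print(f"3-year amortization: ${capex / (aws_assumed_life_years * 12):,.0f}/month")
    print(f"7-year amortization: ${capex / (onprem_life_years * 12):,.0f}/month")
    # Roughly $3,472/month versus $1,488/month for the same spend.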

Predictable spending, at last.

Hah.

Yes, it is predictable over time, given an accurate understanding of what is under management. But when your initial predictions end up being wildly off, it seems like it isn't predictable. It seems like you're being raked over the coals.

And when you get a new system into AWS and the cost forecast is wildly off, it doesn't seem predictable.

And when your system gets the rocket-launch you've been craving and you're scaling like mad; but the scale-costs don't match your cost forecast, it doesn't seem predictable.

It's only predictable if you fully understand the cost items and how your systems interact with them.

Reserved instances will save you money

Yes! They will! Quite a lot of it, in fact. They let a company go back to the forklift-method of cost-accounting, at least for part of it. I need 100 m3.large instances, on a three year up-front model. OK! Monthly charges drop drastically, and the monthly spend chart begins to look like the old model again.

Except.

Reserved instances cost a lot of money up front. That's the point, that's the trade-off for getting a cheaper annual spend. But many companies get into AWS because they see it as cheaper than on-prem. Which means they're sensitive to one-month cost-spikes, which in turn means buying reserved instances doesn't happen and they stay on the high cost on-demand model.
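
A sketch of that trade-off, using placeholder prices rather than real AWS rates, looks something like this:

    # Placeholder prices, NOT real AWS rates; the shape of the math is the point.

    INSTANCES = 100
    MONTHS = 36
    HOURS_PER_MONTH = 730

    on_demand_hourly = 0.133          # $/hour per instance (assumption)
    all_upfront_per_instance = 1800   # 3-year, all-upfront reservation (assumption)

    on_demand_total = on_demand_hourly * HOURS_PER_MONTH * INSTANCES * MONTHS
    reserved_total = all_upfront_per_instance * INSTANCES

    print(f"On-demand, 3 years:   ${on_demand_total:,.0f} spread over 36 bills")
    print(f"All-upfront reserved: ${reserved_total:,.0f} in a single month-one bill")

The three-year total is far lower, but that single month-one bill is exactly the kind of cost-spike that scares the decision off.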

AWS is Elastic!

Elastic in that you can scale up and down at will, without week- or month-long billing, delivery, and integration cycles.

Elastic in that you have choice in your cost accounting methods. On-demand, and various kinds of reserved instances.

It is not elastic in when the bill is due.

It is not elastic with individual asset pricing, no matter how special you are as a company.


All of these things trip up upper, non-technical management. I've seen it happen three times now, and I'm sure I'll see it again at some point.

Maybe this will help you illuminate these issues with your own management.

I'm not a developer but...


...I'm sure spending a lot of time in code lately.

Really. Over the last five months, I'd say that 80% of my normal working hours are spent grinding on puppet code. Or training others so they can maybe do some puppet stuff too. I've even got some continuous integration work in, building a trio of sanity-tests for our puppet infrastructure (a sketch of wiring them together follows the list):

  • 'puppet parser validate' returns OK for all .pp files.
    • Still on the 'current' parser, we haven't gotten as far as future/puppet4 yet.
  • puppet-lint returns no errors for the modules we've cleared.
    • This required extensive style-fixes before I put it in.
  • Catalogs compile for a certain set of machines we have.
    • I'm most proud of this, as this check actually finds dependency problems unlike puppet-parser.
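
Something like this could wire the three checks into one CI job. It's only a sketch: the module paths, the 'cleared' module list, and the node names are hypothetical, and the catalog-compile invocation varies by Puppet version.

    #!/usr/bin/env python3
    # Sketch of the three puppet sanity-checks as one CI step.
    # Paths, module names, and node names below are hypothetical.

    import subprocess
    import sys
    from pathlib import Path

    def run(cmd):
        print("+", " ".join(cmd))
        return subprocess.run(cmd).returncode

    failures = 0

    # 1. 'puppet parser validate' must return OK for every .pp file.
    for manifest in Path("modules").rglob("*.pp"):
        failures += run(["puppet", "parser", "validate", str(manifest)]) != 0

    # 2. puppet-lint must return no errors for the modules we've cleared.
    for module in ("ntp", "sshd"):                      # hypothetical cleared modules
        failures += run(["puppet-lint", f"modules/{module}"]) != 0

    # 3. Catalogs must compile for a representative set of machines.
    #    'puppet master --compile' is the Puppet 3.x-era invocation; use whatever
    #    your site uses to compile a catalog.
    for node in ("web01.example.com", "db01.example.com"):  # hypothetical nodes
        failures += run(["puppet", "master", "--compile", node]) != 0

    sys.exit(1 if failures else 0)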

Completely unsurprisingly, the CI stuff has actually caught bugs before they got pushed to production! Whoa! One of these days I'll be able to grab some of the others and demo this stuff, but we're off-boarding a senior admin right now and the brain-dumping is not being done by me for a few weeks.

We're inching closer to getting things rigged so that a passing build on the 'master' branch triggers an automatic deployment. That will take a bit of thought, as some deploys (such as class-name changes) require coordinated modifications in other systems.

Because I get to define what's 'possible', and anything is possible given enough time, management backing, and an unlimited budget.

If I don't have management backing, I will decide on my own how to fit this new ASAP in amongst my other ASAP work and the work that has actual deadlines attached to it.

If this ASAP has a time/money tradeoff, I need management backing to tell me which way to go. And what other work to let slough in order to get the time needed.


In the end, there are only a few priority levels that people actually use.

  1. Realtime. I will stand here until I get what I need.
  2. ASAP.
  3. On this defined date or condition.
  4. Whenever you can get to it.

Realtime is a form of ASAP, but it's the kind of ASAP where the requester is highly invested in it and will keep statusing and may throw resources at it in order to get the thingy as soon as actually possible. Think major production outages.

ASAP is really 'as soon as you can get to it, unless I think that's not fast enough.' For sysadmin teams where the load-average is below the number of processors this can work pretty well. For loaded sysadmin teams, the results will not be to the liking of the open-ended deadline requestors.

On this defined date or condition is awesome, as it gives us expectations of delivery and allows us to do queue optimization.

Whenever you can get to it is like nicing a process. It'll be a while, but it'll be gotten to. Eventually.

"ASAP, but no later than [date]" is a much better way of putting it. It gives a hint to the queue optimizer as to where to slot the work amongst everything else.

Thank you.

Paternity leave and on-call


It all started with this tweet.

Which you need to read (Medium.com). Some pull-quotes of interest:

My manager probably didn't realize that "How was your vacation" was the worst thing to ask me after I came back from paternity leave.

Patriarchy would have us believe that parenting is primarily the concern of the mother. Therefore paternity leave is a few extra days off for dad to chillax with his family and help mom out.

Beyond a recovery time from pregnancy, much of parental leave is learning to be a parent and adjusting to your new family and bonding with the baby. I can and did bond with the baby, but not as much as my female coworkers bonded with their babies.

I should also state, that I don't just want equality, I want a long time to bond with my child. Three months or more sounds nice. Not only can I learn to soothe him when he's upset, put him to sleep without worrying about being paged, but I can be around when he does the amazing things babies do in their first year: learning to sit, crawl, eat, stand and even walk.

At my current employer, I was shocked to learn that new dads get two weeks off.

Two.

At my previous startup, paternity leave was under the jurisdiction of the 'unlimited vacation' policy. Well...

Vacations are important. My friends would joke that the one way to actually be able to take vacations was to keep having children. Here the conflation was in jest, and also a caricature of the reality of vacations at startups.

We had a bit of a baby-boom while I was there. Dads who showed up less than two weeks in were glared at and told to go home. After that, most of them worked part-time for a few weeks and slowly worked up to full time.

This article caused me to tweet...

The idea here is that IT managers who work for a company like mine, with a really small amount of parental leave, do have a bit of power to give Dad more time with the new kid: take them off the call rota for a while. A better corporate policy is ideal, but this is the kind of local fix that just might help. Dad doesn't have to split attention between the pager and the new kid.

Interesting idea, but not a great one.

Which is a critique of the disaster-resilience of 3-person teams. I was on one, and we had to coordinate Summer Vacation Season to ensure we had two-person coverage for most of it, and if 1-person coverage was unavoidable, keep it to a couple of days at most. None of us had kids while I was there (the other two had teenagers, and I wasn't about to start), so we didn't get to live through a paternity-leave sized hole in coverage.

Which is the kind of team I'm on right now, and why I thought of the idea. We have enough people that a person sized hole, even a Sr. Engineer sized hole, can be filled for several to many weeks in the rotation.

That's the ideal route though, and touches on a very human point: if you're in a company where you always check mail or can expect pages off-hours, it doesn't matter if you're not in the official call-rotation. That's a company culture problem independent of the on-call rotation.

My idea can work, but it takes the right culture to pull off. Extended leave would be much better, and is the kind of thing we should be advocating for.

You should still read the article.

The project is done, and you have a monitoring system you like!

Now, how do you keep liking it?

Like all good things, it takes maintenance. There are a few processes you should have in place to provide the right feedback loops to keep liking your shiny new monitoring environment.

  • Questions about monitoring should be in your incident retrospective process.
  • A periodic review of active alarms to be sure you still really want them.

Implementing these will provide both upward pressure to expand it into areas it needs to go, and downward pressure to get rid of needless noise.

There is more to a monitoring system than alarms and reports. Behind all of those cries for action are a lot of data. A lot of data. So much data, that you face scaling problems when you grow because all of your systems generate so much monitoring data.

Monitor everything!
-- Boss

A great idea in principle, but falls apart in one key way...

"Define 'everything', please"

'Everything' means different things to different people. It just isn't feasible to track every monitorable on everything with one-second granularity. Just about everyone will want to back away from that level of monitor-all-the-things. But what is the right fit?

It depends on what you want to do with it. Data being tracked supports four broad categories of monitoring.

  1. Performance
  2. Operational
  3. Capacity
  4. SLA

Performance Monitoring

This kind of monitoring tends to have short intervals between polls. It could be five minutes, but may be as little as every second. That kind of monitoring will create a deluge of data, and may only be done when diagnosing exposed problems or doing in-depth research on the system. It's not run all the time, unless you really do care about per-second changes in the state of something.

This kind of monitoring is defined by a few attributes:

  • High granularity. You poll a lot.
  • Low urgency. You're doing this because you're looking into something, not because it's down.
  • Occasional need. You don't run it all the time, and not on a schedule.

Everything: 1 second granularity for CPU, IOPS, and pagefaults for a given cluster.
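
If you needed to roll a quick poller for that kind of spot investigation, a minimal sketch might look like the following. It assumes the third-party psutil library and only covers CPU and disk counters, not page faults.

    # Minimal 1-second performance poller. Assumes the third-party psutil library.
    import time
    import psutil

    while True:
        cpu = psutil.cpu_percent(interval=None)   # CPU % since the previous call
        io = psutil.disk_io_counters()            # cumulative read/write counts
        print(f"{int(time.time())} cpu={cpu:.1f}% "
              f"reads={io.read_count} writes={io.write_count}")
        time.sleep(1)  # in real use, ship these to a metrics store instead of stdout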


Operational Monitoring

The kind we're all familiar with. This is the kind of monitoring that tends to emit alarms for on-call rotations.

  • Medium granularity. Every 5 minutes, that kind of thing.
  • High urgency. Fast responses are needed.
  • Constant need. You run it all the time.

Everything: Every disk-event, everywhere.


Capacity Monitoring

Some of the alarms you have defined already may be capacity alarms, but capacity monitoring just tracks how much you are using of what you have. Some of this stuff doesn't change very fast.

  • Low granularity. It may only get checked once a day.
  • Low urgency. Responding in a couple of days may be fast enough. If not slower.
  • Periodic need. Reviewed once in a while, probably on a schedule.

Everything: Anything that has a "Max" size value greater than the "Current" value.


SLA Monitoring

I've already gone on at length about SLAs, but this is the monitoring that directly supports the SLA pass/fail metrics. I break it apart from the other types because of how it's accessed.

  • Low granularity. Some metrics may be medium, but in general SLA trackers are over-time style.
  • Medium urgency. If a failing grade is determined, response needs to happen. How fast depends on what's not going to get met.
  • Continual and Periodic need. Some things will be monitored continually, others will only be checked on long schedules; possibly once a week, if not once a month.

Everything: Everything it takes to build those reports.


Be aware that 'everything' is context-sensitive when you're talking with people and don't freak out when a grand high executive says, "everything," at you. They're probably thinking about the SLA Monitoring version of everything, which is entirely manageable.

Don't panic, and keep improving your monitoring system.

In the last article we created a list of monitorables and things that look like the kind of alarms we want to see.

Now what?

First off, go back to the list of alarms you already have. Go through those and see which of those existing alarms directly support the list you just created. It may be depressing how few of them do, but rejoice! Fewer alarms mean fewer emails!

What does 'directly support' mean?

Let's look at one monitorable and see what kind of alarms might directly or indirectly support it.

Main-page cluster status.

There are a number of alarms that could already be defined for this one.

  • Main-page availability as polled directly on the load-balancer.
  • Pingability of each cluster member.
  • Main-page reachability on each cluster member.
  • CPU/Disk/RAM/Swap on each cluster member.
  • Switch-port status for the load-balancer and each cluster-member.
  • Webserver process existence on each cluster member.
  • Webserver process CPU/RAM usage on each cluster member.

And more, I'm sure. That's a lot of data, and we don't need to define alarms for all of it. The question to ask is, "How do I determine the status of the cluster?"

The answer could be, "All healthy nodes behind the load-balancer return the main-page, with at least three nodes behind the load-balancer for fault tolerance." This leads to a few alarms we'd like to see:

  • Cluster has dropped below minimum quorum.
  • Node ${X} is behind the load-balancer but serving up errors.
  • The load-balancer is not serving any pages.

We can certainly track all of those other things, but we don't need alarms on them. Those will come in handy when the below-quorum alarm is responded to. This list is what I'd call directly supporting. The rest are indirect indicators, and we don't need PagerDuty to tell us about them; we'll find them ourselves once we start troubleshooting the actual problem.
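
As an illustration of 'directly supporting,' here is a toy sketch of a check that raises exactly those three alarms. The URLs, the quorum size, and the alarm() stand-in are all hypothetical placeholders for whatever your monitoring system actually uses.

    # Toy cluster-status check; everything named here is a hypothetical placeholder.
    import urllib.request

    LB_VIP = "https://www.example.com/"
    NODES = [f"https://web{n:02d}.example.com/" for n in range(1, 5)]
    QUORUM = 3   # minimum healthy nodes for fault tolerance

    def serves_main_page(url: str) -> bool:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status == 200
        except Exception:
            return False

    def alarm(message: str) -> None:
        print("ALARM:", message)   # stand-in for your real paging integration

    if not serves_main_page(LB_VIP):
        alarm("The load-balancer is not serving any pages.")

    healthy = [n for n in NODES if serves_main_page(n)]
    for node in NODES:
        if node not in healthy:
            alarm(f"Node {node} is behind the load-balancer but serving up errors.")
    if len(healthy) < QUORUM:
        alarm(f"Cluster has dropped below minimum quorum ({len(healthy)}/{QUORUM}).")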


Now that we have a list of existing alarms we want to keep and a list of alarms we'd like to have, the next step is determining when we want to be alarmed.

You are in the weeds.

You're getting a thousand or more alarms over the course of a weekend. You can't look at them all, that would require not sleeping. And you actually do sleep. How do you cope with it?

Lots of email rules, probably. Send the select few you actually care about to the email-to-sms gateway for your phone. File others into the special folders that make your phone go bingle. And mark-all-read the folder with 1821 messages in it when you get to the office on Monday (2692 on Tuesday after a holiday Monday).

Your monitoring system is OK. It does catch stuff, but the trick is noticing the alarm in all the noise.

You want to get to the point where alarms come rarely, and get acted upon when they show up. 1821 messages a weekend is not that system. Over a 60 hour weekend, 1821 messages is one message every two minutes. Or, if it's like most monitoring systems, it's a few messages an hour with a couple of bursts of hundreds over the course of a few polling-cycles as something big flaps and everything behind it goes 'down'. That alarming load is only sustainable with a fully staffed round-the-clock NOC.

Very few of us have those.

Paring down the load requires asking a few questions:

LISA 2013 was very good to me. I saw a lot of sessions about monitoring, theories of, and I've spent most of 2014 trying to revise the monitoring system at work to be less sucky and more awesome. It's mostly worked, and is an awesome-thing that's definitely going on my resume.

Credit goes primarily to two sources:

  • A Working Theory of Monitoring, by Caskey L Dickson of Google, from LISA 2013.
  • SRE University: Non-abstract large systems design for sysadmins, by John Looney and company, of Google, also at LISA 2013.

It was these sessions that inspired me to refine a slide in A Working Theory of Monitoring and put my own spin on it:

I'll be explaining this further, but this is what the components of a monitoring system look like. Well, monitoring ecosystem, since there is rarely a single system that does everything. There are plenty of companies that will sell you a product that does everything, but even they get supplemented by home-brew reporting engines and custom-built automation scripts.

There is more to an on-call rotation than a shared calendar with names on it and an agreement to call whoever is on the calendar if something goes wrong.

People are people, and you have to take that into consideration when setting up a rotation. And that means compromise, setting expectations, and consequences for not meeting them. Here are a few policies every rotation should have somewhere. Preferably easy to get to.

The rotation should be published well in advance, and easy to find.

This seems like an obvious thing, but it needs to be said. People need to know in advance when they're going to be obligated to pay attention to work in their usual time off. This allows them to schedule their lives around the on-call schedule, and you, as the on-call manager, will have to deal with fewer shift-swaps as a result. You're looking to avoid...

Um, I forgot I was on-call next week, and I'm going to be in Peru to hike the Andes. *sheepish look.*

This is less likely to happen if the shift schedule is advertised well in advance. For bonus points, add the shift schedules to their work calendars.

(US) Monday Holiday Law is a thing. Don't do shift swaps on Monday.

If you're doing weekly shifts, it's a good idea to not do your shift swap on Monday. Due to the US Monday Holiday Law there are five weeks in a year (10% of the total!) where your shift change will happen on an official holiday. Two of those are days that almost everyone gets off: Labor Day and Memorial Day.

Whether or not you need to avoid shift swaps on a non-work day depends a lot on how hand-offs work for your organization.

Set shift-handoff expectations.

When one watch-stander is relieved by the next, there needs to be a handoff. For some organizations it could be as simple as making sure the other person is there and responsive before stepping down. For others, it can be complicated as they have more state to transfer. State such as:

  • Ongoing issues being tracked.
  • Hardware replacements due during the next shift period.
  • Maintenance tasks not completed by the outgoing watch-stander.
  • Escalation engineers that won't be available.

And so on. If your organization has state to transfer, be sure you have a policy in place to ensure it is transferred.

Acknowledge time must be defined for alarms.

The maximum time a watch-stander is allowed to wait before ACKing an alarm must be defined by policy, and failure to meet that must be noticed. If the ACK time expires, the alarm should escalate to the next tier of on-call.

This is a very critical policy to define, as it allows watch-standers to predict how much life they can have while on-call. If the ACK time is 10 minutes, it means that if the watch-stander is driving they have to pull over to ack the alarm. A 10-minute ACK likely means they can't do anything long, like go to the movies or their kid's band recital.

This also goes for people who are on the escalation schedule. It may be different times, but it should still be defined.
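
For illustration, the ACK-or-escalate rule boils down to a loop like the sketch below. The page() and acked_by() functions are stand-ins; a real alerting service (PagerDuty and friends) implements this for you, you just set the timers.

    # Toy sketch of ACK-timeout escalation; page() and acked_by() are stand-ins.
    import time

    ACK_TIMEOUT = 10 * 60                                # seconds, per policy
    TIERS = ["primary", "secondary", "duty-manager"]     # hypothetical escalation path

    def page(who: str, alarm: str) -> None:
        print(f"paging {who}: {alarm}")                  # replace with a real notifier

    def acked_by(who: str) -> bool:
        return False                                     # replace with a real ACK check

    def dispatch(alarm: str) -> None:
        for tier in TIERS:
            page(tier, alarm)
            deadline = time.monotonic() + ACK_TIMEOUT
            while time.monotonic() < deadline:
                if acked_by(tier):
                    return                               # someone owns it; stop escalating
                time.sleep(30)
        print(f"nobody ACKed: {alarm}")                  # policy decides what happens now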

Response time must be defined for alarms.

The time to first response, actually working on the reported problem, must be defined by policy, and failure to meet that must be noticed. This may vary based on the type of alarm received, but it should still be defined for each alarm.

This is another very critical policy to define, and it is even more impactful on the watch-stander's ability to do anything other than be at work. The watch-stander pretty much has to stay within N minutes of a reliable Internet connection for their entire watch. If their commute to and from work is longer than N minutes, they can't be on-watch during their commute. And other things.

  • If 10 minutes or less, it is nearly impossible to do anything out of the house. Such as picking up prescriptions, going to the kid's band recital, soccer games, and theatre rehearsals.
  • If 20 minutes or less, quick errands out of the house may be doable, but that's about it.

As with ACK time, this should also be defined for the people on the escalation schedule.

Consequences for escalations must be defined.

People are people, and sometimes we can't get to the phone for some reason. Escalations will happen for both good reasons (ER visits) and bad (slept through it). The severity of a missed alarm varies based on the organization and what got missed, so this will be a highly localized policy. If an alarm is missed, or a pattern of missed alarms develops, there should be defined consequences.

This is the kind of thing that can be brought up in quarterly/annual reviews.


This gets its own section because it's very important:

The right shift length

How long your shifts should be is a function of several variables:

  • ACK time.
  • Response time.
  • Frequency of alarms.
  • Average time-to-resolution (TTR) for the alarms.

People need to sleep, and watch-standers are more effective if they're not sleep-deprived. Chronically sleep-deprived people have poor job satisfaction and are more likely to leave for greener pastures. Here is a table I made showing the combinations of alarm frequency and TTR, and showing on average how many minutes a watch-stander will have between periods of on-demand attention:

Alarm Freq \ TTR   < 5 min   5 to 10 min   10 to 20 min   up to 30 min   up to 60 min
15 min                  10             5              0              0              0
30 min                  25            20             10              0              0
60 min                  55            50             40             30              0
90 min                  85            80             70             60             30
2 hour                 115           110            100             90             60
4 hour                 235           230            220            210            180
6 hour                 355           350            340            330            300
Given that the average sleep cycle is 45 minutes, any combination with less slack than that is a shift during which the watch-stander will not be able to sleep. If you allow people time to get to sleep, say 20 minutes, that also rules out anything at or under 70 minutes of slack. For people who don't go to sleep easily (such as me) even the 80 and 85 minute slack times would be too little. The remaining combinations are the ones where sleeping while on-call is something that could happen.
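
The table is easy to regenerate with your own numbers; slack is just the gap between alarms minus the time spent working one, floored at zero. A quick sketch:

    # Regenerate the slack table: gap between alarms minus time-to-resolution.
    alarm_gaps = {"15 min": 15, "30 min": 30, "60 min": 60, "90 min": 90,
                  "2 hour": 120, "4 hour": 240, "6 hour": 360}
    ttrs = [5, 10, 20, 30, 60]   # the TTR column bounds, in minutes

    for label, gap in alarm_gaps.items():
        slack = [max(gap - ttr, 0) for ttr in ttrs]
        print(f"{label:>7}: " + "  ".join(f"{s:3d}" for s in slack))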

If your alarm frequency and TTR are high enough, say taking up 30% or more of the watch-stander's attention, you don't have an on-call rotation, you have a distributed NOC, and being on watch is a full-time job. Don't expect them to do anything else.

If your slack times are down in that no-sleep range, shifts shouldn't be longer than a day. And probably should be shorter.

If you're near that range, you probably should not have a week of that kind of thing, so shift lengths should be less than 7 days.

Shift lengths longer than these guidelines risk burnout and create rage-demons. We have too many rage-demons as it is, so please have a heart.


I believe this policy set provides the groundwork for a well defined on-call rotation. The on-call engineers know what is expected of them, and know the or-else if they don't live up to it.
