April 2015 Archives

I'm not a developer but...

...I'm sure spending a lot of time in code lately.

Really. Over the last five months, I'd say that 80% of my normal working hours have been spent grinding on Puppet code, or training others so they can maybe do some Puppet stuff. I've even gotten some continuous integration work in, building a trio of sanity tests for our Puppet infrastructure:

  • 'puppet parser validate' returns OK for all .pp files.
    • Still on the 'current' parser, we haven't gotten as far as future/puppet4 yet.
  • puppet-lint returns no errors for the modules we've cleared.
    • This required extensive style-fixes before I put it in.
  • Catalogs compile for a certain set of machines we have.
    • I'm most proud of this, as this check actually finds dependency problems, unlike 'puppet parser validate'.
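
As a rough illustration of the first two checks, here is a minimal Python sketch of what such a CI step could look like. The function names, the idea of passing in a list of already-cleared module directories, and the use of subprocess are all assumptions for illustration, not a description of our actual pipeline. The only real commands it leans on are 'puppet parser validate' and 'puppet-lint'. The catalog-compile check is left out because it depends on site-specific node data.

```python
import subprocess
from pathlib import Path


def find_manifests(root):
    """Return every .pp file under root, sorted for stable output."""
    return sorted(Path(root).rglob("*.pp"))


def build_checks(manifests, linted_modules):
    """Build the command lines for the syntax and style checks.

    linted_modules is a hypothetical list of module directories that
    have already been style-fixed; only those get puppet-lint.
    """
    checks = [["puppet", "parser", "validate", str(m)] for m in manifests]
    checks += [["puppet-lint", str(d)] for d in linted_modules]
    return checks


def run_checks(checks):
    """Run each command and return the ones that exited non-zero."""
    return [cmd for cmd in checks if subprocess.run(cmd).returncode != 0]
```

A CI job could call run_checks(build_checks(find_manifests("."), cleared_dirs)) and fail the build if the returned list is not empty.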

Completely unsurprisingly, the CI stuff has actually caught bugs before they got pushed to production! Whoa! One of these days I'll be able to grab some of the others and demo this stuff, but we're off-boarding a senior admin right now and the brain-dumping is not being done by me for a few weeks.

We're inching closer to getting things rigged so that a passing build in the 'master' branch triggers an automatic deployment. That will take a bit of thought, as some deploys (such as class-name changes) require coordinated modifications in other systems.

Why 'ASAP' is a craptastic deadline

Because I get to define what's 'possible', and anything is possible given enough time, management backing, and an unlimited budget.

If I don't have management backing, I will decide on my own how to fit this new ASAP in amongst my other ASAP work and the work that has actual deadlines attached to it.

If this ASAP has a time/money tradeoff, I need management backing to tell me which way to go. And what other work to let sluff in order to get the time needed.


In the end, there are only a few priority levels that people actually use.

  1. Realtime. I will stand here until I get what I need.
  2. ASAP.
  3. On this defined date or condition.
  4. Whenever you can get to it.

Realtime is a form of ASAP, but it's the kind of ASAP where the requester is highly invested in it and will keep statusing and may throw resources at it in order to get the thingy as soon as actually possible. Think major production outages.

ASAP is really 'as soon as you can get to it, unless I think that's not fast enough.' For sysadmin teams where the load-average is below the number of processors this can work pretty well. For loaded sysadmin teams, the results will not be to the liking of the open-ended deadline requesters.

On this defined date or condition is awesome, as it gives us expectations of delivery and allows us to do queue optimization.

Whenever you can get to it is like nicing a process. It'll be a while, but it'll be gotten to. Eventually.

"ASAP, but no later than [date]" is a much better way of putting it. It gives a hint to the queue optimizer as to where to slot the work amongst everything else.
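
To make the queue-optimizer point concrete, here is a small Python sketch of one way a "no later than" date turns into something sortable. The Request type, the level constants, and the policy of putting undated work behind dated work are assumptions of mine for illustration; a real team's ordering will be messier.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# The four priority levels from the list above; lower sorts first.
REALTIME, ASAP, DATED, WHENEVER = 1, 2, 3, 4


@dataclass
class Request:
    title: str
    level: int
    due: Optional[date] = None  # the "no later than" date, if one was given


def queue_key(req):
    """Sort key: realtime first, then dated work by date, then the rest.

    Undated work can only be ordered by its level, so under this
    (assumed) policy it lands behind anything with a date attached.
    """
    if req.level == REALTIME:
        return (0, date.min, req.level)
    if req.due is not None:
        return (1, req.due, req.level)
    return (2, date.max, req.level)
```

Sorting a list of requests with sorted(requests, key=queue_key) puts the dated ASAP ahead of the bare ASAP, which is the whole point of attaching a date.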

Thank you.

Paternity leave and on-call

It all started with this tweet.

Which you need to read (Medium.com). Some pull-quotes of interest:

My manager probably didn't realize that "How was your vacation" was the worst thing to ask me after I came back from paternity leave.

Patriarchy would have us believe that parenting is primarily the concern of the mother. Therefore paternity leave is a few extra days off for dad to chillax with his family and help mom out.

Beyond a recovery time from pregnancy, much of parental leave is learning to be a parent and adjusting to your new family and bonding with the baby. I can and did bond with the baby, but not as much as my female coworkers bonded with their babies.

I should also state, that I don't just want equality, I want a long time to bond with my child. Three months or more sounds nice. Not only can I learn to soothe him when he's upset, put him to sleep without worrying about being paged, but I can be around when he does the amazing things babies do in their first year: learning to sit, crawl, eat, stand and even walk.

At my current employer, I was shocked to learn that new dads get two weeks off.

Two.

At my previous startup, paternal leave was under the jurisdiction of the 'unlimited vacation' policy. Well...

Vacations are important. My friends would joke that the one way to actually be able to take vacations was to keep having children. Here the conflation was in jest, and also a caricature of the reality of vacations at startups.

We had a bit of a baby-boom while I was there. Dads were glared at if they showed up less than two weeks in and told to go home. After that, most of them worked part-time for a few weeks and slowly worked up to full time.

This article caused me to tweet...

The idea here is that IT managers who work for a company like mine, with a really small amount of parental leave, do have a bit of power to give Dad more time with the new kid: take him off the call rota for a while. A better corporate policy is ideal, but this is the kind of local fix that just might help. Dad doesn't have to live by both the pager and the new kid.

Interesting idea, but not a great one.

Which is a critique of the disaster-resilience of 3-person teams. I was on one, and we had to coordinate Summer Vacation Season to ensure we had two-person coverage for most of it, and if one-person coverage was unavoidable, keep it to a couple of days at most. None of us had kids while I was there (the other two had teenagers, and I wasn't about to start), so we didn't get to live through a paternity-leave sized hole in coverage.

Which is the kind of team I'm on right now, and why I thought of the idea. We have enough people that a person sized hole, even a Sr. Engineer sized hole, can be filled for several to many weeks in the rotation.

That's the ideal route though, and touches on a very human point: if you're in a company where you always check mail or can expect pages off-hours, it doesn't matter if you're not in the official call-rotation. That's a company culture problem independent of the on-call rotation.

My idea can work, but it takes the right culture to pull off. Extended leave would be much better, and is the kind of thing we should be advocating for.

You should still read the article.

The project is done, and you have a monitoring system you like!

Now, how do you keep liking it?

Like all good things, it takes maintenance. There are a few processes you should have in place to provide the right feedback loops to keep liking your shiny new monitoring environment.

  • Questions about monitoring should be in your incident retrospective process.
  • A periodic review of active alarms to be sure you still really want them.

Implementing these will provide both upward pressure to expand it into areas it needs to go, and downward pressure to get rid of needless noise.

There is more to a monitoring system than alarms and reports. Behind all of those cries for action is a lot of data. A lot of data. So much data that you face scaling problems when you grow, because all of your systems generate so much monitoring data.

Monitor everything!
-- Boss

A great idea in principle, but it falls apart in one key way...

"Define 'everything', please"

'Everything' means different things for different people. It just isn't feasible to track every monitorable on everything with one-second granularity. Just about everyone will want to back away from that level of monitor-all-the-things. But what is the right fit?

It depends on what you want to do with it. Data being tracked supports four broad categories of monitoring.

  1. Performance
  2. Operational
  3. Capacity
  4. SLA

Performance Monitoring

This kind of monitoring tends to have short intervals between polls. It could be five minutes, but may be as short as every second. That kind of monitoring will create a deluge of data, and may only be done when diagnosing exposed problems or doing in-depth research on the system. It's not run all the time, unless you really do care about per-second changes in the state of something.

This kind of monitoring is defined by a few attributes:

  • High granularity. You poll a lot.
  • Low urgency. You're doing this because you're looking into something, not because it's down.
  • Occasional need. You don't run it all the time, and not on a schedule.

Everything: 1 second granularity for CPU, IOPS, and pagefaults for a given cluster.


Operational Monitoring

The kind we're all familiar with. This is the kind of monitoring that tends to emit alarms for on-call rotations.

  • Medium granularity. Every 5 minutes, that kind of thing.
  • High urgency. Fast responses are needed.
  • Constant need. You run it all the time.

Everything: Every disk-event, everywhere.


Capacity Monitoring

Some of the alarms you have defined already may be capacity alarms, but capacity monitoring just tracks how much you are using of what you have. Some of this stuff doesn't change very fast.

  • Low granularity. It may only get checked once a day.
  • Low urgency. Responding in a couple of days may be fast enough. If not slower.
  • Periodic need. Reviewed once in a while, probably on a schedule.

Everything: Anything that has a "Max" size value greater than the "Current" value.


SLA Monitoring

I've already gone on at length about SLAs, but this is the monitoring that directly supports the SLA pass/fail metrics. I break it apart from the other types because of how it's accessed.

  • Low granularity. Some metrics may be medium, but in general SLA trackers are over-time style.
  • Medium urgency. If a failing grade is determined, response needs to happen. How fast, depends on what's not going to get met.
  • Continual and Periodic need. Some things will be monitored continually, others will only be checked on long schedules; possibly once a week, if not once a month.

Everything: Everything it takes to build those reports.


Be aware that 'everything' is context-sensitive when you're talking with people and don't freak out when a grand high executive says, "everything," at you. They're probably thinking about the SLA Monitoring version of everything, which is entirely manageable.
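
One way to keep the four flavors straight is to write them down as data. The following Python sketch is only that, a lookup table of the attributes listed above; the MonitoringProfile type and the everything_for helper are hypothetical names I made up for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitoringProfile:
    granularity: str
    urgency: str
    need: str
    everything: str  # what "everything" means in this context


# Values are taken straight from the four sections above.
PROFILES = {
    "performance": MonitoringProfile(
        "high", "low", "occasional",
        "1 second granularity for CPU, IOPS, and pagefaults for a given cluster",
    ),
    "operational": MonitoringProfile(
        "medium", "high", "constant",
        "every disk-event, everywhere",
    ),
    "capacity": MonitoringProfile(
        "low", "low", "periodic",
        'anything that has a "Max" size value greater than the "Current" value',
    ),
    "sla": MonitoringProfile(
        "low", "medium", "continual and periodic",
        "everything it takes to build those reports",
    ),
}


def everything_for(category):
    """Translate a bare 'everything' into what it means for one category."""
    return PROFILES[category.lower()].everything
```

So when the grand high executive says "everything", everything_for("SLA") is probably the answer they have in mind.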

Don't panic, and keep improving your monitoring system.

In the last article we created a list of monitorables and things that look like the kind of alarms we want to see.

Now what?

First off, go back to the list of alarms you already have. Go through those and see which of those existing alarms directly support the list you just created. It may be depressing how few of them do, but rejoice! Fewer alarms mean fewer emails!

What does 'directly support' mean?

Let's look at one monitorable and see what kind of alarms might directly or indirectly support it.

Main-page cluster status.

There are a number of alarms that could already be defined for this one.

  • Main-page availability as polled directly on the load-balancer.
  • Pingability of each cluster member.
  • Main-page reachability on each cluster member.
  • CPU/Disk/Ram/Swap on each cluster member.
  • Switch-port status for the load-balancer and each cluster-member.
  • Webserver process existence on each cluster member.
  • Webserver process CPU/RAM usage on each cluster member.

And more, I'm sure. That's a lot of data, and we don't need to define alarms for all of it. The question to ask is, "How do I determine the status of the cluster?"

The answer could be, "All healthy nodes behind the load-balancer return the main-page, with at least three nodes behind the load-balancer for fault tolerance." This leads to a few alarms we'd like to see:

  • Cluster has dropped below minimum quorum.
  • Node ${X} is behind the load-balancer but serving up errors.
  • The load-balancer is not serving any pages.

We can certainly track all of those other things, but we don't need alarms on them. They will come in handy when the below-quorum alarm is responded to. The three alarms above are what I'd call directly supporting. The rest are indirect indicators, and we don't need PagerDuty to tell us about them; we'll find them ourselves once we start troubleshooting the actual problem.
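
To show how little input those three alarms really need, here is a minimal Python sketch. The shape of the inputs (one boolean for the load-balancer, plus a pair of booleans per node) is a hypothetical stand-in for whatever your poller reports, and the threshold of three comes from the example answer above.

```python
MIN_HEALTHY_NODES = 3  # "at least three nodes" from the example answer


def cluster_alarms(lb_serving, nodes):
    """Return the directly-supporting alarms for the main-page cluster.

    lb_serving: True if the load-balancer returns the main page.
    nodes: dict of node name -> (in_rotation, serving_ok) booleans.
    """
    alarms = []
    if not lb_serving:
        alarms.append("The load-balancer is not serving any pages.")
    healthy = 0
    for name, (in_rotation, serving_ok) in sorted(nodes.items()):
        if not in_rotation:
            continue  # nodes out of rotation don't count either way
        if serving_ok:
            healthy += 1
        else:
            alarms.append(
                f"Node {name} is behind the load-balancer but serving up errors."
            )
    if healthy < MIN_HEALTHY_NODES:
        alarms.append("Cluster has dropped below minimum quorum.")
    return alarms
```

Note that CPU, RAM, switch-port status and the rest never show up as inputs. They stay in the monitoring system as troubleshooting data, not as alarm triggers.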


Now that we have a list of existing alarms we want to keep and a list of alarms we'd like to have, the next step is determining when we want to be alarmed.

You are in the weeds.

You're getting a thousand or more alarms over the course of a weekend. You can't look at them all, that would require not sleeping. And you actually do sleep. How do you cope with it?

Lots of email rules, probably. Send the select few you actually care about to the email-to-sms gateway for your phone. File others into the special folders that make your phone go ♫♬♫bingle♬♫♬. And mark-all-read the folder with 1821 messages in it when you get to the office on Monday (2692 on Tuesday after a holiday Monday).

Your monitoring system is OK. It does catch stuff, but the trick is noticing the alarm in all the noise.

You want to get to the point where alarms come rarely, and get acted upon when they show up. 1821 messages a weekend is not that system. Over a 60-hour weekend, 1821 messages is one message every two minutes. Or, if it's like most monitoring systems, it's a few messages an hour with a couple of bursts of hundreds over the course of a few polling cycles as something big flaps and everything behind it goes 'down'. That alarming load is only sustainable with a fully staffed round-the-clock NOC.
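
If you want to run the same back-of-the-envelope math on your own inbox, this is all it takes. The function name is just something I picked for the sketch.

```python
def minutes_per_message(messages, hours):
    """Average gap between messages, in minutes, over the given window."""
    return hours * 60 / messages


# The weekend from above: 1821 messages over 60 hours.
weekend_gap = minutes_per_message(1821, 60)
print(round(weekend_gap, 2))  # 1.98, so roughly one message every two minutes
```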

Very few of us have those.

Paring down the load requires asking a few questions:

A synthetic theory of monitoring

LISA 2013 was very good to me. I saw a lot of sessions about monitoring, theories of, and I've spent most of 2014 trying to revise the monitoring system at work to be less sucky and more awesome. It's mostly worked, and is an awesome-thing that's definitely going on my resume.

Credit goes primarily to two sources:

  • A Working Theory of Monitoring, by Caskey L Dickson of Google, from LISA 2013.
  • SRE University: Non-abstract large systems design for sysadmins, by John Looney and company, of Google, also at LISA 2013.

It was these sessions that inspired me to refine a slide in A Working Theory of Monitoring and put my own spin on it:

I'll be explaining this further, but this is what the components of a monitoring system look like. Well, monitoring ecosystem, since there is rarely a single system that does everything. There are plenty of companies that will sell you a product that does everything, but even they get supplemented by home-brew reporting engines and custom-built automation scripts.