Recently in sysadmin Category

Or, a post I never thought I'd make, seeing as I'm a sysadmin.

But it seems I'm the senior git expert in my team, so I'm making it. So odd.


There are a series of questions you should ask among your team before moving a repo over to git. Git is a hell of a toolbox, and like all toolboxes there are nearly infinite ways of using it. There is no one true way, only ways that are better for you than others. These are a series of questions to help you figure out how you want to use it, so you can be happier down the road.

Q: How do you use the commit-log?

History is awesome. Looking back five years in the code repository to figure out WTF a past developer was thinking when they wrote that bit of spaghetti code is quite useful if that commit includes something like, "found weird-ass edge case in glib, this is the workaround until they get a fix." That's actionable. Maybe it's even tied to a bug number in the bug tracking system, or a support ticket.

Do you ever look through the history? What are you looking for? Knowing this allows you to learn what you want out of your source-control.
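If you already find yourself doing that kind of archaeology, the usual tools are git blame, git log, and the pickaxe search. A quick sketch; the file name, line range, and search string here are just examples:

    # Who last touched these lines, and in which commit?
    git blame -L 120,140 lib/parser.rb

    # Full history of one file, patches included, following renames.
    git log -p --follow lib/parser.rb

    # Pickaxe: every commit that added or removed this string.
    git log -S 'weird-ass edge case' --oneline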

Q: What is the worth of a commit?

A commit in Git is not the same thing as a commit in SVN, Fog, or ClearCase. In some systems a commit, or checkin, is a pretty big thing: it takes reviews or approvals before it can be made.

This question is there to get you thinking about what a commit is. Commits in git are cheap, and that changes things. Knowing that you will be facing a lot more of them than you did in the past will help guide you through the later questions.

Q: Is every commit sacred, or do you value larger, well-documented commits more?

Practically everyone I know has made a commit with the message 'asdf'. If you're grinding on a stupid thing, it may take you 19 commits to come up with the two lines of code that actually work. In five years, when you come back to look at that line of code, the final commit-message on those lines might be:

a1bd0809 maybe this will work

Not exactly informative.

bdc8671a Reformat method calls to handle new version of nokogiri

That is informative.

Most projects value more informative commits over lots of little, iterative ones. But your team may be different. And it may change its mind after it has had some experience.
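If your team lands on the informative-commit side, git will happily let you fold those 19 grinding commits into one before anyone else sees them. A minimal sketch, assuming the work lives on a local branch that hasn't been pushed yet:

    # Interactively rewrite the last 19 commits: keep the first as 'pick',
    # mark the rest 'squash' or 'fixup', then write one informative message.
    git rebase -i HEAD~19

    # Or, if only the very last commit needs a better message:
    git commit --amend

Only rewrite history that hasn't been shared; rebasing commits that others have already pulled is how you make enemies.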

Q: Should new features be all in one commit, or in a few modular commits?

Some features are quite large. So large that rebasing them into a single commit leads to a diff of hundreds of lines. Such a large feature means that the history on those files will be slathered with the same initial-feature commit, with no context for why the code is the way it is.

Is that good enough? Maybe it is; maybe you're more interested in the hotfix commits that fix bugs and explain non-intuitive behavior and workarounds. Maybe it isn't, and you need each sub-feature in its own commit. Or maybe you want to keep every non-fixup commit.

This is where your approach to the history really informs your decision. If you know how you deal with the past, you will be better able to put process in place to be happier with your past self.
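The merge strategy is where that choice actually gets made. A sketch of the two extremes, with made-up branch names:

    # One commit per feature: collapse the whole branch into a single,
    # well-documented commit on master.
    git checkout master
    git merge --squash feature/reporting
    git commit

    # Modular commits: keep the branch's individual commits, but record an
    # explicit merge so the feature still reads as a unit in the history.
    git merge --no-ff feature/reporting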


Once you've thought about these questions and your answers to them, you'll be better able to consider the deeper problem of branching strategy. Git is notoriously lacking in undo features, at least in shared repos, so getting this out of the way early is good.

I've seen this dynamic happen a couple of times now. It goes kind of like this.

October: We're going all in on AWS! It's the future. Embrace it.
November: IT is working very hard on moving us there, thank you for your patience.
December: We're in! Enjoy the future.
January: This AWS bill is intolerable. Turn off everything we don't need.
February: Stop migrating things to AWS, we'll keep these specific systems on-prem for now.
March: Move these systems out of AWS.
April: Nothing gets moved to AWS unless it produces more revenue than it costs to run.

What's prompting this is a shock that is entirely predictable, but manages to penetrate the reality distortion field of upper management because the shock is to the pocketbook. They notice that kind of thing. To illustrate what I'm talking about, here is a made-up graph showing technology spend over a course of several years.

[Graph: BudgetType-AWS.png — technology spend over several years, traditional vs. AWS]

The AWS line actually results in more money over time, as AWS does a good job of capturing costs that the traditional method generally ignores or assumes are lost in general overhead. But the screaming doesn't happen at the end of four years when they run the numbers; it happens in month four, when the ongoing operational spend after build-out is done is w-a-y over what it used to be.

The spikes for traditional on-prem work are for forklifts of machinery showing up. Money is spent, new things show up, and they impact the monthly spend only infrequently. In this case, the base-charge increased only twice over the time-span. Some of those spikes are for things like maintenance-contract renewals, which don't impact base-spend one whit.

The AWS line is much less spiky, as new capabilities are absorbed into the base budget on an ongoing basis. You're no longer dropping $125K in a single go, you're dribbling it out over the course of a year or more. AWS price-drops mean that the monthly spend actually goes down a few times.

Pay only for what you use!

Amazon is great at pointing that out, and highlighting the convenience of it. But what they don't mention is that by doing so, you will learn the hard way what it is you really use. The AWS Calculator is an awesome tool, but if you don't know how your current environment behaves, predicting what you'll end up spending with it is like throwing darts at a wall. You end up obsessing over small line-item charges you've never had to worry about before (how many IOPS do we do? Crap! I don't know! How many thousands will that cost us?), and missing the big items that nail you (Whoa! They meter bandwidth between AZs? Maybe we shouldn't be running our Hadoop cluster in multi-AZ mode).

There is a reason that third party AWS integrators are a thriving market.

Also, this 'what you use' is not subject to Oops exceptions without a lot of wrangling with Account Management. Had something that downloaded the entire EPEL repo twice a day for a month, and only learned about it when your bandwidth charge was 9x what it should be? Too bad, pay up or we'll turn the account off.

Unlike the forklift model, you pay for it every month without fail. If you have a bad quarter, you can't just not pay the bill for a few months and true-up later. You're spending it, or they're turning your account off. This takes away some of the cost-shifting flexibility the old style had.

Unlike the forklift model, AWS prices its stuff assuming a three year turnover rate. Many companies have a 5 to 7 year lifetime for IT assets: three to four years in production, with an afterlife of two to five years in various pre-prod, mirror, staging, and development roles. The cost of those assets therefore amortizes over 5-9 years, not 3.
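Back-of-the-envelope, that difference matters. A quick sketch using the made-up $125K forklift purchase from above; the numbers are purely illustrative:

    # $125,000 of hardware, amortized monthly over different lifetimes.
    echo '125000 / (3 * 12)' | bc    # 3-year model: ~$3472/month
    echo '125000 / (7 * 12)' | bc    # 7-year on-prem life: ~$1488/month

Same asset, very different monthly number, which is part of why the AWS bill looks so alarming to someone used to the on-prem spreadsheet.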

Predictable spending, at last.

Hah.

Yes, it is predictable over time, given an accurate understanding of what is under management. But when your initial predictions end up being wildly off, it seems like it isn't predictable. It seems like you're being raked over the coals.

And when you get a new system into AWS and the cost forecast is wildly off, it doesn't seem predictable.

And when your system gets the rocket-launch you've been craving and you're scaling like mad; but the scale-costs don't match your cost forecast, it doesn't seem predictable.

It's only predictable if you fully understand the cost items and how your systems interact with them.

Reserved instances will save you money

Yes! They will! Quite a lot of it, in fact. They let a company go back to the forklift-method of cost-accounting, at least for part of it. I need 100 m3.large instances, on a three year up-front model. OK! Monthly charges drop drastically, and the monthly spend chart begins to look like the old model again.

Except.

Reserved instances cost a lot of money up front. That's the point, that's the trade-off for getting a cheaper annual spend. But many companies get into AWS because they see it as cheaper than on-prem. Which means they're sensitive to one-month cost-spikes, which in turn means buying reserved instances doesn't happen and they stay on the high cost on-demand model.

AWS is Elastic!

Elastic in that you can scale up and down at will, without week- or month-long billing, delivery, and integration cycles.

Elastic in that you have choice in your cost accounting methods. On-demand, and various kinds of reserved instances.

It is not elastic in when the bill is due.

It is not elastic with individual asset pricing, no matter how special you are as a company.


All of these things trip up upper, non-technical management. I've seen it happen three times now, and I'm sure I'll see it again at some point.

Maybe this will help you in illuminating the issues with your own management.

I'm not a developer but...


...I'm sure spending a lot of time in code lately.

Really. Over the last five months, I'd say that 80% of my normal working hours are spent grinding on puppet code. Or training others so they can maybe do some puppet stuff themselves. I've even gotten some continuous integration work in, building a trio of sanity-tests for our puppet infrastructure (a rough sketch follows the list):

  • 'puppet parser validate' returns OK for all .pp files.
    • Still on the 'current' parser, we haven't gotten as far as future/puppet4 yet.
  • puppet-lint returns no errors for the modules we've cleared.
    • This required extensive style-fixes before I put it in.
  • Catalogs compile for a certain set of machines we have.
    • I'm most proud of this, as this check actually finds dependency problems, unlike 'puppet parser validate'.
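
For the curious, a minimal sketch of how those three checks might be strung together into one script; the repo layout, module list, and node names here are made up, not our actual setup:

    #!/bin/bash
    # Puppet repo sanity checks. Paths and node names are illustrative.
    failed=0

    # 1. Syntax: every .pp file must pass the parser.
    find modules manifests -name '*.pp' -print0 \
      | xargs -0 puppet parser validate || failed=1

    # 2. Style: lint only the modules that have already been cleaned up.
    puppet-lint --fail-on-warnings modules/ntp modules/sshd || failed=1

    # 3. Catalogs: compile against a representative set of nodes. This is the
    #    check that catches missing classes and broken dependencies.
    for node in web01.example.com db01.example.com; do
      puppet master --compile "$node" > /dev/null || failed=1
    done

    exit $failed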

Completely unsurprisingly, the CI stuff has actually caught bugs before they got pushed to production! Whoa! One of these days I'll be able to grab some of the others and demo this stuff, but we're off-boarding a senior admin right now and the brain-dumping won't be done by me for a few weeks.

We're inching closer to getting things rigged so that a passing build on the 'master' branch triggers an automatic deployment. That will take a bit of thought, as some deploys (such as class-name changes) require coordinated modifications in other systems.

Because I get to define what's 'possible', and anything is possible given enough time, management backing, and an unlimited budget.

If I don't have management backing, I will decide on my own how to fit this new ASAP in amongst my other ASAP work and the work that has actual deadlines attached to it.

If this ASAP has a time/money tradeoff, I need management backing to tell me which way to go. And what other work to let sluff in order to get the time needed.


In the end, there are only a few priority levels that people actually use.

  1. Realtime. I will stand here until I get what I need.
  2. ASAP.
  3. On this defined date or condition.
  4. Whenever you can get to it.

Realtime is a form of ASAP, but it's the kind of ASAP where the requester is highly invested in it, will keep asking for status, and may throw resources at it in order to get the thingy as soon as actually possible. Think major production outages.

ASAP is really 'as soon as you can get to it, unless I think that's not fast enough.' For sysadmin teams where the load-average is below the number of processors this can work pretty well. For loaded sysadmin teams, the results will not be to the liking of the open-ended deadline requestors.

On this defined date or condition is awesome, as it gives us expectations of delivery and allows us to do queue optimization.

Whenever you can get to it is like nicing a process. It'll be a while, but it'll be gotten to. Eventually.

"ASAP, but no later than [date]" is a much better way of putting it. It gives a hint to the queue optimizer as to where to slot the work amongst everything else.

Thank you.

Paternity leave and on-call


It all started with this tweet.

Which you need to read (Medium.com). Some pull-quotes of interest:

My manager probably didn't realize that "How was your vacation" was the worst thing to ask me after I came back from paternity leave.

Patriarchy would have us believe that parenting is primarily the concern of the mother. Therefore paternity leave is a few extra days off for dad to chillax with his family and help mom out.

Beyond a recovery time from pregnancy, much of parental leave is learning to be a parent and adjusting to your new family and bonding with the baby. I can and did bond with the baby, but not as much as my female coworkers bonded with their babies.

I should also state, that I don't just want equality, I want a long time to bond with my child. Three months or more sounds nice. Not only can I learn to soothe him when he's upset, put him to sleep without worrying about being paged, but I can be around when he does the amazing things babies do in their first year: learning to sit, crawl, eat, stand and even walk.

At my current employer, I was shocked to learn that new dads get two weeks off.

Two.

At my previous startup, paternity leave fell under the jurisdiction of the 'unlimited vacation' policy. Well...

Vacations are important. My friends would joke that the one way to actually be able to take vacations was to keep having children. Here the conflation was in jest, and also a caricature of the reality of vacations at startups.

We had a bit of a baby-boom while I was there. Dads were glared at if they showed up less than two weeks in and told to go home. After that, most of them worked part-time for a few weeks and slowly worked up to full time.

This article caused me to tweet...

The idea here is that IT managers who work for a company like mine, with a really small amount of parental leave, do have a bit of power to give Dad more time with the new kid: take him off of the call rota for a while. A better corporate policy is ideal, but this is the kind of local fix that just might help. Dad doesn't have to live tied to both the pager and the new kid.

Interesting idea, but not a great one.

Which is a critique of the disaster-resilience of 3-person teams. I was on one, and we had to coordinate Summer Vacation Season to ensure we had two-person coverage for most of it, and if one-person coverage was unavoidable, keep it to a couple of days at most. None of us had kids while I was there (the other two had teenagers, and I wasn't about to start), so we didn't get to live through a paternity-leave sized hole in coverage.

Which is the kind of team I'm on right now, and why I thought of the idea. We have enough people that a person sized hole, even a Sr. Engineer sized hole, can be filled for several to many weeks in the rotation.

That's the ideal route though, and touches on a very human point: if you're in a company where you always check mail or can expect pages off-hours, it doesn't matter if you're not in the official call-rotation. That's a company culture problem independent of the on-call rotation.

My idea can work, but it takes the right culture to pull off. Extended leave would be much better, and is the kind of thing we should be advocating for.

You should still read the article.

The project is done, and you have a monitoring system you like!

Now, how do you keep liking it?

Like all good things, it takes maintenance. There are a few processes you should have in place to provide the right feedback loops to keep liking your shiny new monitoring environment.

  • Questions about monitoring should be in your incident retrospective process.
  • A periodic review of active alarms to be sure you still really want them.

Implementing these will provide both upward pressure to expand it into areas it needs to go, and downward pressure to get rid of needless noise.

There is more to a monitoring system than alarms and reports. Behind all of those cries for action is a lot of data. A lot of data. So much data that you can face scaling problems as you grow, simply because all of your systems generate so much of it.

Monitor everything!
-- Boss

A great idea in principle, but falls apart in one key way...

"Define 'everything', please"

'Everything' means different things to different people. It just isn't feasible to track every monitorable on everything with one-second granularity. Just about everyone will want to back away from that level of monitor-all-the-things. But what is the right fit?

It depends on what you want to do with it. Data being tracked supports four broad categories of monitoring.

  1. Performance
  2. Operational
  3. Capacity
  4. SLA

Performance Monitoring

This kind of monitoring tends to have short intervals between polls. It could be five minutes, but may be as little as every second. That kind of monitoring will create a deluge of data, and may only be done when diagnosing exposed problems or doing in-depth research on the system. It's not run all the time, unless you really do care about per-second changes in the state of something.

This kind of monitoring is defined by a few attributes:

  • High granularity. You poll a lot.
  • Low urgency. You're doing this because you're looking into something, not because it's down.
  • Occasional need. You don't run it all the time, and not on a schedule.

Everything: 1 second granularity for CPU, IOPS, and pagefaults for a given cluster.
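To make 'deluge' concrete, here is the rough sample count that kind of polling produces, with a made-up 20-node cluster:

    # 3 metrics (CPU, IOPS, pagefaults) x 20 nodes x 86400 seconds/day
    echo $(( 3 * 20 * 86400 ))    # 5,184,000 samples per day, per cluster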


Operational Monitoring

The kind we're all familiar with. This is the kind of monitoring that tends to emit alarms for on-call rotations.

  • Medium granularity. Every 5 minutes, that kind of thing.
  • High urgency. Fast responses are needed.
  • Constant need. You run it all the time.

Everything: Every disk-event, everywhere.


Capacity Monitoring

Some of the alarms you have defined already may be capacity alarms, but capacity monitoring just tracks how much you are using of what you have. Some of this stuff doesn't change very fast.

  • Low granularity. It may only get checked once a day.
  • Low urgency. Responding in a couple of days may be fast enough. If not slower.
  • Periodic need. Reviewed once in a while, probably on a schedule.

Everything: Anything that has a "Max" size value greater than the "Current" value.


SLA Monitoring

I've already gone on at length about SLAs, but this is the monitoring that directly supports the SLA pass/fail metrics. I break it apart from the other types because of how it's accessed.

  • Low granularity. Some metrics may be medium, but in general SLA trackers are over-time style.
  • Medium urgency. If a failing grade is determined, response needs to happen. How fast, depends on what's not going to get met.
  • Continual and Periodic need. Some things will be monitored continually, others will only be checked on long schedules; possibly once a week, if not once a month.

Everything: Everything it takes to build those reports.


Be aware that 'everything' is context-sensitive when you're talking with people and don't freak out when a grand high executive says, "everything," at you. They're probably thinking about the SLA Monitoring version of everything, which is entirely manageable.

Don't panic, and keep improving your monitoring system.

In the last article we created a list of monitorables and things that look like the kind of alarms we want to see.

Now what?

First off, go back to the list of alarms you already have. Go through those and see which of those existing alarms directly support the list you just created. It may be depressing how few of them do, but rejoice! Fewer alarms mean fewer emails!

What does 'directly support' mean?

Let's look at one monitorable and see what kind of alarms might directly or indirectly support it.

Main-page cluster status.

There are a number of alarms that could already be defined for this one.

  • Main-page availability as polled directly on the load-balancer.
  • Pingability of each cluster member.
  • Main-page reachability on each cluster member.
  • CPU/Disk/Ram/Swap on each cluster member.
  • Switch-port status for the load-balancer and each cluster-member.
  • Webserver process existence on each cluster member.
  • Webserver process CPU/RAM usage on each cluster member.

And more, I'm sure. That's a lot of data, and we don't need to define alarms for all of it. The question to ask is, "How do I determine the status of the cluster?"

The answer could be, "All healthy nodes behind the load-balancer return the main-page, with at least three nodes behind the load-balancer for fault tolerance." This leads to a few alarms we'd like to see:

  • Cluster has dropped below minimum quorum.
  • Node ${X} is behind the load-balancer but serving up errors.
  • The load-balancer is not serving any pages.

We can certainly track all of those other things, but we don't need alarms on them. Those will come in handy when the below-quorum alarm is responded to. This list is what I'd call directly supporting. The rest are indirect indicators, and we don't need PagerDuty to tell us about them; we'll find them ourselves once we start troubleshooting the actual problem.
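To make the first of those concrete, here is a rough sketch of a below-quorum check in the Nagios style; the node names, URL, and threshold are all made up:

    #!/bin/bash
    # Alarm if the main-page cluster drops below its minimum healthy node count.
    MIN_NODES=3
    NODES="web01 web02 web03 web04"
    healthy=0

    for n in $NODES; do
      # Ask each cluster member for the main page directly, bypassing the LB.
      if curl -sf -o /dev/null --max-time 5 "http://$n/"; then
        healthy=$(( healthy + 1 ))
      fi
    done

    if [ "$healthy" -lt "$MIN_NODES" ]; then
      echo "CRITICAL: only $healthy healthy nodes, minimum is $MIN_NODES"
      exit 2    # Nagios-style critical
    fi
    echo "OK: $healthy healthy nodes"
    exit 0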


Now that we have a list of existing alarms we want to keep and a list of alarms we'd like to have, the next step is determining when we want to be alarmed.

You are in the weeds.

You're getting a thousand or more alarms over the course of a weekend. You can't look at them all; that would require not sleeping. And you actually do sleep. How do you cope with it?

Lots of email rules, probably. Send the select few you actually care about to the email-to-sms gateway for your phone. File others into the special folders that make your phone go bingle. And mark-all-read the folder with 1821 messages in it when you get to the office on Monday (2692 on Tuesday after a holiday Monday).

Your monitoring system is OK. It does catch stuff, but the trick is noticing the alarm in all the noise.

You want to get to the point where alarms come rarely, and get acted upon when they show up. 1821 messages a weekend is not that system. Over a 60 hour weekend, 1821 messages is one message every two minutes. Or, if it's like most monitoring systems, it's a few messages an hour with a couple of bursts of hundreds over the course of a few polling-cycles as something big flaps and everything behind it goes 'down'. That alarm load is only sustainable with a fully staffed, round-the-clock NOC.

Very few of us have those.

Paring down the load requires asking a few questions:

LISA 2013 was very good to me. I saw a lot of sessions about monitoring and theories thereof, and I've spent most of 2014 trying to revise the monitoring system at work to be less sucky and more awesome. It's mostly worked, and is an awesome-thing that's definitely going on my resume.

Credit goes primarily to two sources:

  • A Working Theory of Monitoring, by Caskey L Dickson of Google, from LISA 2013.
  • SRE University: Non-abstract large systems design for sysadmins, by John Looney and company, of Google, also at LISA 2013.

It was these sessions that inspired me to refine a slide in A Working Theory of Monitoring and put my own spin on it:

I'll be explaining this further, but this is what the components of a monitoring system look like. Well, monitoring ecosystem, since there is rarely a single system that does everything. There are plenty of companies that will sell you a product that does everything, but even they get supplemented by home-brew reporting engines and custom-built automation scripts.
