Recently in sysadmin Category

The origins of on-call work


On September 6th, Susan Fowler posted an article titled, "Who's on-call?", talking about evolving on-call duties between development teams and SRE teams. She has this quote at the top:

I'm not sure when in the history of software engineering separate operations organizations were built and run to take on the so-called "operational" duties associated with running software applications and systems, but they've been around for quite some time now (by my research, at least the past twenty years - and that's a long time in the software world).

My first job was with a city government, and many of the people I was working with started at that city when they decided to computerize in 1978. Most of them have retired or died off by now. In 1996, when I started there, the original dot-com boom was very much on the upswing, and that city was still doing things the way they'd been done for years.

I got into the market in time to see the tail end of that era. One of the things I learned there was the origins of many of the patterns we see today. To understand the origins of on-call in IT systems, you have to go back to the era of serial networking, when 'minicomputer' was still distinct from 'microcomputer', both of them marketing terms meant to differentiate those machines from the 'mainframe'.

IT systems of the era employed people to do things we wouldn't even consider today, or would work our damnedest to automate out of existence. There were people who had, as their main job, duties such as:

  • Entering data into the computer from paper forms.
    • Really. All you did all day was punch in codes. Computer terminals were not on every desk, so specialists were hired to do it.
    • The worst part is: there are people still doing this today.
  • Kick off backups.
  • Change backup tapes when the computer told them to.
  • Load data-tapes when the computer told them to.
    • Tape stored more than spinning rust, so it was used as a primary storage medium. Disk was for temp-space.
    • I spent a summer being a Tape Librarian. My job was roboticized away.
  • Kick off the overnight print-runs.
  • Collate printer output into reports, for delivery to the mailroom.
  • Execute the overnight batch processes.
    • Your crontab was named 'Stephen,' and you saw him once a quarter at the office parties. Usually very tired-looking.
  • Monitor system usage indicators by hand, and log them in a paper logbook.
  • Keep an Operations Log of events that happened overnight, for review by the Systems Programmers in the morning.
  • Follow runbooks given to them by Systems Programming for performing updates overnight.
  • Be familiar with emergency procedures, and follow them when required.

Many of these things were only done by people working third shift. Which meant computer-rooms had a human on-staff 24/7. Sometimes many of them.

There was a side-effect to all of this, though. What if the overnight Operator ran into an emergency they couldn't handle? They had to call a Systems Programmer to advise a fix, or to come in and fix it. In the 80's, when telephone modems came into their own, they might even have been able to dial in and fix it from home.

On-Call was born.

There was another side-effect to all of this: it happened before the great CompSci shift in the colleges, so most Operators were women. And many Systems Programmers were too. This was why my first job had mostly women in IT management and senior technical roles. This was awesome.

A Systems Programmer, as they were called at the time, was less of a Software Engineering role as we would define it today. They were more DevOps, if not outright SysAdmin. They had coding chops, because much of systems management at the time required them. Their goal was more about wiring together purchased software packages to work coherently, or modifying purchased software to work appropriately.


Time passed, and more and more of the overnight Operator's job was automated away. Eventually, the need for an overnight Operator no longer justified the position. Or you simply couldn't hire one to replace the Operator who had just quit. However, the systems were still running 24/7, and you needed someone ready to respond to disasters. On-call got more intense, since you no longer had an experienced hand in the room at all times.

The Systems Programmers earned new job-titles. Software Engineering started to become a distinct skill-path and career, so it was firewalled off in a department called Development. In those days, Development and Systems people spoke often; something you'll hear old hands grumble about when they point out that DevOps isn't actually anything new. Systems was on-call, and sometimes Development was too, if there was a big thing rolling out.

Time passed again. Management culture changed, realizing that development people needed to be treated and managed differently than systems people. Development became known as Software Engineering, and became its own career-track. The new kids getting into the game never knew the close coordination with Systems that the old hands had, and assumed this separation was the way it had always been. Systems became known as Operations; to the chagrin of some old Systems hands, who resented being called an 'Operator', a title that was typically very junior. Operations remained on-call, and kept informal lists of developers who could be relied on to answer the phone at o-dark-thirty in case things went deeply wrong.

More time, and the separation between Operations and Software Engineering became deeply entrenched. Some bright sparks realized that there were an awful lot of synergies to be had with close coordination between Ops and SE. And thus, DevOps was (re)born in the modern context.

Operations was still on-call, but now it was open for debate about how much of Software Engineering needed to be put on the Wake At 3AM In Case Of Emergency list.

And that is how on-call evolved from the minicomputer era, to the modern era of cloud computing.

You're welcome.

InfluxDB queries, a guide


I've been playing with InfluxDB lately. One of the problems I'm facing is getting what I need out of it. Which means exploring the query language. The documentation needs some polishing in spots, so I may submit a PR to it once I get something worked up. But until then, enjoy some googlebait about how the SELECT syntax works, and what you can do with it.

Rule 1: Never, ever put a WHERE condition that involves 'value'. Value is not indexed. Doing so will cause table-scans, and for a database that can legitimately contain over a billion rows, that's bad. Don't do it.
Rule 2: No joins.
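
To make Rule 1 concrete, the shape to avoid is any query that filters on the value field itself. A made-up example of what not to run:

SELECT value FROM site_hits WHERE value > 100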

With that out of the way, have some progressively more complex queries to explain how the heck this all works!


Return a list of values.

Dump everything in a measurement, going back as far as you have data. You almost never want to do this.

SELECT value FROM site_hits

The one exception to this rule is if you're pulling out something like an event stream, where events are encoded as tag values.

SELECT event_text, value FROM eventstream

Return a list of values from a measurement, with given tags.

One of the features of InfluxDB is that you can tag values in a measurement. These function like extra fields in a database row, but you still can't join on them. The syntax for this should not be surprising.

SELECT value FROM site_hits WHERE webapp = 'api' AND environment = 'prod'

Return a list of values from a measurement, with given tags that match a regex.

Yes, you can use regexes in your WHERE clauses.

SELECT value FROM site_hits WHERE webapp =~ /^api_[a-z]*/ AND environment = 'prod'


That's cool and all, but the real power of InfluxDB comes with the aggregation functions and grouping. This is what allows you to learn what the max value was for a given measurement over the past 30 minutes, and other useful things. These yield time-series that can be turned into nice charts.

Return a list of values, grouped by application

This is the first example of GROUP BY, and isn't one you'll probably ever need to use. This will emit multiple time-series.

SELECT value FROM site_hits WHERE webapp =~ /^api_[a-z]*/ AND environment = 'prod' GROUP BY webapp

Return a list of values, grouped by time into 10 minute buckets

When using time for a GROUP BY value, you must provide an aggregation function! This will add together all of the hits in the 10 minute bucket into a single value, returning a time-stream of 10 minute buckets of hits.

SELECT sum(value) FROM site_hits WHERE webapp =~ /^api_[a-z]*/ AND environment = 'prod' GROUP BY time(10m)

Return a list of values, grouped by both web-server and time into 10 minute buckets

This does the same thing as the previous, but will yield multiple time-series. Some graphing packages will helpfully chart multiple lines based on this single query. Handy, especially if servername changes on a daily basis as new nodes are added and removed.

SELECT sum(value) FROM site_hits WHERE webapp =~ /^api_[a-z]*/ AND environment = 'prod' GROUP BY time(10m), servername

Return a list of values, grouped by time into 10 minute buckets, for data received in the last 24 hours.

This adds a time-based condition to the WHERE clause. To keep the line shorter, we're not going to group on servername.

SELECT sum(value) FROM site_hits WHERE webapp =~ /^api_[a-z]*/ AND environment = 'prod' AND time > now() - 24h GROUP BY time(10m)


There is one more trick InfluxDB can do, and this isn't documented very well. InfluxDB can partition data in a database into retention policies. There is a default retention policy on each database, and if you don't specify a retention-policy to query from, you are querying the default. All of the above examples are querying the default retention-policy.

By using continuous queries you can populate other retention policies with data from the default policy. Perhaps your default policy keeps data for 6 weeks at 10 second granularity, but you want to keep another policy for 1 minute granularity for six months, and another policy for 10 minute granularity for two years. These queries allow you to do that.
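
For reference, the downsampling itself could look something like the continuous query sketched below. This is only a sketch: the database name ('metrics') is made up, and the exact CREATE CONTINUOUS QUERY syntax varies between InfluxDB versions, so check the documentation for yours.

CREATE CONTINUOUS QUERY cq_site_hits_1m ON metrics BEGIN SELECT sum(value) INTO "6month".site_hits FROM site_hits GROUP BY time(1m), webapp, servername, environment END

With something like that in place, the "6month" policy fills itself with 1 minute rollups as new data arrives in the default policy.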

Querying data from a non-default retention policy is done like this:

Return 14 weeks of hits to API-type webapps, in 1 hour buckets

SELECT sum(value) FROM "6month".site_hits WHERE webapp =~ /api_[a-z]*/ AND environment = 'prod' AND time > now() - 14w GROUP BY time(1h)

The same could be done for "18month", if that policy was on the server.

Grokking audit


I've been working with Logstash lately, and one of the tasks I was given was attempting to improve parsing of audit.log entries. Turning things like this:

type=SYSCALL msg=audit(1445878971.457:6169): arch=c000003e syscall=59 success=yes exit=0 a0=c2c3a8 a1=c64bc8 a2=c34408 a3=7fff44e370f0 items=2 ppid=16974 pid=18771 auid=1004 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=5 comm="compiled_evil" exe="/home/justsomeuser/bin/compiled_evil" key="hinkystuff"

Into nice, indexed entries where we can make Kibana graphs of all commands caught by the hinkystuff audit ruleset.

The problem with audit.log entries is that they're not very regexible. Oh, they can be. But optional, sometimes-there-sometimes-not fields suck a lot. Take, for example, the SYSCALL above. Items a0 through a3 are the syscall's arguments, and anywhere from one to four of them may be present. Expressing that in regex/grok is trying.
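
To give a flavor of what that ends up looking like (a simplified sketch, not the actual patterns from the repo linked below), the maybe-there arguments each get wrapped in their own optional group:

type=%{WORD:audit_type} msg=audit\(%{NUMBER:audit_epoch}:%{NUMBER:audit_counter}\): arch=%{BASE16NUM:syscall_arch} syscall=%{NUMBER:syscall_nr} success=%{WORD:syscall_success} exit=%{NUMBER:syscall_exit}( a0=%{BASE16NUM:syscall_a0})?( a1=%{BASE16NUM:syscall_a1})?( a2=%{BASE16NUM:syscall_a2})?( a3=%{BASE16NUM:syscall_a3})?

Multiply that by every optional field in every audit record type, and you can see why it gets tedious fast.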

So I made a thing:

Logstash-auditlog: Grok patterns and examples for parsing Audit settings with Logstash.

May it be useful.

Asynchronous brick walls


I had the dubious honor of running into a brick wall that has so far resisted both my search-engine skills and my asking for help. It got to the point that I asked on StackOverflow and got nothing (12 views as of this writing).

http://stackoverflow.com/questions/33088971/dealing-with-recursion-sync-loops-in-coffeescript

So far, trying to do this thing in coffee has taken two days. This is not a language set up to deal with "do this procedure one or more times until you don't get a thing, and then continue with execution." It can be done, I have no doubt about that. I have simply failed to figure out how.
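
For what it's worth, one common shape for this in callback-land is a named function that re-invokes itself from inside its own callback until the 'keep going' condition disappears. A rough sketch, using a made-up asynchronous fetchPage(token, callback):

fetchAll = (token, results, done) ->
  fetchPage token, (err, page) ->
    return done(err) if err?
    results = results.concat(page.items)
    if page.nextToken?
      fetchAll page.nextToken, results, done    # one more time
    else
      done(null, results)                       # no more pages, carry on

fetchAll null, [], (err, everything) ->
  console.log err ? everything.length           # ...then continue with execution

Knowing the shape and bolting it into the middle of an existing procedure are, of course, two different problems.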

Yesterday I gave up and wrote a ruby script that did what I needed. It took 6 hours to get the features I needed, including various checks to make sure I'm not breaking things by doing it. It worked the way I needed it to, and it was awesome.

And it unfortunately proved that I definitely need to figure out the 'one or more' thing in coffee going forward. My solution to that may very well be 'shell out to an external script that can do that kind of thing and accept the results.'

I think I now know why the Amazon SDK for Javascript doesn't include paging support like the CLI tools written in python do.

Or, a post I never thought I'd make, seeing as I'm a sysadmin.

But it seems I'm the senior git expert in my team, so I'm making it. So odd.


There are a series of questions you should ask within your team before moving a repo over to git. Git is a hell of a toolbox, and like all toolboxes there are nearly infinite ways of using it. There is no one true way, only ways that are better for you than others. The questions below will help you figure out how you want to use it, so you can be happier down the road.

Q: How do you use the commit-log?

History is awesome. Looking back five years in the code repository to figure out WTF a past developer was thinking when they wrote that bit of spaghetti code is quite useful, if that commit includes something like, "found weird-ass edge case in glib, this is the workaround until they get a fix." That's actionable. Maybe it's even tied to a bug number in the bug tracking system, or a support ticket.

Do you ever look through the history? What are you looking for? Knowing this allows you to learn what you want out of your source-control.

Q: What is the worth of a commit?

A commit in Git is not the same thing as a commit in SVN, Fog, or ClearCase. In some of those, a commit, or check-in, is a pretty big thing. It may take reviews or approvals before it can be made.

This question is there to get you thinking about what a commit is. Commits in git are cheap, and that changes things. Knowing that you will be facing more of them than you had in the past will help guide you through the later questions.

Q: Is every commit sacred, or do you value larger, well-documented commits more?

Practically everyone I know has made a commit with the message 'asdf'. If you're grinding on a stupid thing, it may take you 19 commits to come up with the two lines of code that actually work. In five years, when you come back to look at that line of code, the final commit-message on those lines might be:

a1bd0809 maybe this will work

Not exactly informative.

bdc8671a Reformat method calls to handle new version of nokogiri

That is informative.

Most projects value more informative commits over lots of little, iterative ones. But your team may be different. And may change its mind after experience has been had.
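
If your team lands on the side of fewer, better-documented commits, the usual mechanics (sketched here with a made-up branch name) are to squash the iterative work before it reaches the shared branch:

git rebase -i HEAD~19        # squash the 19 'asdf' commits into one, and write a real message

git checkout master
git merge --squash feature/nokogiri-upgrade
git commit                   # one well-documented commit lands on master

Either way, the history your future self reads is the one you decide to keep.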

Q: Should new features be all in one commit, or in a few modular commits?

Some features are quite large. So large, that rebasing them into a single commit leads to a diff of hundreds of lines. Such a large feature means that the history on those files will be slathered with the same initial-feature-commit with no context for why it is that way.

Is that good enough? Maybe it is; maybe you're more interested in the hotfix commits that fix bugs and explain non-intuitive behavior and workarounds. Maybe it isn't, and you need each sub-feature in its own commit. Or maybe you want every non-fixup commit.

This is where your approach to the history really informs your decision. If you know how you deal with the past, you will be better able to put process in place to be happier with your past self.


Once you've thought about these questions and your answers to them, you'll be better able to consider the deeper problem of branching strategy. Git is notoriously lacking in undo features, at least in shared repos, so getting this out of the way early is good.

I've seen this dynamic happen a couple of times now. It goes kind of like this.

October: We're going all in on AWS! It's the future. Embrace it.
November: IT is working very hard on moving us there, thank you for your patience.
December: We're in! Enjoy the future.
January: This AWS bill is intolerable. Turn off everything we don't need.
February: Stop migrating things to AWS, we'll keep these specific systems on-prem for now.
March: Move these systems out of AWS.
April: Nothing gets moved to AWS unless it produces more revenue than it costs to run.

What's prompting this is a shock that is entirely predictable, but manages to penetrate the reality distortion field of upper management because the shock is to the pocketbook. They notice that kind of thing. To illustrate what I'm talking about, here is a made-up graph showing technology spend over a course of several years.

[Chart: BudgetType-AWS.png]

The AWS line actually results in more money spent over time, as AWS does a good job of capturing costs that the traditional method generally ignores or assumes are lost in general overhead. But the screaming doesn't happen at the end of four years when they run the numbers; it happens in month four, when the ongoing operational spend after build-out is done is w-a-y over what it used to be.

The spikes for traditional on-prem work are for forklifts of machinery showing up. Money is spent, new things show up, and they impact the monthly spend only infrequently. In this case, the base-charge increased only twice over the time-span. Some of those spikes are for things like maintenance-contract renewals, which don't impact base-spend one whit.

The AWS line is much less spikey, as new capabilities are absorbed into the base-budget on an ongoing basis. You're no longer dropping $125K in a single go, you're dribbling it out over the course of a year or more. AWS price-drops mean that monthly spend actually goes down a few times.

Pay only for what you use!

Amazon is great at pointing that out, and highlighting the convenience of it. But what they don't mention is that by doing so, you will learn the hard way what it is you really use. The AWS Calculator is an awesome tool, but if you don't know how your current environment works, using it to predict what you'll end up spending is like throwing darts at a wall. You end up obsessing over small line-item charges you've never had to worry about before (how many IOPs do we do? Crap! I don't know! How many thousands will that cost us?), and missing the big items that nail you (Whoa! They meter bandwidth between AZs? Maybe we shouldn't be running our Hadoop cluster in multi-AZ mode).

There is a reason that third party AWS integrators are a thriving market.

Also, this 'what you use' is not subject to Oops exceptions without a lot of wrangling with Account Management. Had something that downloaded the entire EPEL repo twice a day for a month, and only learned about it when your bandwidth charge was 9x what it should be? Too bad; pay up or they'll turn the account off.

Unlike the forklift model, you pay for it every month without fail. If you have a bad quarter, you can't just not pay the bill for a few months and true-up later. You're spending it, or they're turning your account off. This takes away some of the cost-shifting flexibility the old style had.

Unlike the forklift model, AWS prices its stuff assuming a three year turnover rate. Many companies have a 5 to 7 year lifetime for IT assets: three to four years in production, with an afterlife of two to five years in various pre-prod, mirror, staging, and development roles. The cost of those assets therefore amortizes over 5-9 years, not 3.
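
To put rough numbers on the gap (re-using the made-up $125K forklift from the chart discussion above):

$125,000 over 36 months ≈ $3,470 per month
$125,000 over 84 months ≈ $1,490 per month

AWS's pricing model bakes in something closer to the first number, while a lot of shops have been quietly budgeting around the second.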

Predictable spending, at last.

Hah.

Yes, it is predictable over time given accurate understanding of what is under management. But when your initial predictions end up being wildly off, it seems like it isn't predictable. It seems like you're being held over the coals.

And when you get a new system into AWS and the cost forecast is wildly off, it doesn't seem predictable.

And when your system gets the rocket-launch you've been craving and you're scaling like mad, but the scaling costs don't match your cost forecast, it doesn't seem predictable.

It's only predictable if you fully understand the cost items and how your systems interact with them.

Reserved instances will save you money

Yes! They will! Quite a lot of it, in fact. They let a company go back to the forklift-method of cost-accounting, at least for part of it. I need 100 m3.large instances, on a three year up-front model. OK! Monthly charges drop drastically, and the monthly spend chart begins to look like the old model again.

Except.

Reserved instances cost a lot of money up front. That's the point, that's the trade-off for getting a cheaper annual spend. But many companies get into AWS because they see it as cheaper than on-prem. Which means they're sensitive to one-month cost-spikes, which in turn means buying reserved instances doesn't happen and they stay on the high cost on-demand model.

AWS is Elastic!

Elastic in that you can scale up and down at will, without week/month long billing, delivery and integration cycles.

Elastic in that you have choice in your cost accounting methods. On-demand, and various kinds of reserved instances.

It is not elastic in when the bill is due.

It is not elastic with individual asset pricing, no matter how special you are as a company.


All of these things trip up upper, non-technical management. I've seen it happen three times now, and I'm sure I'll see it again at some point.

Maybe this will help you in illuminating the issues with your own management.

I'm not a developer but...


...I'm sure spending a lot of time in code lately.

Really. Over the last five months, I'd say 80% of my normal working hours have been spent grinding on puppet code, or training others so they can maybe do some puppet stuff too. I've even gotten some continuous integration work in, building a trio of sanity-tests for our puppet infrastructure (a rough sketch of the whole thing follows the list):

  • 'puppet parser validate' returns OK for all .pp files.
    • Still on the 'current' parser, we haven't gotten as far as future/puppet4 yet.
  • puppet-lint returns no errors for the modules we've cleared.
    • This required extensive style-fixes before I put it in.
  • Catalogs compile for a certain set of machines we have.
    • I'm most proud of this one, as this check actually finds dependency problems, unlike 'puppet parser validate'.
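
For the curious, the whole trio boils down to something like the sketch below. The module paths and node names are stand-ins, and the catalog-compile step in particular depends heavily on your Puppet version and setup (this one assumes a Puppet 3.x-style 'puppet master --compile'):

#!/bin/bash
set -e

# 1. Syntax-check every manifest with the current parser
find modules manifests -name '*.pp' -print0 | xargs -0 -r -n1 puppet parser validate

# 2. Style-check the modules that have already been cleaned up
puppet-lint --fail-on-warnings modules/profile modules/role    # stand-in module names

# 3. Compile catalogs for a representative set of machines
for node in web01.example.com db01.example.com; do             # stand-in node names
  puppet master --compile "$node" > /dev/null
done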

Completely unsurprisingly, the CI stuff has actually caught bugs before they got pushed to production! Whoa! One of these days I'll be able to grab some of the others and demo this stuff, but we're off-boarding a senior admin right now and the brain-dumping is not being done by me for a few weeks.

We're inching closer to getting things rigged so that a passing build in the 'master' branch triggers an automatic deployment. That will take a bit of thought, as some deploys (such as class-name changes) require coordinated modifications in other systems.

Because I get to define what's 'possible', and anything is possible given enough time, management backing, and an unlimited budget.

If I don't have management backing, I will decide on my own how to fit this new ASAP in amongst my other ASAP work and the work that has actual deadlines attached to it.

If this ASAP has a time/money tradeoff, I need management backing to tell me which way to go. And what other work to let sluff in order to get the time needed.


In the end, there are only a few priority levels that people actually use.

  1. Realtime. I will stand here until I get what I need.
  2. ASAP.
  3. On this defined date or condition.
  4. Whenever you can get to it.

Realtime is a form of ASAP, but it's the kind of ASAP where the requester is highly invested in it and will keep statusing and may throw resources at it in order to get the thingy as soon as actually possible. Think major production outages.

ASAP is really 'as soon as you can get to it, unless I think that's not fast enough.' For sysadmin teams where the load-average is below the number of processors this can work pretty well. For loaded sysadmin teams, the results will not be to the liking of the open-ended deadline requestors.

On this defined date or condition is awesome, as it gives us expectations of delivery and allows us to do queue optimization.

Whenever you can get to it is like nicing a process. It'll be a while, but it'll be gotten to. Eventually.

"ASAP, but no later than [date]" is a much better way of putting it. It gives a hint to the queue optimizer as to where to slot the work amongst everything else.

Thank you.

Paternity leave and on-call


It all started with this tweet.

Which you need to read (Medium.com). Some pull-quotes of interest:

My manager probably didn't realize that "How was your vacation" was the worst thing to ask me after I came back from paternity leave.

Patriarchy would have us believe that parenting is primarily the concern of the mother. Therefore paternity leave is a few extra days off for dad to chillax with his family and help mom out.

Beyond a recovery time from pregnancy, much of parental leave is learning to be a parent and adjusting to your new family and bonding with the baby. I can and did bond with the baby, but not as much as my female coworkers bonded with their babies.

I should also state, that I don't just want equality, I want a long time to bond with my child. Three months or more sounds nice. Not only can I learn to soothe him when he's upset, put him to sleep without worrying about being paged, but I can be around when he does the amazing things babies do in their first year: learning to sit, crawl, eat, stand and even walk.

At my current employer, I was shocked to learn that new dads get two weeks off.

Two.

At my previous startup, paternal leave was under the jurisdiction of the 'unlimited vacation' policy. Well...

Vacations are important. My friends would joke that the one way to actually be able to take vacations was to keep having children. Here the conflation was in jest, and also a caricature of the reality of vacations at startups.

We had a bit of a baby-boom while I was there. Dads who showed up less than two weeks in were glared at and told to go home. After that, most of them worked part-time for a few weeks and slowly worked back up to full time.

This article caused me to tweet...

The idea here is that IT managers who work for a company like mine, with a really small amount of parental leave, do have a bit of power to give Dad more time with the new kid: take him off the call rota for a while. A better corporate policy is ideal, but it's a kind of local fix that just might help. Dad doesn't have to live tied to the pager and the new kid at the same time.

Interesting idea, but not a great one.

Which is a critique of the disaster-resilience of 3-person teams. I was on one, and we had to coordinate Summer Vacation Season to ensure we had two-person coverage for most of it, and if 1-person coverage was unavoidable, keep it to a couple of days at most. None of us had kids while I was there (the other two had teenagers, and I wasn't about to start), so we didn't get to live through a paternity-leave sized hole in coverage.

Which is the kind of team I'm on right now, and why I thought of the idea. We have enough people that a person sized hole, even a Sr. Engineer sized hole, can be filled for several to many weeks in the rotation.

That's the ideal route though, and touches on a very human point: if you're in a company where you always check mail or can expect pages off-hours, it doesn't matter if you're not in the official call-rotation. That's a company culture problem independent of the on-call rotation.

My idea can work, but it takes the right culture to pull off. Extended leave would be much better, and is the kind of thing we should be advocating for.

You should still read the article.

The project is done, and you have a monitoring system you like!

Now, how do you keep liking it?

Like all good things, it takes maintenance. There are a few processes you should have in place to provide the right feedback loops to keep liking your shiny new monitoring environment.

  • Include questions about monitoring in your incident retrospective process.
  • Periodically review active alarms to be sure you still really want them.

Implementing these will provide both upward pressure to expand monitoring into the areas it needs to go, and downward pressure to get rid of needless noise.
