May 2014 Archives

Orders of complexity

When automating a business process, be it figuring out when user meta-data needs to be eliminated or how to set up a certain type of server, there are certain orders of complexity you face:

  1. Do THAT.
  2. If THIS then DO THAT.
  3. If THIS then DO THAT, except WHEN.
  4. If THIS then DO THAT, except WHEN, but do it anyway IF.
  5. If THIS then DO THAT, except WHEN, but do it anyway IF so long as THAT isn't true.
  6. If THIS then DO THAT, except WHEN, but do it anyway IF so long as THAT isn't true or THIS is true.

An example for case 6:

IF a user is terminated, disable all of their accounts immediately and archive their data within 14 days; except if a manager puts a hold on it, but delete it anyway after 31 days, so long as the manager isn't a C-level or doing so would cause political problems at which point just don't bother deleting.

Try automating that.
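For a sense of what that looks like once encoded, here is a minimal sketch of the case 6 rule. Everything in it is an assumption for illustration: the `TerminatedUser` fields, the action names, and my reading of the rule (a hold pauses archiving, and after 31 days the data is deleted unless the manager is C-level or it's politically sensitive). Note that one of the inputs is a flag no system can set for you.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TerminatedUser:
    # All fields are hypothetical; a real identity system would supply its own.
    days_since_termination: int
    manager_hold: bool = False
    manager_is_c_level: bool = False
    politically_sensitive: bool = False  # a human has to set this one


def actions_for(user: TerminatedUser) -> list[str]:
    """Return the actions the case 6 rule calls for, under the reading above."""
    actions = ["disable_accounts"]  # DO THAT: always, immediately

    if not user.manager_hold:
        # Normal path: archive (the 14-day deadline is enforced elsewhere).
        actions.append("archive_data")
        return actions

    # except WHEN: a manager has put a hold on the data.
    if user.days_since_termination <= 31:
        actions.append("hold")
        return actions

    # but do it anyway IF: the hold is older than 31 days...
    if user.manager_is_c_level or user.politically_sensitive:
        # ...so long as THAT isn't true, or THIS is true.
        actions.append("leave_data_alone")
    else:
        actions.append("delete_data")
    return actions
```

Four branches for one sentence of policy, and that is before anyone asks what "political problems" means.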

This is the big problem with automating business processes: we humans glory in our exception creation and handling processes, and we're damned bad at being consistent about it. The same is true of managing a fleet of servers; if it's fully manual, the rules about what goes where are similarly complex. That 87-point checklist is a good sign of it.

When attempting to push out an Identity Management System project or a Configuration Management project, a lot of the hard work goes into simplifying the complexity in the existing rules. This is not a strictly technical problem; it's a people problem.

You have six groups of developers deploying Tomcat apps. Yay standards! But each group has their own requirements for how they want Tomcat and the supporting JVM massaged.

The people problem here is figuring out how many of those differences are just mythology ("this blog on the internet said doing it this way would be bad, and had charts. We don't like bad.") and how many have technical reasons behind them ("we run out of private-bytes if we do it that way"). That's a lot of negotiation, gentle easings into new processes to smooth workflow, and a lot of technical handholding to reassure everyone that this really is a better way. Even after all that work you still may end up with Puppet classes like this:

  • tomcat6::tc_repos
  • tomcat6::helpdeskportal
  • tomcat6::SNARC
  • tomcat7::bbc_crawler
  • tomcat7::HR_APPS
  • tomcat7::buildSystem

Fighting entropy is hard, hard work. Technically hard, and socially hard. In this new devops era of programmable-everything, that entropy has to be encoded, cross-checked, regression tested, and maintained somehow. Entropy will win in the end, but you can at least kick the can down the road enough that it'll have a harder time stealing your weekend.

If you're lucky enough to face a greenfield deployment of some kind (maybe you're figuring out how applications and access will happen in a public cloud), you can at least put in rules and procedures at the start to help constrain the organic growth of exceptions. We... aren't always that lucky. Sometimes we have to cram the entropy demon back into jail the hard way.

But if you do have the entropy demon in jail? Yay! But it's going to be a constant fight to keep the business rules encoded in the automation simple. Keep up the fight. You don't want the Eschaton bootstrapping on your watch.

A taxonomy of IT users

Over the years I've seen a small collection of fake-names crop up in the sysadmin space. Here is a list:

An oldie, but a Sysadmin who has gone over to the dark-side.

Originally coined by Laura Chappell, Fred is the User From Hell. Or, The Power User who Isn't. Fred knows everything, or rather, thinks they do. They're wrong, but don't know it, and it makes your life all too interesting. Fred may be a manager, a peer, or a frequent-flier in the ticket queue.

Originally from a famous Warcraft video, Leeroy is the peer who just deploys stuff because it's cool. They... haven't learned (the hard way) how this can go wrong, so aren't naturally suspicious. This could be the rose-colored glasses of youth and exuberance, or it could be a trusting nature. They'll learn.

Coined by The Phoenix Project, Brent is the person that ends up with their hands in everything one way or the other. They may be a single-point-of-knowledge, the only person who knows anything about topic X, or just the person that gets handed the weird stuff because, well, "Brent probably knows". A lot of us are a Brent, and it sure as heck makes getting long vacations approved difficult. There may be more than one of them, depending on topics.

I used to be a Leeroy, then I learned better.

I've been a Brent (oddball-stuff troubleshooter variety) at my current and last three jobs.

Right now people have figured out that I know how to use Wireshark to track down oddball problems, so I'm doing a lot of packet analysis lately. This isn't something I can cross-train on very well, but I'm going to have to find a way; people's eyes tend to glaze over when you get into TCP RFCs, and it's easier to have me do it than to learn it themselves.

I did this at a previous job, so here are a few tips for what will make it easier on everyone.

Plan at least 4, preferably 6, weeks out.

This gives your watch-standers the chance to arrange their lives around the schedule. At 6 weeks out, they'll be rearranging their lives around the schedule, rather than rearranging the watch-schedule around their lives. As the one managing the schedule, this means you'll be doing fewer weekend-swaps and people are more likely to just know who is on call.

Send out calendar events for the rotations.

This is more of a weekend-watch thing, but putting events in their calendar will further cement that they're obligated for that period. Also, it's a nice reminder in email (or whatever) that their shift has been scheduled.

Have the call list posted somewhere mobile-friendly.

Many times, the watch-stander is merely the first responder; it's their job to figure out what domain the problem sits in (app/database/storage/hypervisor/facilities/etc) and call the person who can actually fix the thingy. Having the call-list easily accessible from mobile is a really nice thing to have. This can be a Google Doc, or an app like PagerDuty. An Excel spreadsheet on SharePoint, not so much.

Have the duties of the watch-stander clearly defined.

This seems obvious, but... it isn't. There are some questions you need answers to, otherwise you're going to experience sadness:

  • How fast must they respond to automated alerts?
  • Do they need to always answer the phone, or is voice-mail acceptable so long as the response is within a window?
  • How fast must they respond to emails?

The answers to these questions tell the watch-stander how much of a life they can fit in around the schedule. A movie is probably Right Out, but nipping out to the grocery store for a few things... maybe. Do they need to turn on bluetooth while driving, or can they wait until they stop (or just not drive at all)? How much 'response' can happen on a phone will greatly affect the quality-of-life questions.

What kind of sadness can you expect?

Missed alerts mostly. Without clearly defined response guidelines your watch-standers are going to sleep through their phones, miss emails, and otherwise fail to meet performance expectations. If you write those expectations down, they're far more likely to stick to them!

If you're doing automatic alert assignment, have an escalation policy.

You need a backstop in case the watch-stander sleeps through something. The backstop tier should never get called, but when it does, it's an Event. An event people try to avoid, because something failed. Knowing that someone will notice if an automated alert gets ignored makes people more likely to respond in time.
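A minimal sketch of what such a policy boils down to, assuming a three-tier setup. The tier names and the 15 and 30 minute thresholds are made up for illustration; tools like PagerDuty have their own way of expressing this.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    page_after_minutes: int  # minutes the alert has gone unacknowledged


# Hypothetical policy: the thresholds are placeholders, not recommendations.
POLICY = [
    Tier("watch-stander", 0),
    Tier("backstop", 15),
    Tier("manager", 30),
]


def tiers_to_page(minutes_unacked: int, policy: list[Tier] = POLICY) -> list[str]:
    """Return every tier that should have been paged by now."""
    return [t.name for t in policy if minutes_unacked >= t.page_after_minutes]
```

Anything past the first tier showing up in that list is the Event; it's worth logging when it happens.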

If you're doing a 7-day watch, swap shifts on something other than Monday or Friday.

This depends on locality, but for US locations, the Monday Holiday Law means that Mondays are occasionally holidays, and you don't want to do a watch-swap on a non-business day. In the same vein, many organizations have a rule that when a holiday exempt from the Monday rule (New Year's Day, for example) lands on a Saturday, the day off is the Friday before.

At the same time, there is a US holiday that camps on top of Thursday (see next item). Tuesday or Wednesday are good choices.
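Here's a rough sketch of generating a 7-day rotation that hands off on Wednesdays and is planned six weeks out. The function names and the roster are hypothetical, and it does nothing clever about holidays or swaps.

```python
from __future__ import annotations

from datetime import date, timedelta

WEDNESDAY = 2  # date.weekday() counts Monday as 0


def handoff_dates(today: date, weeks: int = 6,
                  handoff_weekday: int = WEDNESDAY) -> list[date]:
    """Return the next `weeks` handoff dates, starting strictly after today."""
    days_ahead = (handoff_weekday - today.weekday()) % 7
    first = today + timedelta(days=days_ahead or 7)
    return [first + timedelta(weeks=i) for i in range(weeks)]


def schedule(today: date, roster: list[str],
             weeks: int = 6) -> list[tuple[date, str]]:
    """Pair each handoff date with the next person in the roster, round-robin."""
    dates = handoff_dates(today, weeks)
    return [(d, roster[i % len(roster)]) for i, d in enumerate(dates)]


if __name__ == "__main__":
    # Hypothetical roster, purely for demonstration.
    for start, person in schedule(date.today(), ["alice", "bob", "carol"]):
        print(start.isoformat(), person)
```

The output is the raw material for both the posted call list and the calendar events.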

If you're doing a weekend watch, have a policy in place for handling long weekends.

The 4-day Thanksgiving holiday in the US is a great example, as that duty schedule is double what a normal weekend would be. Decide if you're going to create two shifts for it or allow one person to cover the whole thing, and decide well in advance. For some organizations the Friday after Thanksgiving is a major production day, so this may be moot ;).