December 2011 Archives

The need for 'standby time'

This morning's SANS blog entry rang true with me.

They're coming from an InfoSec point of view, but this fits into the overall IT framework rather well. I even remembered it in the chart I posted back in January:
A discrete Security department is one of the last things to break out of the pack, and that's something the SANS diary entry addresses. Between one-gun IT and a discrete security department (a "police department" in their terms) you get the volunteer fire department. There may be one person who is the CSO, but there are no full-timers who do nothing but InfoSec work. When something happens, people get drawn in from everywhere. It's often the same people every time, and when a formal Security department is formed, it'll probably be those people who are in it.

But, stand-by time:

Although it may sound like it means "stand around doing nothing," standby time is more like on-call or ready-to-serve time. Some organizations implement on-call time as that week or two that you're stuck with the pager so if anything happens after-hours you're the one that gets called. Otherwise known as the "sorry family, I can't do anything with you this week" time. As the organization grows, that will become less onerous as they move to a fully-staffed 24/7 structure with experienced people. That's not really what I mean by standby time.

Standby time is time set aside in the daily schedule and devoted to incident response. Most of the time it should focus on the first stage of incident response: Preparation. It's time spent keeping up to date on security news and events, updating documentation, and building tools and response processes. It's interruptible should an incident arise, but it's not interruptible for other meetings or projects.

This is time I've been calling "fire-watch" all along. Time when I have to be there in case something goes wrong, but I don't have anything else really going on. I spent a lot of 2010 in "fire-watch" thanks to the ongoing budget crisis at WWU and the impact it had on our project pace.

Kevin Liston is advocating putting actual time on the schedule when crisis watch is your primary duty. During this time, anything else you do should be the kind of thing that is immediately interruptible, such as studying or doing the otherwise low-priority grunt work of InfoSec. Or, you know, blogging.

Does this apply to the Systems Administration space?

I believe it does, and it follows a similar progression to the InfoSec field.

When a crisis emerges in a one-gun shop, well, you already know who is going to handle it.

In the mid-size shops like we had at WWU the person who handles the crisis is generally the one who discovers it first, a decidedly ad-hoc process.

In the largest of shops, where they have a formal Network Operations Center, they may have sysadmin staff on permanent standby just in case something goes wrong. Once upon a time they were called 'Operators'; I don't know what they're called these days. They're there for first-line triage of anything that can go wrong, and they know who to call when something fails outside their knowledge-base.

Standby Time is useful, since it gives you a whack of time you'd otherwise spend bored, during which you can do such useful things as:

  • Updating documentation
  • Reviewing patches and approving their application
  • Spreadsheeting for budget planning
  • Reviewing monitoring system logs for anomalies any automated systems may have missed
  • Reviewing the monitoring and alerting framework to make sure it makes sense

In other words, those 'low priority' tasks we rarely seem to get to until it's too painful not to.
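One of those bullets, reviewing monitoring logs for anomalies, lends itself to partial automation during standby time. Here's a minimal sketch in Python; the log format, the ERROR level, and the spike threshold are all hypothetical and would need adjusting to your environment:

```python
import re
from collections import Counter

# Hypothetical log format: "2011-12-08 14:02:11 ERROR disk latency high"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2} (\w+)")

def error_spikes(lines, threshold=10):
    """Count ERROR lines per hour; return hours whose count exceeds threshold."""
    per_hour = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group(2) == "ERROR":
            per_hour[m.group(1)] += 1
    return {hour: n for hour, n in per_hour.items() if n > threshold}
```

It won't replace eyeballing the logs, but a quick flagging pass like this narrows down where the eyeballing should start.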

Not all sysadminly types need to be on the fire watch, but some do. In DevOps environments where the systems folk are neck deep in development, some of them may only be on it for a little while. In shops with specialties that are rarely involved in incident response, such as Storage Administrators, those people may never take the front-line fire watch but may still carry the 2nd/3rd-tier pager.

Note, this is in addition to any helpdesk that may be in evidence. The person on fire watch will be the first responder when something like a load-spike triggers a cascading failure among the front-end load-balanced web-servers. A fire-watch is more important for entities with little application diversity and few internal users, places like Etsy, eBay, and Amazon. It's less important for entities with a lot of internal users and a huge diversity of internal systems, places like WWU. There, a lot of things can go wrong in little ways, and tracking who knows how to fix each of them is hard.

If nothing else, you can put "fire watch" on your calendar as an excuse to do the low-level tasks that need to get done while at the same time fending off meeting invites.

Alone on the shelf


Standing there, alone, wedged between a bunch of Linux books, is the lone PowerShell book.

Laptop demographics at LISA11

Looking around at what we're hauling, this is what I've noticed:

  • About 40% are toting Apple laptops of various types. I haven't seen any white MacBooks, so they're all Airs or Pros.
  • Of the 60% carrying PCs, fewer than half are running Windows.
  • Of the Windows installs, nearly all are running Win7. I've seen maybe three people running XP.

Clearly USENIX needs to hit up Apple for a sponsorship next year; we're obviously fans of their products.

I've seen a few tablets around. iPads dominate the small sample size.

Lisa 2011: The Limoncelli Test

Also known as M7.

From the book:

Tom's books total over 2,100 pages of advice. In this class he'll narrow all that down to 32 essential practices. Tom will blast through all 32 practices, explaining what brought him to include each one on the list, plus tips for incorporating the practice, policy, or technology into your organization. You'll find some great ideas for providing better service with less effort.

Take back to work: How to identify and fix your biggest problems, cross-train your team, strengthen your systems--and more!

Topics include:

  • Improving sysadmin-user interaction
  • Best practices for working together as a team
  • Best practices for service operations
  • Engineering for reliability
  • Sustainable Enterprise fleet (desktop/laptop) management
  • How to figure out what your team does right, and where it needs to improve

This was a very good session. It covers the Limoncelli Test, unsurprisingly. This is one of many attempts to come up with a sysadmin version of the Joel Test (ServerFault tried), but this one seems to be going the distance. Do click on the link, as it leads right to the test. Tom has even written essays about each point to support its inclusion.

Some of the stuff in here is obvious if you've been in the industry for a while (use a ticket-tracking system, automated patching); other parts perhaps not so much (there are three policies that all sysadmin departments need to have defined to be effective). Some applies only to multi-person environments (pager rotations), while other parts are universally applicable (service monitoring).

I got a lot of goodies out of this. Some of it I had been peripherally aware of, but had never seen written up like this all in one spot before.

Ops Docs

An Ops Doc is a kind of service documentation. Each service you offer needs an ops doc and it needs to have certain things in it:
  • Overview: What it is, what it does.
  • Build: How to build it, or where to get it.
  • Deploy: How to install it, configure it.
  • Common Tasks: What you commonly do with it, what kinds of issues commonly come up, and their resolutions.
  • Pager playbook: Document alert handling.
  • Disaster Recovery: What are the DR policies for this service, and how do you run them.
  • Service Level Agreements: What has been promised to whom, and what the penalties are; where the agreement lives and how to deal with it.

Critically, these docs need a periodic audit, probably by a non-technical manager such as a Project Manager. And by "non-technical" I mean someone for whom managing people is their job, not managing technology directly.
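To make that section list concrete, a skeleton ops doc might look like the following. The service name and every detail here are hypothetical placeholders, not prescriptions:

```markdown
# Ops Doc: webmail service

## Overview
Campus webmail front-end; depends on the IMAP cluster and LDAP.

## Build
Where the packages or source come from, and how to rebuild them.

## Deploy
Install steps, configuration files, and their locations.

## Common Tasks
Restarting services, clearing queues, common complaints and their fixes.

## Pager Playbook
For each alert: what it means, first diagnostic steps, escalation path.

## Disaster Recovery
Backup locations, restore procedure, failover steps.

## Service Level Agreements
Uptime promised to whom, maintenance windows, penalties.
```

Even a mostly empty skeleton is useful: the blank headings tell you exactly what documentation debt the service carries.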

The Three Empowering Policies

There are three policies that all sysadmins need to have defined in order to be effective. Otherwise, people will just walk up whenever and ask you to do stuff, and you'll do it whenever they ask, however they ask, since we're nice that way. This is management by interrupt, and that's not a good way to manage our time. What's more, it leads to grumbling, and it reinforces the Server Troll reputation we sometimes get. The three policies:

Acceptable methods for users to ask for help

Walking up and asking may be the best way for you, but in general it isn't a good way. Having a policy that defines what are the ways that users may ask for help allows sysadmins to better budget their time, and makes them more efficient overall.

The definition of an emergency

By enshrining the definition of an emergency in policy, you prevent localized issues, advocated by one vocal person or a small group of users, from sucking resources away from a larger issue that affects the entire system but doesn't have a vocal advocate driving attention to it. The example Tom uses: a Code Red is something that stops production cold; a Code Yellow is something that could lead to a Code Red if left unattended.

The scope of service

This policy defines what is and is not covered. It is this policy that tells people that the sysadmins are not fax-repair qualified, or whether or not they make house-calls for teleworkers. This policy also defines when service is available, and what the after-hours options are. 

How to convince people to make big changes

I have to give big, big thanks to Tom for this one. This section of the class focused on how to convince manager-types, or other people with the power to block IT changes, that such a change is in their best interest. I've been saying for years that one of the chief skills a well-qualified Systems Engineer needs is the ability to speak effectively to management. A technician doesn't need to talk to people persuasively; a technical manager needs to talk to other managers. Tom went there.

Thank you.

Thank you.

Thank you.

I've met many people in our field who stuck with computers because either people are scary, or they don't want to deal with the bullshit that dealing with people day in and day out requires. These are not the people who make it to Senior jobs, at least not without some help. Tom identified some effective strategies for social engineering your way to what needs doing.

A fuller treatment of this topic will come in another blog post. Heck, I've got a proto-book in progress on this very topic. So this will be briefer than it really needs to be.

  • Don't make people feel wrong. Phrase changes in non-accusatory ways. Making them feel wrong gets them defensive, and MUCH less likely to agree that you are presenting the best way forward.
  • Don't make people feel blamed. Explain how this change will improve everything overall. People feeling responsible for bad decisions get defensive. You don't want that.
  • Invent questions that'll give THEM the idea. Social engineering. If they come up with it (subtly pushed) they're more likely to follow through.
  • Don't be threatening to their authority. Authority can come in the form of direct power (they're your boss), or indirect (they have 20 years on you, and everyone listens to them before you). People don't like upstarts, and can quash your idea out of hand just because you seem like a threat. Don't be a threat.
  • For big changes, break them up into smaller changes and present those. Smaller change is less scary than bigger change.
  • The Statement of Undeniable Value. If you can distill your change down to simple-to-understand numbers, it can be a LOT easier to convince people that the change is needed. Suddenly, all of that seemingly irreducible complexity is distilled into a discrete dollars-per-unit savings.
  • Some people respond to data, others respond to peer recommendation. Knowing the difference is key. Knowing that Google uses a specific product raises that product's shine in the eyes of the decision-maker who keeps saying 'no'.
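The "Statement of Undeniable Value" bullet above can be made concrete with a back-of-the-envelope calculation. A quick Python sketch; every number here is invented for illustration, and your loaded labor rate and ticket volumes will differ:

```python
def yearly_savings(tickets_per_week, minutes_saved_per_ticket, loaded_rate_per_hour):
    """Estimate dollars saved per year if a change shaves time off each ticket.

    tickets_per_week: how many tickets the change touches
    minutes_saved_per_ticket: handling time the change eliminates
    loaded_rate_per_hour: fully loaded staff cost per hour
    """
    hours_per_year = tickets_per_week * 52 * minutes_saved_per_ticket / 60.0
    return hours_per_year * loaded_rate_per_hour

# Hypothetical: 40 tickets/week, 15 minutes saved each, $75/hour loaded cost
print(yearly_savings(40, 15, 75))
```

"This change saves us roughly $39,000 a year in staff time" is a sentence a budget-holder can act on; "this makes the queue less annoying" is not.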

There is also a nice wizard-style website to help you figure out how to persuade certain people to do things. It takes some social know-how to really get the most out of it, but if you have that, it can really help you get even better.

Getting stalked by web-sites

This is the kind of complaint that those of you with well-tuned AdBlock and GreaseMonkey plugins get all superior about. So click through to see what the rest of us have been putting up with.