Getting stuck in Siberia


I went on a bit of a twitter rant recently.

In response, someone asked a good question about a very different problem than the one I was ranting about: how do you deal with a manager who is avoiding you?


I hate to break it to you, but if you're in the position where your manager is actively avoiding you, it's all on you to fix it. There are cases where it isn't up to you, such as when a lot of people are being avoided and it's affecting the manager's work performance, but that's a systemic problem. No, for this case I'm talking about you, specifically, being avoided, and not your fellow direct-reports. It's personal, not systemic.

No, it's not fair. But you still have to deal with it.

You have a question to ask yourself:

Do I want to change myself to keep the job, or do I want to change my manager by getting a new job?

Because this shunning activity is done by managers who would really rather fire your ass, but can't or won't for some reason. Perhaps they don't have firing authority. Perhaps firing someone involves more paperwork than they can be bothered with. Perhaps they're the conflict-avoidant type and pretending you don't exist is preferable to making you Very Angry by firing you.

You've been non-verbally invited to Go Away. You get to decide if that's what you want to do.

Going Away

Start job-hunting, and good riddance. They may even overlook job-hunt activities on the job, but don't push it.

Staying and Escalating

They can't/won't get rid of you, but you're still there. It's quite tempting to stick around and intimidate your way into their presence and force them to react. They're avoiding you for a reason, so hit those buttons harder. This is not the adult way to respond to the situation, but they started it.

I shouldn't have to say that, but this makes for a toxic work environment for everyone else so... don't do that.

Staying and Reforming

Perhaps the job itself is otherwise awesome-sauce, or maybe getting another job will involve moving and you're not ready for that. Time to change yourself.

Step 1: Figure out why the manager is hiding from you.
Step 2: Stop doing that.
Step 3: See if your peace-offering is accepted.

Figure out why they're hiding

This is key to the whole thing. Maybe they see you as too aggressive. Maybe you keep saying no and they hate that. Maybe you never give an unqualified answer and they want definites. Maybe you always say, 'that will never work,' to anything put before you. Maybe you talk politics in the office and they don't agree with you. Maybe you don't go paintballing on weekends. Whatever it is...

Stop doing that.

It's not always easy to know why someone is avoiding you. That whole avoidant thing makes it hard. Sometimes you can get intelligence from coworkers about what the manager has been saying when you're not around, or what happens when your name comes up. Ask around; at the very least it'll show you're aware of the problem.

And then... stop doing whatever it is. Calm down. Say yes more often. Start qualifying answers only in your head instead of out loud. Say, "I'll see what I can do" instead of "that'll never work." Stop talking politics in the office. Go paintballing on weekends. Whatever it is, start establishing a new set of behaviors.

And wait.

Maybe they'll notice and warm up. It'll be hard, but you probably need the practice to change your habits.

See if your peace-offering is accepted

After your new leaf is turned over, it might pay off to draw their attention to it. This step definitely depends on the manager and the source of the problem, but demonstrating a new way of behaving before saying you've been behaving better can be the key to getting back into the communications stream. It also hangs a hat on the fact that you noticed you were in bad graces and made the effort to change.

What if it's not accepted?

Then learn to live in Siberia and work through proxies, or lump it and get another job.

If you're wondering why comments aren't working, as I was, and are on shared hosting, as I am, and get to looking at your error_log file and see something like this in it:

[Sun Oct 12 12:34:56 2014] [error] [client 192.0.2.5] 
ModSecurity: Access denied with code 406 (phase 2).
Match of "beginsWith http://%{SERVER_NAME}/" against "MATCHED_VAR" required.
[file "/etc/httpd/modsecurity.d/10_asl_rules.conf"] [line "1425"] [id "340503"] [rev "1"]
[msg "Remote File Injection attempt in ARGS (/cgi-bin/mt4/mt-comments.cgi)"]
[severity "CRITICAL"]
[hostname "example.com"]
[uri "/cgi-bin/mt/mt-comments.cgi"]
[unique_id "PIMENTOCAKE"]

It's not just you.

It seems that some webhosts have a mod_security rule in place that bans submitting anything through "mt-comments.cgi". As this is the main way MT submits comments, this kind of breaks things. Happily, working around a rule like this is dead easy.

  1. Rename your mt-comments.cgi file to something else
  2. Add "CommentScript ${renamed file}" to your mt-config.cgi file
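For reference, the change looks something like this. The renamed filename here is just an example; use whatever name you picked in step 1:

  # mt-comments.cgi was renamed to mt-comments-quiet.cgi in step 1
  CommentScript mt-comments-quiet.cgi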

And suddenly comments start working again!

Except for Google, since they're deprecating OpenID support.

"Over the next few decades demand in the top layer of the labor market may well centre on individuals with high abstract reasoning, creative, and interpersonal skills that are beyond most workers, including graduates."
- The Economist, vol. 413, no. 8907, Oct 4, 2014, "Special Report: The Third Great Wave. Productivity: Technology isn't Working"

The rest of the Special Report lays out a convincing argument that people who have automation-creation as part of their primary job duties are in for quite a bit of growth, and that people in industries subject to automation are going to have a hard time of it. This has a direct impact on sysadminly career direction.

In the past decade, Systems Administration has been moving away from mechanics who deploy hardware, install software, and fix problems, and towards Engineers who are able to build automation for provisioning new computing instances and installing application frameworks, and who know how to troubleshoot problems with all of that. In many ways we're a specialized niche of Software Engineering now, and that means we can ride the rocket with them. If you want to continue to have a good job in the new industrial revolution, keep plugging along and don't become the dragon in the datacenter people don't talk to.

Abstract Reasoning

Being able to comprehend how a complex system works is a prime example of abstract reasoning. Systems Administration is more than just knowing the arcana of init, grub, or WMI; we need to know how systems interact with each other. This is a skill that has been a pre-requisite for Senior Sysadmins for several decades now, so it isn't new. It's already on our skill-path. This is where Systems Engineers make their names, and sometimes become Systems Architects.

Creative

This has been less on our skill-path, but is definitely something we've been focusing on in the past decade or so. Building large automation systems, even with frameworks such as Puppet or Chef, takes a fair amount of both abstract reasoning and creativity. If you're good at this, you've got 'creative' down.

This has impacts for the lower rungs of the sysadmin skill-ladder. Brand new sysadmins are going to be doing less racking-and-stacking and more parsing and patching of Ruby or Ruby-like DSLs.

Interpersonal Skills

This is where sysadmins tend to fall down. A lot of us got into this gig because we didn't have to talk to people who weren't other sysadmins. Technology made sense, people didn't.

This skill is more a reflection of the service-oriented economy, and sysadmins are only sort of part of that, but our role in product creation and maintenance is ever more social these days. If you're one of two sysadmin-types in a company with 15 software engineers, you're going to have to learn how to have a good relationship with software engineers. In olden days, only very senior sysadmins had to have the Speaker to Management skill; now even mid-levels need to be able to speak coherently to technical and non-technical management.

It is no coincidence that many of the tutorials at conferences like LISA are aimed at building business and social skills in sysadmins. It's worth your time to attend them, since your career advancement depends on it.


Yes, we're well positioned to do well in the new economy. We just have to make a few changes we've known about for a while now.

And if there isn't a stipend...


Sysadmin-types, we kind of have to have a phone. It's what the monitoring system makes vibrate when our attention is needed, and we also tend to be "always on-call", even if it's tier 4 emergency last resort on-call. But sometimes we're the kind of on-call where we have to pay attention any time an alert comes in, regardless of hour, and that's when things get real.

So what if you're in that kind of job, or applying for one, and it turns out that your employer doesn't provide a cell phone and doesn't provide reimbursement? Some Bring Your Own Device policies are written this way. Or maybe your employer moves to a BYOD policy and the company-paid phones are going away.

Can they do that?

Yes they can, but.

As with all labor laws, the rules vary based on where you are in the world. However, in August 2014 (a month and a half ago!) Schwan's Home Service, Inc. lost an appeal in a California appellate court. This is important because California contains Silicon Valley, and what happens there tends to percolate out to the rest of the tech industry. This ruling held that employees who do company business on personal phones are entitled to reimbursement.

The ruling didn't provide a legal framework for how much reimbursement is required, just that some is.

This thing is so new that the ripples haven't been felt everywhere yet. In California, no-reimbursement policies are not legal; that much is clear, but beyond that, not much is. For companies based outside California, such as those in tech hot-spots like Seattle, New York, or the DC area, this is merely a warning that the legal basis for such no-reimbursement policies is not firm. As the California-based companies revise policies in light of this ruling, accepted practice in the tech field will shift without legal action elsewhere.

My legal google-fu is too weak to tell if this thing can be appealed to the state Supreme Court, though it looks like it might have already toured through there.

Until then...

I strongly recommend against using your personal phone for both work and private use. Having two phones, even phones you pay for, provides an affirmative separation between your work identity, subject to corporate policies and liability, and your private identity. This is more expensive than just getting an unlimited voice/text plan with lots of data and dual-homing, but you face fewer risks to yourself that way. No-reimbursement BYOD policies are unfair to tech-workers in the same way that employers who require a uniform but don't provide a uniform allowance are unfair; for some of us, that phone is essential to our ability to do our jobs and should be expensed to the employer. Laws and precedent always take a while to catch up to business reality, and BYOD is still getting caught up.

There is something that not many people seem to realize about how your personal email can get sucked into a lawsuit filed against your company. It all comes down to ediscovery...

When it comes to things to send alarming emails about, CPU, RAM, Swap, and Disk are the four everyone thinks of. If something seems slow, check one or all of those four to see if it really is slow. This sets up a causal chain...

It was slow, and CPU was high. Therefore, when CPU is high it is slow. QED.

We will now alarm on high CPU.

It may be true in that one case, but high CPU is not always a sign of bad. In fact, high CPU is a perfectly normal occurrence in some systems.

  1. Render farms are supposed to run that high all the time.
  2. Build servers are supposed to be running that high a lot of the time.
  3. Databases chewing on long-running queries.
  4. Big-data analytics that can run for hours.
  5. QE systems grinding on builds.
  6. Test-environment systems being ground on by QE.

Of course, not all CPU checks are created equal. Percent-CPU is one thing, Load Average is another. If Percent-CPU is 100% and your load-average matches the number of cores in the system, you're probably fine. If Percent-CPU is 100% and your load-average is 6x the number of cores in the system, you're probably not fine. If your monitoring system only grabs Percent-CPU, you won't be able to tell what kind of 100% event it is.
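To make that concrete, here's a minimal sketch of a check that treats 100% CPU as interesting only when the load average is well above the core count. The 6x multiplier mirrors the example above and is only an assumption; tune it to your own systems:

  import os

  def cpu_pressure(overload_factor=6):
      """Compare the 1-minute load average against the core count (Unix only)."""
      cores = os.cpu_count()
      load1 = os.getloadavg()[0]
      if load1 <= cores:
          return "OK: load %.1f on %d cores - busy, but healthy" % (load1, cores)
      if load1 >= overload_factor * cores:
          return "ALERT: load %.1f is %.0fx the core count (%d)" % (load1, load1 / cores, cores)
      return "WARN: load %.1f exceeds %d cores - watch the trend" % (load1, cores)

  print(cpu_pressure())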

As a generic, apply-it-to-everything alarm, High-CPU is a really poor thing to pick. It's easy to monitor, which is why it gets selected for alarming. But, don't do that.

Cases where a High-CPU alarm won't actually tell you that something is going wrong:

  • Everything in the previous list.
  • If your app is single-threaded, the actual high-CPU event for that app on a multi-core system is going to be WELL below 100%. It may even be as low as 12.5%.
  • If it's a single web-server in a load-balanced pool of them, it won't be a BOTHER HUMANS RIGHT NOW event.
  • During routine patching. It should be snoozed during a maintenance window anyway, but sometimes that doesn't happen.
  • Initializing a big application. Some things normally chew lots of CPU when spinning up for the first time.

CPU/Load Average is something you probably should monitor, since there is value in retroactive analysis and aggregate analysis. Analyzing CPU trends can tell you it's time to buy more hardware, or to turn up the max-instances value in your auto-scaling group. These are the kinds of things you look at in retrospect; they're not things that should wake you up at 2:38am.

Only turn on CPU alarms if you know that high CPU is an error condition worthy of waking up a human. Turning it on for everything just in case is a great way to train yourself to ignore high-CPU alarms, which means you'll miss the ones you actually care about. Human factors, they're part of everything.

Redundancy in the Cloud


Strange as it might be to contemplate, imagine what would happen if AWS went into receivership and was shut down to liquidate its assets. What would that mean for your infrastructure? Your project? Or even your startup?

It would be pretty bad.

Startups have been deploying preferentially on AWS or other Cloud services for some time now, in part due to venture-capitalist pressure to avoid owning physical infrastructure that would have to be liquidated should the startup go *pop*, and to be able to scale fast should a much-desired rocket-launch happen. If AWS shut down fully for, say, a week, the impact on pretty much everything would be tremendous.

Or what if it was Azure? Fully debilitating for those that are on it, but the wider impact would be smaller.

Cloud vendors are big things. In the old physical days we used to deal with the all-our-eggs-in-one-basket problem by putting eggs in multiple places. If you're on AWS, Amazon is very big on making sure you deploy across multiple Availability Zones, and on helping you become multi-region in the process if that's important to you. See? More than one basket for your eggs. I have to presume Azure and the others are similar, since I haven't used them.

Do you put your product on multiple cloud-vendors as your more-than-one-basket approach?

It isn't as easy as it was with datacenters, that's for sure.

This approach can work if you treat the Cloud vendors as nothing but Virtualization and block-storage vendors. The multiple-datacenter approach worked in large part because colos sell only a few things that impact the technology (power, space, network connectivity, physical access controls), though pricing and policies may differ wildly. Cloud vendors are not like that; they differentiate in areas that are technically relevant.

Do you deploy your own MySQL servers, or do you use RDS?
Do you deploy your own MongoDB servers, or do you use DynamoDB?
Do you deploy your own CDN, or do you use CloudFront?
Do you deploy your own Redis group, or do you use SQS?
Do you deploy your own Chef, or do you use OpsWorks?

The deeper down the hole of Managed Services you dive (and Amazon is very invested in pushing people to use them), the harder it is to take your toys and go elsewhere, or to run your toys on multiple Cloud infrastructures. Azure and the other vendors are building up their own managed-service offerings because AWS is successfully differentiating from everyone else by having the widest offering. The end-game here is to have enough managed-service offerings that virtual private servers don't need to be used at all.

Deploying your product on multiple cloud vendors requires either eschewing managed-services entirely, or accepting greater management overhead due to very significant differences in how certain parts of your stack are managed. Cloud vendors are very much Infrastructure-as-Code, and deploying on both AWS and Azure is like deploying the same application in Java and .NET; it takes a lot of work, the dialect differences can be insurmountable, and the expertise required means different people are going to be working on each environment which creates organizational challenges. Deploying on multiple cloud-vendors is far harder than deploying in multiple physical datacenters, and this is very much intentional.

It can be done, it just takes drive.

  • New features will be deployed on one infrastructure before the others, and the others will follow on as the integration teams figure out how to port it.
  • Some features may only ever live on one infrastructure as they're not deemed important enough to go to all of the effort to port to another infrastructure. Even if policy says everything must be multi-infrastructure, because that's how people work.
  • The extra overhead of running in multiple infrastructures is guaranteed to become a target during cost-cutting drives.

The ChannelRegister article's assertion that AWS is now in "too big to fail" territory, and would thus require governmental support to prevent wide-spread industry collapse, is a reasonable one. It just plain costs too much to plan for that kind of disaster in corporate disaster-response planning.

The alerting problem


4100 emails.

That's the approximate number of alert emails that got auto-deleted while I was away on vacation. That number will rise further before I officially come back from vacation, but it's still a big number. The sad part is, 98% of those emails are for:

  • Problems I don't care about.
  • Unsnoozable known issues.
  • Repeated alarms for the first two points (puppet, I'm looking at you)

We've made great efforts to cut down on our monitoring-fatigue problem, but we're not there yet. In part this is because the old, verbose monitoring system is still up and running, in part it's due to limitations in the alerting systems we have access to, and in part it's due to organizational habits that over-notify for alarms under the theory of, "if we tell everyone, someone will notice."

A couple weeks ago, PagerDuty had a nice blog-post about tackling alert fatigue, with a lot of good points to consider. I want to spend some time on point 6:

Make sure the right people are getting alerts.

How many of you have a mailing list you dump random auto-generated crap like cron errors and backup failure notices to?

This pattern is very common in sysadmin teams, especially teams that began as one or a very few people. It just doesn't scale. Also, you learn to just ignore a bunch of things like backup "failures" for always-open files. You don't build an effective alerting system with the assumption that alerts can be ignored; if you find yourself telling new hires, "oh ignore those, they don't mean anything," you have a problem.

The failure mode of tell-everyone is that everyone can assume someone else saw it first and is working on it. And no one works on it.

I've seen exactly this failure mode many times. I've even perpetrated it, since I know certain coworkers are always on top of certain kinds of alerts so I can safely ignore actually-critical alerts. It breaks down if those people have a baby and are out of the office for four weeks. Or were on the Interstate for three hours and not checking mail at that moment.

When this happens and big stuff gets dropped, technical management gets kind of cranky. Which leads to hypervigilance and...

The failure mode of tell-everyone is that everyone will pile into the problem at the same time and make things worse.

I've seen this one too. A major-critical alarm is sent to a big distribution list, and six admins immediately VPN in and start doing low-impact diagnostics. Diagnostics that aren't low-impact if six people are doing them at the same time. Diagnostics that aren't meant to be run in parallel and can return non-deterministic results if run that way, which tells the six admins six different stories about what's actually wrong, sending them in six different directions to solve not-actually-a-problem issues.

This is the Thundering Herd problem as it applies to sysadmins.

The usual fix for this is to build in a culture of, "I've got this," emails and to look for those messages before working on a problem.

The usual fix for this fails if admins do a little "verify the problem is actually a problem" work before sending the email and stomp on each other's toes in the process.

The usual fix for that is to build a culture of, "I'm looking into it," emails.

Which breaks down if a sysadmin is reasonably sure they're the only one who saw the alert and works on it anyway. Oops.


Really, these are all examples of telling the right people about the problem, but you really do need to go into more detail than "the right people". You need "the right person". You need an on-call schedule that will notify one or two of the Right People about problems. Build that with the expectation that if you're in the hot seat you will answer ALL alerts, build a rotation so no one is in the hot seat long enough to start ignoring alarms, and you'll have a far more reliable alerting system.

PagerDuty sells such a scheduling system. But what if you can't afford X-dollars a seat for something like that? You have some options. Here is one:

An on-call distribution-list and scheduler tasks
This recipe will provide an on-call rotation using nothing but free tools. It won't work with all environments. Scripting or API access to the email system is required.

Ingredients:

    • 1 on-call distribution list.
    • A list of names of people who can go into the DL.
    • A task scheduler such as cron or Windows Task Scheduler.
    • A database of who is supposed to be on-call when (can substitute a flat file if needed)
    • A scripting language that can talk to both the email system's management interface and the database.

Instructions:

Build a script that can query the database (or flat-file) to determine who is supposed to be on-call right now, and can update the distribution-list with that name. Powershell can do all of this for full MS-stack environments. For non-MS environments more creativity may be needed.

Populate the database (or flat-file) with the times and names of who is to be on-call.

Schedule execution of the script using a task scheduler.

Configure your alert-emailing system to send mail to the on-call distribution list.
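For non-MS environments, a minimal sketch of the script from the first step might look like the following. The flat-file format, the list name, and the update_dl() stub are all assumptions; swap the stub for whatever your mail system actually exposes (Exchange PowerShell cmdlets, the Google Directory API, and so on):

  import csv
  from datetime import date

  SCHEDULE_FILE = "oncall-schedule.csv"   # lines of: start-date,end-date,username
  ALWAYS_NOTIFY = ["it-manager"]          # names that stay on the list every shift

  def current_oncall(today=None):
      """Return the username scheduled for on-call today, or None."""
      today = today or date.today()
      with open(SCHEDULE_FILE) as f:
          for start, end, person in csv.reader(f):
              if date.fromisoformat(start) <= today <= date.fromisoformat(end):
                  return person.strip()
      return None

  def update_dl(members):
      # Hypothetical stub: replace with the call that sets the membership
      # of your on-call distribution list in your mail system.
      print("on-call DL should now contain:", ", ".join(members))

  if __name__ == "__main__":
      person = current_oncall()
      if person:
          update_dl([person] + ALWAYS_NOTIFY)

Run it from cron or Task Scheduler at shift boundaries and the distribution list follows the schedule.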

Nice and free! You don't get a GUI to manage the schedule, and handling on-call shift swaps will be fully manual, but you are at least now sending alerts to people who know they need to respond to them. You can even build the list so that it always includes certain names that want to know whenever something happens, such as managers. The thundering-herd and circle-of-not-me problems are abated.

This system doesn't handle escalations at all, that's going to cost you either money or internal development time. You kind of do get what you pay for, after all.

How long should on-call shifts be?

That depends on your alert-frequency, how long it takes to remediate an alert, and the response time required.

Alert Frequency and Remediation:

  • Faster than once per 30 minutes:
    • They're a professional fire-fighter now. This is their full-time job, schedule them accordingly.
  • One every 30 minutes to an hour:
    • If remediation takes longer than 1 minute on average, the watch-stander can't do much of anything else but wait for alerts to show up. 8-12 hours is probably the longest shift for which you can expect reasonable performance.
    • If remediation takes less than a minute, 16 hours is the most you can expect because this frequency ensures no sleep will be had by the watch-stander.
  • One every 1-2 hours:
    • If remediation takes longer than 10 minutes on average, the watch-stander probably can't sleep on their shift. 16 hours is probably the maximum shift length.
    • If remediation takes less than 10 minutes, sleep is more possible. However, if your watch-standers are the kind of people who don't fall asleep fast, you can't rely on that. 1 day for people who sleep at the drop of a hat, 16 hours for the rest of us.
  • One every 2-4 hours:
    • Sleep will be significantly disrupted by the watch. 2-4 days for people who sleep at the drop of a hat. 1 day for the rest of us.
  • One every 4-6 hours:
    • If remediation takes longer than an hour, 1 week for people who sleep at the drop of a hat. 2-4 days for the rest of us.
  • Slower than one every 6 hours:
    • 1 week

Response Time:

This is a fuzzy one, since it's about work/life balance. If all alerts need to be responded to within 5 minutes of their arrival, the watch-stander needs to be able to respond in 5 minutes. This means no driving, and nothing else that requires not paying attention to the phone, such as kids' performances or after-work meetups. For a watch-stander who drives to work, their on-call shift can't overlap their commute.

For 30-minute response, things are easier. Short drives are doable, and longer ones too so long as the watch-stander pulls over to check each alert as it arrives. Kids' performances are still problematic, and longer commutes just as much.

And then there is the curve-ball known as "define 'response'". If Response means acking the alert, that's one thing and much less disruptive to off-hours life. If Response is defined as "starts working on the problem," that's much more disruptive, since the watch-stander has to have a laptop and bandwidth at all times.

The answers here will determine what a reasonable on-call shift looks like. A week of 5 minute time-to-work is going to cause the watch-stander to be house-bound for that entire week and that sucks a lot; there better be on-call pay associated with a schedule like that or you're going to get turnover as sysadmins go work for someone less annoying.


It's more than just making sure the right people are getting alerts; it's building a system for notifying the Right People in such a way that the alerts will get responded to and handled.

This will build a better alerting system overall.

Hiring for bias

I read yet another article on bias in in-person interviews lately. This comes as no surprise; bias is incredibly hard to overcome in hiring processes, which is why US Civil Service hiring procedures look so arcane. The money-quote is this one:

Studies have shown that hiring randomly off of the short-list is just as effective as conducting interviews.

Which is a great way of saying that hiring without interviews works just as well as going through the expense and effort of hiring with them.

Now, humans are incredibly social creatures and the idea of hiring off of paper scares the willies out of us. We at least want to see who we're hiring, so the drive to at least look at candidates is very strong. However, looking at candidates introduces a whole range of unconscious bias into the equation, especially if 'culture fit' is one of the top criteria being selected for. If those biases are not adequately controlled for, they're going to dominate the final pick off of the short-list. Also, candidates are very aware that this is a singular high-pressure marketing opportunity and will cheat every way possible to put themselves forward in a favorable way.

First, let's look at how candidates can skew things.

For some, every tweet is sacred. Each and every one is to be read, and caught up on if they've been away.

For some it's a stream of interesting things, to be looked in on when the whim strikes. There may be a list of really interesting people that gets checked more often.

For some it's ARG ARG NOISY GO AWAY ARG. This is why we have email.


For some sysadmin teams, every alert is sacred. To be acted upon the moment it arrives, and looked at when you've been away to be sure everything is handled.

For some sysadmin teams, the alerts are glanced at for suspicious patterns but otherwise ignored. Some really interesting ones may be forwarded by email-rules to phones.

For some sysadmin teams, the alerts are completely ignored. Problems are handled in email when other people notice them.


If you ever wondered how someone with a thousand twitter-follows can keep up, it's simple: they don't. It's lossy, it has to be. With that many follows you're dealing with a tweet every 30 seconds and you only keep up when you're bored. Or you make Lists of people you actually care about following, and only browse the master list when there isn't anything else going on.

The same dynamic holds true for a system with 600 individual servers with a variety of applications installed on them. Even if you only turn on OS basics like CPU/RAM/Swap/Disk, your alert stream is going to be very noisy. And like twitter there is going to be a lot of the server equivalent of, "I just had dinner, that... was a lot of food" tweets.

[Screenshots: routine alert emails from HouWeb01, HouWeb02, and HouWeb06]

So riveting. *thud*

When it comes to alerts you need to consider when you want to know something. Putting everything in email is a great way to ignore it, and maybe expose it to some rudimentary email-rule based noise-filters. But wouldn't it be better if that email stream were high quality? And you had an actual website or something you could query for the historical stuff? Yeah, that'd be great.

[Screenshot: an alert email from HouCluster]

Wait, what? Crap. Why?

Here is a nasty truth: I don't give a damn about CPU/RAM/Swap/Disk. Not in an ACT NOW, MONKEY! kind of way, anyway. I care about that stuff for trending and for historical troubleshooting. Think about the things we want to know and when we want to know them. Once you have an idea about that, you can start defining alerts that will look more like a well curated twitter-feed you don't want to miss, and less like the Alerts folder with 1269 unread messages in it.

So how can you make your alert stream more like a well curated feed, and not the firehose of noise it likely is?

Rule 1: Not all monitorables need a defined alert.

Just because you can track it doesn't mean you need to figure out an alert threshold for it and figure out what text to put into the email/sms. Definitely keep track of it if you have an operational need, but don't try to bother humans with it unless you have a definite need. "I might want to know once," is not definite, it's paranoia. Some systems, especially single points of failure, really do need that kind of alerting so set it up. Yes, please tell me that the 6-figure router I have one of is having a high CPU event, I want to know that. But please don't tell me about high NIC usage on the main database when the backups are staging to the archive system.

While there are people who really will skim through 800 tweets over breakfast in order to catch up with what happened overnight, and there are sysadmins who will read all 800 messages in the Alerts folder over breakfast, the rest of us look at that and go "AAAG information overload!" And skimming? That's great for things you need to know about eventually, but absolute crap when you need to react RIGHT BLOODY NOW.

Rule 2: Not everyone treats alerts the same way you do. Account for that.

You may look at every alert as it arrives and determine if it needs action, but your peers may be more of the "only tell me if something is actually wrong" variety, and have a different definition of "actually wrong". Alert systems are supported by people, so how those people work needs to be dealt with. Come to a consensus, and keep maintaining it. If you're an alerts-over-breakfast type, your cube-mate may be a page-me-if-anything-breaks type. The two of you need to figure out common ground.

Rule 3: Spend the effort to build the kind of alerts you actually need.

The out-of-the-box alerts are almost always... boring. I rarely, if ever, want to know about high CPU/RAM/Swap/Disk events the moment they happen. Some, like disk-space, I should be picking up on trending reports. Yes, we do have spikes we need to deal with (dumping core on a 768GB-RAM box? Tell me), and root/C: drives filling, but that should be a targeted alert, not one that goes on everything.

Another boring alert? Pingable. It backstops the actual thing I'm worried about, whether or not TCP/8443 is serving SSL and returns an HTTP/200 status code, but by itself isn't something I want to get an alert on. The big difference is if my monitoring system is smart enough to figure out the "IF !pingable THEN AlertSuppressDependentServices" logic. In that case, I really want pingable because I can trust that I know I have a whole fist of services down based on a single alert, and I'm not getting a storm of messages about the down services on the box.
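As a sketch of what the actual check looks like, here's a minimal example that tests whether a host answers SSL on TCP/8443 and returns an HTTP/200; the hostname and path are placeholders:

  import ssl
  import http.client

  def service_is_up(host, port=8443, path="/", timeout=10):
      """True if the host serves SSL on the port and returns HTTP 200."""
      try:
          conn = http.client.HTTPSConnection(
              host, port, timeout=timeout, context=ssl.create_default_context())
          conn.request("GET", path)
          status = conn.getresponse().status
          conn.close()
          return status == 200
      except (OSError, ssl.SSLError):
          return False

  print("UP" if service_is_up("web01.example.com") else "DOWN: page the on-call")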

Spend the time to build the alerts you actually want. If you're doing clustered services or horizontal scaling services you generally don't care much about single system outages, you care about whether or not the service is up and performing acceptably. These are harder alerts to build, but they're what you actually want in most cases.

Rule 4: Create different classes of alert based on urgency.

Some alerts are wake-up-a-human-right-now alerts, and some are more "bother whoever is on duty right now" alerts. And some are "so long as someone gets to it within a few hours we're OK." This will probably require setting up some kind of on-call schedule with a push notice other than email, such as SMS or a mobile app. Email is good for a lot, but it is only as good as the built-in filtering system's ability to isolate signal from a bunch of noise and forward it to email-to-SMS gateways.

Sending everything to email and trusting each recipient has good enough mail filters to do the correct urgency handling doesn't scale. And isn't consistent.

Rule 4a: Format your alert text with 160 character-limits in mind.

The chances of your alerts ending up on mobile, filtered through SMS, are pretty high. Best to plan for that from the start.
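A minimal sketch of the idea: front-load the information a human needs to make a decision, and trim the detail so that whatever survives the SMS gateway is still useful. The field layout here is just an assumption:

  def format_alert(severity, host, service, detail, limit=160):
      """Front-load severity/host/service and trim the detail to fit an SMS."""
      head = "%s %s/%s: " % (severity, host, service)
      room = max(limit - len(head), 0)
      body = detail if len(detail) <= room else detail[:room - 1] + "~"
      return head + body

  print(format_alert("CRIT", "web03", "https-8443", "no HTTP 200 for 5 minutes"))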

Rule 5: Ask the right questions during post-mortems.

Post-mortem processes are there for a reason, and that's to do better next time. This is a perfect opportunity to identify improvements in your monitoring and alert systems! Some questions you should be asking:

  • Did the monitoring system pick up the problem?
  • Did we have an alert configured to notice the problem?
  • Did we react to the monitoring system, a customer report, or a sysadmin with a bad feeling about this?
  • What changes can we make to allow the alert system to notify us of this kind of problem?
  • What changes can we make to allow us to notice this event building up so we can deal with it before it becomes a problem?

I have seen many, many cases where the monitoring system DID pick up the problem and DID alert us to it, but no one noticed because it was one alert in a pile of 50 that arrived in a given hour. We reacted when a customer asked us about why their stuff was down. What were the other 49 messages? A few were side-effects of the problem and the rest were routine CPU/RAM/Swap/Disk high-usage notices that we all stopped paying attention to.

Oops.

For a very high-profile event, that can have some serious consequences for a sysadmin team. You don't want those consequences.

Rule 6: Create some kind of trend-tracking system.

Trends allow you to get ahead of problems. It may be a weekly report with graphs of historical usage that gets mailed out for humans to look at and go, "Hm, that looks kinda bad," over. It may be an actual analytics system that draws in trend-lines and helpful, "vSphere cluster PROD will be at 100% RAM in 39.4 days" text. Or it may be a weekly recurring helpdesk ticket to have someone look at charts for an hour.

Whatever it is, you need something or someone to keep track of trends. Put all those monitorables you're not alarming on to good use and get ahead of the ball for a change. Figure out 3 months before you run out of disk-space that you're going to need to add more. Notice that the latest hotfix has increased RAM usage across the web cluster by 17%. These are not know-right-now things, they're the kind of thing that is alarming, but on a scale of weeks not minutes. You need that kind of alerting too!
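As an illustration of the trend-line idea, here's a minimal sketch that fits a straight line to usage samples and estimates how many days remain until a resource is full. The sample numbers are made up; in practice you'd pull them from whatever your monitoring system stores:

  def days_until_full(samples, capacity_pct=100.0):
      """samples: list of (day_number, percent_used) pairs, oldest first."""
      n = len(samples)
      sx = sum(d for d, _ in samples)
      sy = sum(u for _, u in samples)
      sxx = sum(d * d for d, _ in samples)
      sxy = sum(d * u for d, u in samples)
      slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # least-squares slope
      if slope <= 0:
          return None   # usage flat or shrinking; nothing to predict
      intercept = (sy - slope * sx) / n
      return (capacity_pct - intercept) / slope - samples[-1][0]

  weekly = [(0, 61.0), (7, 63.5), (14, 66.2), (21, 68.8)]   # % disk used, sampled weekly
  print("Volume full in roughly %.0f days" % days_until_full(weekly))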


It's time to cull your twitter alert stream.
