September 2014 Archives

Redundancy in the Cloud

Strange as it might be to contemplate, imagine what would happen if AWS went into receivership and was shut down to liquidate assets. What would that mean for your infrastructure? Your project? Or even your startup?

It would be pretty bad.

Startups have been deploying preferentially on AWS or other Cloud services for some time now, in part due to venture-capitalist pressure to avoid having physical infrastructure to liquidate should the startup go *pop*, and to be able to scale fast should a much-desired rocket-launch happen. If AWS shut down fully for, say, a week, the impact on pretty much everything would be tremendous.

Or what if it were Azure? Fully debilitating for those who are on it, but the wider impact would be smaller.

Cloud vendors are big things. In the old physical days we used to deal with the all-our-eggs-in-one-basket problem by putting eggs in multiple places. If you're on AWS, Amazon is very big on making sure you deploy across multiple Availability Zones, and on helping you become multi-region in the process if that's important to you. See? More than one basket for your eggs. I have to presume Azure and the others are similar, since I haven't used them.

Do you put your product on multiple cloud-vendors as your more-than-one-basket approach?

It isn't as easy as it was with datacenters, that's for sure.

This approach can work if you treat the Cloud vendors as nothing but Virtualization and block-storage vendors. The multiple-datacenter approach worked in large part because colos sell only a few things that impact the technology (power, space, network connectivity, physical access controls), though pricing and policies may differ wildly. Cloud vendors are not like that; they differentiate in areas that are technically relevant.

Do you deploy your own MySQL servers, or do you use RDS?
Do you deploy your own MongoDB servers, or do you use DynamoDB?
Do you deploy your own CDN, or do you use CloudFront?
Do you deploy your own Redis group, or do you use SQS?
Do you deploy your own Chef, or do you use OpsWorks?

The deeper down the hole of Managed Services you dive (and Amazon is very invested in pushing people to use them), the harder it is to take your toys and go elsewhere, or to run your toys on multiple Cloud infrastructures. Azure and the other vendors are building up their own managed-service offerings because AWS is successfully differentiating from everyone else by having the widest offering. The end-game here is to have enough managed-service offerings that virtual private servers don't need to be used at all.

Deploying your product on multiple cloud vendors requires either eschewing managed services entirely, or accepting greater management overhead due to very significant differences in how certain parts of your stack are managed. Cloud vendors are very much Infrastructure-as-Code, and deploying on both AWS and Azure is like deploying the same application in Java and .NET: it takes a lot of work, the dialect differences can be insurmountable, and the expertise required means different people are going to be working on each environment, which creates organizational challenges. Deploying on multiple cloud vendors is far harder than deploying in multiple physical datacenters, and this is very much intentional.

It can be done, it just takes drive.

  • New features will be deployed on one infrastructure before the others, and the others will follow on as the integration teams figure out how to port it.
  • Some features may only ever live on one infrastructure, because they're not deemed important enough to be worth the effort of porting to another. This happens even if policy says everything must be multi-infrastructure, because that's how people work.
  • The extra overhead of running in multiple infrastructures is guaranteed to become a target during cost-cutting drives.

The ChannelRegister article's assertion that AWS is now in "too big to fail" territory, and thus would require governmental support to prevent widespread industry collapse, is a reasonable one. It just plain costs too much to plan for that kind of disaster in corporate disaster-response planning.

The alerting problem

4100 emails.

That's the approximate number of alert emails that got auto-deleted while I was away on vacation. That number will rise further before I officially come back from vacation, but it's still a big number. The sad part is, 98% of those emails are for:

  • Problems I don't care about.
  • Unsnoozable known issues.
  • Repeated alarms for the first two points (Puppet, I'm looking at you).

We've made great efforts in our attempt to cut down our monitoring-fatigue problem, but we're not there yet. In part this is because the old, verbose monitoring system is still up and running, in part it's due to limitations in the alerting systems we have access to, and in part it's due to organizational habits that over-notify for alarms under the theory of, "if we tell everyone, someone will notice."

A couple of weeks ago, PagerDuty had a nice blog post about tackling alert fatigue, with a lot of good points to consider. I want to spend some time on point 6:

Make sure the right people are getting alerts.

How many of you have a mailing list you dump random auto-generated crap like cron errors and backup failure notices to?

This pattern is very common in sysadmin teams, especially teams that began as one or a very few people. It just doesn't scale. Also, you learn to just ignore a bunch of things like backup "failures" for always-open files. You don't build an effective alerting system with the assumption that alerts can be ignored; if you find yourself telling new hires, "oh ignore those, they don't mean anything," you have a problem.

The failure mode of tell-everyone is that everyone can assume someone else saw it first and is working on it. And no one works on it.

I've seen exactly this failure mode many times. I've even perpetrated it, since I know certain coworkers are always on top of certain kinds of alerts so I can safely ignore actually-critical alerts. It breaks down if those people have a baby and are out of the office for four weeks. Or were on the Interstate for three hours and not checking mail at that moment.

When this happens and big stuff gets dropped, technical management gets kind of cranky. Which leads to hypervigilance and...

The failure mode of tell-everyone is that everyone will pile into the problem at the same time and make things worse.

I've seen this one too. A major-critical alarm is sent to a big distribution list, and six admins immediately VPN in and start doing low-impact diagnostics. Diagnostics that aren't low-impact if six people are doing them at the same time. Diagnostics that aren't meant to be run in parallel and can return non-deterministic results if run that way, which tells six admins different stories about what's actually wrong, sending six admins in six different directions to solve not-actually-a-problem issues.

This is the Thundering Herd problem as it applies to sysadmins.

The usual fix for this is to build in a culture of, "I've got this," emails and to look for those messages before working on a problem.

The usual fix for this fails if admins do a little "verify the problem is actually a problem" work before sending the email and stomp on each other's toes in the process.

The usual fix for that is to build a culture of, "I'm looking into it," emails.

Which breaks down if a sysadmin is reasonably sure they're the only one who saw the alert and works on it anyway. Oops.


Really, these are all examples of telling the right people about the problem, but you need to go into more detail than "the right people". You need "the right person". You need an on-call schedule that will notify one or two of the Right People about problems. Build that with the expectation that if you're in the hot seat you will answer ALL alerts, and build a rotation so no one is in the hot seat long enough to start ignoring alarms, and you have a far more reliable alerting system.

PagerDuty sells such a scheduling system. But what if you can't afford X-dollars a seat for something like that? You have some options. Here is one:

An on-call distribution-list and scheduler tasks
This recipe will provide an on-call rotation using nothing but free tools. It won't work with all environments. Scripting or API access to the email system is required.

Ingredients:

    • 1 on-call distribution list.
    • A list of names of people who can go into the DL.
    • A task scheduler such as cron or Windows Task Scheduler.
    • A database of who is supposed to be on-call when (can substitute a flat file if needed).
    • A scripting language that can talk to both the email system's management interface and the database.

Instructions:

Build a script that can query the database (or flat file) to determine who is supposed to be on-call right now, and can update the distribution list with that name. PowerShell can do all of this for full MS-stack environments. For non-MS environments, more creativity may be needed.

Populate the database (or flat-file) with the times and names of who is to be on-call.

Schedule execution of the script using a task scheduler.

Configure your alert-emailing system to send mail to the on-call distribution list.
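
For the non-MS case, here is a minimal sketch of what such a script could look like. It is illustrative only: it assumes the schedule lives in a flat CSV file and that the alert address is a Postfix/sendmail-style entry in /etc/aliases on the mail host. The paths, alias name, and addresses are made up for illustration, and an Exchange shop would swap the alias-rewriting part for distribution-group PowerShell cmdlets instead.

    #!/usr/bin/env python3
    # oncall_rotate.py -- illustrative sketch, not production tooling.
    # Reads a flat-file schedule and points the 'oncall' mail alias at
    # whoever is on watch right now. Paths and addresses are hypothetical.
    #
    # schedule.csv, one shift per line:
    #   2014-09-01T08:00,2014-09-08T08:00,alice@example.com
    #   2014-09-08T08:00,2014-09-15T08:00,bob@example.com

    import csv
    import re
    import subprocess
    from datetime import datetime

    SCHEDULE  = "/etc/oncall/schedule.csv"
    ALIASES   = "/etc/aliases"
    ALIAS     = "oncall"
    ALWAYS_CC = ["noc-managers@example.com"]   # the always-want-to-know names

    def current_oncall(now):
        """Return the address whose shift covers 'now', or None."""
        with open(SCHEDULE) as fh:
            for start, end, address in csv.reader(fh):
                shift_start = datetime.strptime(start, "%Y-%m-%dT%H:%M")
                shift_end = datetime.strptime(end, "%Y-%m-%dT%H:%M")
                if shift_start <= now < shift_end:
                    return address.strip()
        return None

    def update_alias(address):
        """Rewrite the alias line in /etc/aliases and rebuild the alias db."""
        new_line = "%s: %s\n" % (ALIAS, ", ".join([address] + ALWAYS_CC))
        with open(ALIASES) as fh:
            lines = fh.readlines()
        pattern = re.compile(r"^%s\s*:" % ALIAS)
        lines = [new_line if pattern.match(line) else line for line in lines]
        if new_line not in lines:   # alias wasn't there yet; append it
            lines.append(new_line)
        with open(ALIASES, "w") as fh:
            fh.writelines(lines)
        subprocess.check_call(["newaliases"])   # rebuild the alias database

    if __name__ == "__main__":
        person = current_oncall(datetime.now())
        if person:
            update_alias(person)

Run it from cron with something like "0 * * * * /usr/local/bin/oncall_rotate.py" and the alias gets re-pointed every hour, so a shift change in the schedule file takes effect within the hour.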

Nice and free! You don't get a GUI to manage the schedule, and handling on-call shift swaps will be fully manual, but at least you are now sending alerts to people who know they need to respond to them. You can even build the watch-list so that it always includes certain names that want to know whenever something happens, such as managers. The thundering-herd and circle-of-not-me problems are abated.

This system doesn't handle escalations at all; that's going to cost you either money or internal development time. You kind of do get what you pay for, after all.

How long should on-call shifts be?

That depends on your alert-frequency, how long it takes to remediate an alert, and the response time required.

Alert Frequency and Remediation:

  • Faster than once per 30 minutes:
    • They're a professional fire-fighter now. This is their full-time job, schedule them accordingly.
  • One every 30 minutes to an hour:
    • If remediation takes longer than 1 minute on average, the watch-stander can't do much of anything else but wait for alerts to show up. 8-12 hours is probably the longest shift you can expect reasonable performance from.
    • If remediation takes less than a minute, 16 hours is the most you can expect because this frequency ensures no sleep will be had by the watch-stander.
  • One every 1-2 hours:
    • If remediation takes longer than 10 minutes on average, the watch-stander probably can't sleep on their shift. 16 hours is probably the maximum shift length.
    • If remediation takes less than 10 minutes, sleep is more possible. However, if your watch-standers are the kind of people who don't fall asleep fast, you can't rely on that. 1 day for people who sleep at the drop of a hat, 16 hours for the rest of us.
  • One every 2-4 hours:
    • Sleep will be significantly disrupted by the watch. 2-4 days for people who sleep at the drop of a hat. 1 day for the rest of us.
  • One every 4-6 hours:
    • If remediation takes longer than an hour, 1 week for people who sleep at the drop of a hat. 2-4 days for the rest of us.
  • Slower than one every 6 hours:
    • 1 week
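
Those rules of thumb reduce to a small lookup if you ever want your scheduling tooling to check them. The sketch below is nothing more than the list above restated as a hypothetical function; the thresholds and shift lengths come straight from the bullets.

    # Illustrative only: the shift-length guidance above, restated as code.
    # alerts_per_hour is the average alert frequency, remediation_minutes the
    # average time to deal with one alert, and sleeps_easily separates the
    # sleep-at-the-drop-of-a-hat folks from the rest of us.
    def max_shift_hours(alerts_per_hour, remediation_minutes, sleeps_easily):
        if alerts_per_hour > 2:          # faster than one per 30 minutes
            return 8                     # a full-time job; a normal work shift
        if alerts_per_hour > 1:          # one every 30-60 minutes
            return 16 if remediation_minutes < 1 else 12
        if alerts_per_hour > 0.5:        # one every 1-2 hours
            if remediation_minutes > 10:
                return 16
            return 24 if sleeps_easily else 16
        if alerts_per_hour > 0.25:       # one every 2-4 hours
            return 96 if sleeps_easily else 24    # 2-4 days vs 1 day
        if alerts_per_hour > 1 / 6.0:    # one every 4-6 hours (the post only
            return 168 if sleeps_easily else 96   # quantifies long remediations)
        return 168                       # slower than one per 6 hours: 1 week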

Response Time:

This is a fuzzy one, since it's about work/life balance. If all alerts need to be responded to within 5 minutes of their arrival, the watch-stander needs to be able to respond in 5 minutes. This means no driving or doing anything that requires not paying attention to the phone, such as kids' performances or after-work meetups. For a watch-stander who drives to work, their on-call shift can't overlap their commute.

For 30-minute response, things are easier. Short drives are fine, and longer ones too, so long as the watch-stander pulls over to check each alert as it arrives. Kids' performances are still problematic, and longer commutes just as much.

And then there is the curve-ball known as "define 'response'". If response means acking the alert, that's one thing and much less disruptive to off-hours life. If response is defined as "starts working on the problem," that's much more disruptive, since the watch-stander has to have a laptop and bandwidth at all times.

The answers here will determine what a reasonable on-call shift looks like. A week of 5-minute time-to-work is going to leave the watch-stander house-bound for that entire week, and that sucks a lot; there had better be on-call pay associated with a schedule like that, or you're going to get turnover as sysadmins go work for someone less annoying.


It's about more than just making sure the right people are getting alerts; it's about building a system that notifies the Right People in such a way that the alerts actually get responded to and handled.

This will build a better alerting system overall.

Hiring for bias

I read yet another article on bias in in-person interviews recently. This comes as no surprise; bias is incredibly hard to overcome in hiring processes, which is why US Civil Service hiring procedures look so arcane. The money-quote is this one:

Studies have shown that hiring randomly off of the short-list is just as effective as conducting interviews.

Which is a great way of saying that hiring without interviews works just as well as going to the expense and effort of hiring with them.

Now, humans are incredibly social creatures and the idea of hiring off of paper scares the willies out of us. We at least want to see who we're buying (er, hiring), so the drive to at least look at candidates is very strong. However, looking at candidates introduces a whole range of unconscious bias into the equation, especially if 'culture fit' is one of the top criteria being selected for. If those biases are not adequately controlled for, they're going to dominate the final pick off of the short-list. Also, candidates are very aware that this is a singular, high-pressure marketing opportunity and will cheat every way possible to put themselves forward in a favorable way.

First, let's look at how candidates can skew things.