Redundancy in the Cloud

Strange as it might be to contemplate, imagine what would happen if AWS went into receivership and was shut down to liquidate assets. What would that mean for your infrastructure? Your project? Or even your startup?

It would be pretty bad.

Startups have been deploying preferentially on AWS or other Cloud services for some time now, in part due to venture-capitalist pressure to have no physical infrastructure to liquidate should the startup go *pop*, and to be able to scale fast should a much-desired rocket-launch happen. If AWS shut down fully for, say, a week, the impact on pretty much everything would be tremendous.

Or what if it were Azure? Fully debilitating for those that are on it, but the wider impact would be smaller.

Cloud vendors are big things. In the old physical days we used to deal with the all-our-eggs-in-one-basket problem by putting eggs in multiple places. If you're on AWS, Amazon is very big on making sure you deploy across multiple Availability Zones, and will help you become multi-region in the process if that's important to you. See? More than one basket for your eggs. I have to presume Azure and the others are similar, since I haven't used them.

Do you put your product on multiple cloud-vendors as your more-than-one-basket approach?

It isn't as easy as it was with datacenters, that's for sure.

This approach can work if you treat the Cloud vendors as nothing but virtualization and block-storage vendors. The multiple-datacenter approach worked in large part because colos sell only a few things that impact the technology (power, space, network connectivity, physical access controls), though pricing and policies may differ wildly. Cloud vendors are not like that; they differentiate in areas that are technically relevant.

Do you deploy your own MySQL servers, or do you use RDS?
Do you deploy your own MongoDB servers, or do you use DynamoDB?
Do you deploy your own CDN, or do you use CloudFront?
Do you deploy your own Redis group, or do you use SQS?
Do you deploy your own Chef, or do you use OpsWorks?

The deeper down the hole of Managed Services you dive (and Amazon is very invested in pushing people to use them), the harder it is to take your toys and go elsewhere, or to run your toys on multiple Cloud infrastructures. Azure and the other vendors are building up their own managed-service offerings because AWS is successfully differentiating itself from everyone else by having the widest offering. The end-game here is to have enough managed-service offerings that virtual private servers don't need to be used at all.

Deploying your product on multiple cloud vendors requires either eschewing managed services entirely or accepting greater management overhead due to very significant differences in how certain parts of your stack are managed. Cloud vendors are very much Infrastructure-as-Code, and deploying on both AWS and Azure is like deploying the same application in both Java and .NET: it takes a lot of work, the dialect differences can be insurmountable, and the expertise required means different people will be working on each environment, which creates organizational challenges. Deploying on multiple cloud vendors is far harder than deploying in multiple physical datacenters, and this is very much intentional.
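
To make the "nothing but virtualization and block-storage vendors" idea concrete, here is a minimal sketch of hiding two vendors' object storage behind one tiny interface, using the boto3 and azure-storage-blob Python SDKs. The bucket, container, and environment-variable names are illustrative assumptions, not anything from a real deployment.

    # Sketch: treat both vendors as dumb object storage behind one interface,
    # so application code never touches boto3 or the Azure SDK directly.
    from abc import ABC, abstractmethod

    import boto3                                       # AWS SDK for Python
    from azure.storage.blob import BlobServiceClient   # Azure Storage SDK

    class ObjectStore(ABC):
        @abstractmethod
        def put(self, key: str, data: bytes) -> None: ...

    class S3Store(ObjectStore):
        def __init__(self, bucket: str):
            self._s3 = boto3.client("s3")
            self._bucket = bucket

        def put(self, key: str, data: bytes) -> None:
            self._s3.put_object(Bucket=self._bucket, Key=key, Body=data)

    class AzureBlobStore(ObjectStore):
        def __init__(self, connection_string: str, container: str):
            self._svc = BlobServiceClient.from_connection_string(connection_string)
            self._container = container

        def put(self, key: str, data: bytes) -> None:
            blob = self._svc.get_blob_client(container=self._container, blob=key)
            blob.upload_blob(data, overwrite=True)

    # Application code targets the interface, so "which basket" is a deploy-time choice:
    # store: ObjectStore = S3Store("prod-artifacts")                        # on AWS
    # store = AzureBlobStore(os.environ["AZ_CONN"], "prod-artifacts")       # on Azure
    # store.put("builds/app-1.2.3.tar.gz", artifact_bytes)

The moment you reach for RDS or DynamoDB instead of something you run yourself, that thin wrapper stops being thin, which is the whole point of the paragraph above.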

It can be done, it just takes drive.

  • New features will be deployed on one infrastructure before the others, and the others will follow on as the integration teams figure out how to port it.
  • Some features may only ever live on one infrastructure as they're not deemed important enough to go to all of the effort to port to another infrastructure. Even if policy says everything must be multi-infrastructure, because that's how people work.
  • The extra overhead of running in multiple infrastructures is guaranteed to become a target during cost-cutting drives.

The ChannelRegister article's assertion that AWS is now in "too big to fail" territory, and thus would require governmental support to prevent wide-spread industry collapse, is a reasonable one. It just plain costs too much to plan for that kind of disaster in corporate disaster-response planning.

The alerting problem

4100 emails.

That's the approximate number of alert emails that got auto-deleted while I was away on vacation. That number will rise further before I officially come back from vacation, but it's still a big number. The sad part is, 98% of those emails are for:

  • Problems I don't care about.
  • Unsnoozable known issues.
  • Repeated alarms for the first two points (puppet, I'm looking at you)

We've made great efforts in our attempt to cut down our monitoring fatigue problem, but we're not there yet. In part this is because the old, verbose monitoring system is still up and running, in part this is due to limitations in the alerting systems we have access to, and in part due to organizational habits that over-notify for alarms under the theory of, "if we tell everyone, someone will notice."

A couple of weeks ago, PagerDuty had a nice blog-post about tackling alert fatigue, with a lot of good points to consider. I want to spend some time on point 6:

Make sure the right people are getting alerts.

How many of you have a mailing list you dump random auto-generated crap like cron errors and backup failure notices to?

This pattern is very common in sysadmin teams, especially teams that began as one or a very few people. It just doesn't scale. Also, you learn to just ignore a bunch of things like backup "failures" for always-open files. You don't build an effective alerting system with the assumption that alerts can be ignored; if you find yourself telling new hires, "oh ignore those, they don't mean anything," you have a problem.

The failure mode of tell-everyone is that everyone can assume someone else saw it first and is working on it. And no one works on it.

I've seen exactly this failure mode many times. I've even perpetrated it, since I know certain coworkers are always on top of certain kinds of alerts so I can safely ignore actually-critical alerts. It breaks down if those people have a baby and are out of the office for four weeks. Or were on the Interstate for three hours and not checking mail at that moment.

When this happens and big stuff gets dropped, technical management gets kind of cranky. Which leads to hypervigilance and...

The failure mode of tell-everyone is that everyone will pile into the problem at the same time and make things worse.

I've seen this one too. A major-critical alarm is sent to a big distribution list, and six admins immediately VPN in and start doing low-impact diagnostics. Diagnostics that aren't low impact if six people are doing them at the same time. Diagnostics that aren't meant to be run in parallel and can return non-deterministic results if run that way, which tells six admins six different stories about what's actually wrong, sending them in six different directions to solve not-actually-a-problem issues.

This is the Thundering Herd problem as it applies to sysadmins.

The usual fix for this is to build a culture of "I've got this" emails and to look for those messages before working on a problem.

The usual fix for this fails if admins do a little "verify the problem is actually a problem" work before sending the email and stomp on each other's toes in the process.

The usual fix for that is to build a culture of, "I'm looking into it," emails.

Which breaks down if a sysadmin is reasonably sure they're the only one who saw the alert and works on it anyway. Oops.


Really, these are all examples of telling the right people about the problem, but you need to go into more detail than "the right people." You need "the right person." You need an on-call schedule that will notify one or two of the Right People about problems. Build it with the expectation that whoever is in the hot seat answers ALL alerts, build a rotation so no one is in the hot seat long enough to start ignoring alarms, and you have a far more reliable alerting system.

PagerDuty sells such a scheduling system. But what if you can't afford X-dollars a seat for something like that? You have some options. Here is one:

An on-call distribution-list and scheduler tasks
This recipe will provide an on-call rotation using nothing but free tools. It won't work in all environments; scripting or API access to the email system is required.

Ingredients:

    • 1 on-call distribution list.
    • A list of names of people who can go into the DL.
    • A task scheduler such as cron or Windows Task Scheduler.
    • A database of who is supposed to be on-call when (can substitute a flat file if needed)
    • A scripting language that can talk to both email system management and database.

Instructions:

Build a script that can query the database (or flat file) to determine who is supposed to be on-call right now, and can update the distribution list with that name. PowerShell can do all of this for full MS-stack environments. For non-MS environments more creativity may be needed; a rough sketch of one approach appears after these instructions.

Populate the database (or flat-file) with the times and names of who is to be on-call.

Schedule execution of the script using a task scheduler.

Configure your alert-emailing system to send mail to the on-call distribution list.
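
Here is a minimal sketch of that script for a non-MS shop, assuming a Postfix-style /etc/aliases entry stands in for the distribution list and the schedule lives in a flat CSV file. The file paths and CSV layout are illustrative assumptions; for MS-stack shops the same job is a few Exchange distribution-group cmdlets in PowerShell, as noted above.

    #!/usr/bin/env python3
    """Point the 'oncall' alias at whoever the schedule says is on duty.

    Sketch assumptions: a Postfix-style /etc/aliases file, and a flat-file
    schedule (oncall-schedule.csv) with lines like 2015-03-02,2015-03-09,alice
    (start date, end date, username). Run it from cron every hour or so.
    """
    import csv
    import re
    import subprocess
    from datetime import date, datetime

    SCHEDULE = "/etc/oncall-schedule.csv"   # illustrative path
    ALIASES = "/etc/aliases"
    FALLBACK = "sysadmin-team"              # who gets the mail if nobody is scheduled

    def current_oncall(today: date) -> str:
        """Return the scheduled name covering today, or the fallback alias."""
        with open(SCHEDULE, newline="") as fh:
            for row in csv.reader(fh):
                if len(row) != 3:
                    continue                # skip blank or malformed lines
                start, end, name = row
                if (datetime.strptime(start, "%Y-%m-%d").date() <= today
                        < datetime.strptime(end, "%Y-%m-%d").date()):
                    return name.strip()
        return FALLBACK

    def update_alias(name: str) -> None:
        """Rewrite the 'oncall:' line in /etc/aliases and rebuild the alias db."""
        with open(ALIASES) as fh:
            text = fh.read()
        new = re.sub(r"(?m)^oncall:.*$", f"oncall: {name}", text)
        if new != text:
            with open(ALIASES, "w") as fh:
                fh.write(new)
            subprocess.run(["newaliases"], check=True)

    if __name__ == "__main__":
        update_alias(current_oncall(date.today()))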

Nice and free! You don't get a GUI to manage the schedule, and handling on-call shift swaps will be fully manual, but you are at least now sending alerts to people who know they need to respond to alarms. You can even build the watch-list so that it always includes certain names that want to know whenever something happens, such as managers. The thundering-herd and circle-of-not-me problems are abated.

This system doesn't handle escalations at all; that's going to cost you either money or internal development time. You kind of do get what you pay for, after all.

How long should on-call shifts be?

That depends on your alert-frequency, how long it takes to remediate an alert, and the response time required.

Alert Frequency and Remediation (a rough sketch encoding these rules of thumb follows the list):

  • Faster than once per 30 minutes:
    • They're a professional fire-fighter now. This is their full-time job, schedule them accordingly.
  • One every 30 minutes to an hour:
    • If remediation takes longer than 1 minute on average, the watch-stander can't do much of anything else but wait for alerts to show up. 8-12 hours is probably the longest shift from which you can expect reasonable performance.
    • If remediation takes less than a minute, 16 hours is the most you can expect because this frequency ensures no sleep will be had by the watch-stander.
  • One every 1-2 hours:
    • If remediation takes longer than 10 minutes on average, the watch-stander probably can't sleep on their shift. 16 hours is probably the maximum shift length.
    • If remediation takes less than 10 minutes, sleep is more possible. However, if your watch-standers are the kind of people who don't fall asleep fast, you can't rely on that. 1 day for people who sleep at the drop of a hat, 16 hours for the rest of us.
  • One every 2-4 hours:
    • Sleep will be significantly disrupted by the watch. 2-4 days for people who sleep at the drop of a hat. 1 day for the rest of us.
  • One every 4-6 hours:
    • If remediation takes longer than an hour, 1 week for people who sleep at the drop of a hat. 2-4 days for the rest of us.
  • Slower than one every 6 hours:
    • 1 week
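
For the spreadsheet-inclined, here is a rough, non-authoritative encoding of those rules of thumb as a single function. The boundaries are exactly as fuzzy as the list above; treat the numbers as conversation starters, not policy.

    def max_shift_hours(alerts_per_hour: float, remediation_minutes: float,
                        heavy_sleeper: bool = False) -> float:
        """Suggested maximum on-call shift length, in hours, per the list above."""
        if alerts_per_hour > 2:              # faster than one per 30 minutes
            return 8                         # full-time fire-fighting: normal work shifts
        if alerts_per_hour >= 1:             # one every 30-60 minutes
            return 12 if remediation_minutes > 1 else 16
        if alerts_per_hour >= 0.5:           # one every 1-2 hours
            if remediation_minutes > 10:
                return 16
            return 24 if heavy_sleeper else 16
        if alerts_per_hour >= 0.25:          # one every 2-4 hours
            return 4 * 24 if heavy_sleeper else 24
        if alerts_per_hour >= 1 / 6:         # one every 4-6 hours
            return 7 * 24 if heavy_sleeper else 4 * 24
        return 7 * 24                        # slower than one every 6 hours: a week

    # An alert roughly every 3 hours with 20-minute fixes, light sleeper:
    # max_shift_hours(1/3, 20) -> 24, i.e. about a day in the hot seat.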

Response Time:

This is a fuzzy one, since it's about work/life balance. If all alerts need to be responded to within 5 minutes of their arrival, the watch-stander needs to be able to respond in 5 minutes. This means no driving or doing anything else that requires not paying attention to the phone, such as kids' performances or after-work meetups. For a watch-stander who drives to work, their on-call shift can't overlap their commute.

For 30-minute response, things are easier. Short drives are fine, and longer ones too, so long as the watch-stander pulls over to check each alert as it arrives. Kids' performances are still problematic, and longer commutes just as much.

And then there is the curve-ball known as "define 'response'". If response means acking the alert, that's one thing, and much less disruptive to off-hours life. If response is defined as "starts working on the problem," that's much more disruptive, since the watch-stander has to have a laptop and bandwidth at all times.

The answers here will determine what a reasonable on-call shift looks like. A week of 5-minute time-to-work is going to leave the watch-stander house-bound for that entire week, and that sucks a lot; there had better be on-call pay associated with a schedule like that or you're going to get turnover as sysadmins go work for someone less annoying.


It's more than just making sure the right people are getting alerts; it's building a system of notifying the Right People in such a way that the alerts will get responded to and handled.

This will build a better alerting system overall.

Hiring for bias

I read yet another article on bias in in-person interviews recently. This comes as no surprise; bias is incredibly hard to overcome in hiring processes, which is why US Civil Service hiring procedures look so arcane. The money-quote is this one:

Studies have shown that hiring randomly off of the short-list is just as effective as conducting interviews.

Which is a great way of saying that hiring without interviews works just as well as going to the expense and effort of doing it with interviews.

Now, humans are incredibly social creatures, and the idea of hiring off of paper scares the willies out of us. We at least want to see who we're hiring, so the drive to at least look at candidates is very strong. However, looking at candidates introduces a whole range of unconscious bias into the equation, especially if 'culture fit' is one of the top criteria being selected for. If those biases are not adequately controlled for, they're going to dominate the final pick off of the short-list. Also, candidates are very aware that this is a singular high-pressure marketing opportunity and will cheat every way possible to present themselves in a favorable way.

First, let's look at how candidates can skew things.

For some, every tweet is sacred. Each and every one is to be read, and caught up on if they've been away.

For some it's a stream of interesting things, to be looked in on when the whim strikes. There may be a list of really interesting people that gets checked more often.

For some it's ARG ARG NOISY GO AWAY ARG. This is why we have email.


For some sysadmin teams, every alert is sacred. To be acted upon the moment it arrives, and looked at when you've been away to be sure everything is handled.

For some sysadmin teams, the alerts are glanced at for suspicious patterns but otherwise ignored. Some really interesting ones may be forwarded by email-rules to phones.

For some sysadmin teams, the alerts are completely ignored. Problems are handled in email when other people notice them.


If you ever wondered how someone with a thousand twitter-follows can keep up, it's simple: they don't. It's lossy; it has to be. With that many follows you're dealing with a tweet every 30 seconds, and you only keep up when you're bored. Or you make Lists of people you actually care about following, and only browse the master list when there isn't anything else going on.

The same dynamic holds true for a system with 600 individual servers with a variety of applications installed on them. Even if you only turn on OS basics like CPU/RAM/Swap/Disk, your alert stream is going to be very noisy. And like twitter there is going to be a lot of the server equivalent of, "I just had dinner, that... was a lot of food" tweets.

[Alert-email screenshots: HouWeb01.png, HouWeb02.png, HouWeb06.png]

So riveting. *thud*

When it comes to alerts, you need to consider when you want to know something. Putting everything in email is a great way to ignore it, and maybe expose it to some rudimentary email-rule-based noise filters. But wouldn't it be better if that email stream were high quality? And if you had an actual website or something you could query for the historical stuff? Yeah, that'd be great.

[Alert-email screenshot: HouCluster.png]

Wait, what? Crap. Why?

Here is a nasty truth: I don't give a damn about CPU/RAM/Swap/Disk. Not in an ACT NOW, MONKEY! kind of way, anyway. I care about that stuff for trending and for historical troubleshooting. Think about the things we want to know and when we want to know them. Once you have an idea about that, you can start defining alerts that will look more like a well curated twitter-feed you don't want to miss, and less like the Alerts folder with 1269 unread messages in it.

So how can you make your alert stream more like a well curated feed, and not the firehose of noise it likely is?

Rule 1: Not all monitorables need a defined alert.

Just because you can track it doesn't mean you need to figure out an alert threshold for it and what text to put into the email/SMS. Definitely keep track of it if you have an operational need, but don't bother humans with it unless you have a definite need. "I might want to know once" is not definite; it's paranoia. Some systems, especially single points of failure, really do need that kind of alerting, so set it up. Yes, please tell me that the six-figure router I have exactly one of is having a high-CPU event; I want to know that. But please don't tell me about high NIC usage on the main database when the backups are staging to the archive system.

While there are people who really will skim through 800 tweets over breakfast in order to catch up with what happened overnight, and there are sysadmins who will read all 800 messages in the Alerts folder over breakfast, the rest of us look at that and go "AAAG information overload!" And skimming? That's great for things you need to know about eventually, but absolute crap when you need to react RIGHT BLOODY NOW.

Rule 2: Not everyone treats alerts the same way you do. Account for that.

You may look at every alert as it arrives and determine if it needs action, but your peers may be more of the "only tell me if something is actually wrong" variety, and have a different definition of "actually wrong". Alert systems are supported by people, so how those people work needs to be dealt with. Come to a consensus, and keep maintaining it. If you're an alerts-over-breakfast type, your cube-mate may be a page-me-if-anything-breaks type. The two of you need to figure out common ground.

Rule 3: Spend the effort to build the kind of alerts you actually need.

The out-of-the-box alerts are almost always... boring. I rarely, if ever, want to know about high CPU/RAM/Swap/Disk events the moment they happen. Some, like disk-space, I should be picking up from trending reports. Yes, there are spikes we need to deal with (dumping core on a 768GB-RAM box? Tell me), and root/C: drives filling up, but those should be targeted alerts, not ones that go on everything.

Another boring alert? Pingable. It backstops the actual thing I'm worried about, whether TCP/8443 is serving SSL and returning an HTTP/200 status code, but by itself it isn't something I want an alert on. The big difference is whether my monitoring system is smart enough to figure out the "IF !pingable THEN suppress dependent service alerts" logic. In that case I really do want pingable, because a single alert tells me I have a whole fist of services down, and I'm not getting a storm of messages about the down services on the box.
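
That suppression logic is simple enough to sketch independently of any particular monitoring product. A minimal, assumption-laden version follows; the Alert fields and check names are made up for illustration.

    from dataclasses import dataclass

    @dataclass
    class Alert:
        host: str
        check: str      # e.g. "ping", "https-8443"
        message: str

    def suppress_dependents(alerts: list[Alert]) -> list[Alert]:
        """If a host isn't pingable, keep only its host-down alert and drop its
        service alerts, so one dead box produces one page instead of a storm."""
        down_hosts = {a.host for a in alerts if a.check == "ping"}
        return [a for a in alerts if a.check == "ping" or a.host not in down_hosts]

    # Example: web03 is dead and generates ping + two service alerts; only ping survives.
    # web04 pings fine, so its service alert still goes out.
    # suppress_dependents([
    #     Alert("web03", "ping", "host unreachable"),
    #     Alert("web03", "https-8443", "no HTTP/200"),
    #     Alert("web03", "cpu", "no data"),
    #     Alert("web04", "https-8443", "no HTTP/200"),
    # ])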

Spend the time to build the alerts you actually want. If you're running clustered services or horizontally scaled services, you generally don't care much about single-system outages; you care about whether the service is up and performing acceptably. These are harder alerts to build, but they're what you actually want in most cases.

Rule 4: Create different classes of alert based on urgency.

Some alerts are wake-up-a-human-right-now alerts, some are "bother whoever is on duty right now" alerts, and some are "so long as someone gets to it within a few hours we're OK" alerts. This will probably require setting up some kind of on-call schedule with a push notice other than email, such as SMS or a mobile app. Email is good for a lot, but it's only as good as the built-in filtering system's ability to isolate signal from a bunch of noise and forward it to email-to-SMS gateways.

Sending everything to email and trusting that each recipient has good enough mail filters to do the correct urgency handling doesn't scale. And it isn't consistent.
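
A sketch of what routing by class can look like. The class names and the delivery hooks are placeholders for whatever paging, push, and ticketing systems you actually have; printing stands in for all of them here.

    from enum import Enum

    class Urgency(Enum):
        WAKE_A_HUMAN = 1     # page/SMS the on-call person right now
        BOTHER_ON_DUTY = 2   # push notice to whoever is on shift
        SOMETIME_TODAY = 3   # fine if someone gets to it within a few hours

    # Placeholder delivery hooks; swap in your real SMS gateway, push service,
    # and ticket system.
    def send_sms(msg: str) -> None:
        print(f"SMS to on-call: {msg}")

    def send_push(msg: str) -> None:
        print(f"Push to on-duty: {msg}")

    def open_ticket(msg: str) -> None:
        print(f"Ticket filed: {msg}")

    ROUTES = {
        Urgency.WAKE_A_HUMAN: send_sms,
        Urgency.BOTHER_ON_DUTY: send_push,
        Urgency.SOMETIME_TODAY: open_ticket,
    }

    def route(urgency: Urgency, msg: str) -> None:
        """Fan alerts out by class instead of dumping everything into one mailbox."""
        ROUTES[urgency](msg)

    # route(Urgency.WAKE_A_HUMAN, "PROD db cluster: no primary elected")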

Rule 4a: Format your alert text with 160 character-limits in mind.

The chances of your alerts ending up on mobile, filtered through SMS, are pretty high. Best to plan for that from the start.
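
A tiny sketch of keeping the payload SMS-sized. The field layout is an illustrative assumption; the point is simply to lead with what and where, and truncate the rest.

    def sms_summary(host: str, check: str, state: str, detail: str, limit: int = 160) -> str:
        """Lead with state/host/check; trim the detail so the whole message fits one SMS."""
        head = f"{state} {host} {check}: "
        room = limit - len(head)
        body = detail if len(detail) <= room else detail[: room - 3] + "..."
        return head + body

    # sms_summary("web03", "https-8443", "CRIT", "no HTTP/200 for 5m; last error: connection refused")
    # -> "CRIT web03 https-8443: no HTTP/200 for 5m; last error: connection refused"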

Rule 5: Ask the right questions during post-mortems.

Post-mortem processes are there for a reason, and that's to do better next time. This is a perfect opportunity to identify improvements in your monitoring and alert systems! Some questions you should be asking:

  • Did the monitoring system pick up the problem?
  • Did we have an alert configured to notice the problem?
  • Did we react to the monitoring system, a customer report, or a sysadmin with a bad feeling about this?
  • What changes can we make to allow the alert system to notify us of this kind of problem?
  • What changes can we make to allow us to notice this event building up so we can deal with it before it becomes a problem?

I have seen many, many cases where the monitoring system DID pick up the problem and DID alert us to it, but no one noticed because it was one alert in a pile of 50 that arrived in a given hour. We reacted when a customer asked us about why their stuff was down. What were the other 49 messages? A few were side-effects of the problem and the rest were routine CPU/RAM/Swap/Disk high-usage notices that we all stopped paying attention to.

Oops.

For a very high-profile event, that can have some serious consequences for a sysadmin team. You don't want those consequences.

Rule 6: Create some kind of trend-tracking system.

Trends allow you to get ahead of problems. It may be a weekly report with graphs of historical usage that gets mailed out for humans to look at and go, "Hm, that looks kinda bad," over. It may be an actual analytics system that draws trend-lines and prints helpful text like "vSphere cluster PROD will be at 100% RAM in 39.4 days." Or it may be a weekly recurring helpdesk ticket to have someone look at charts for an hour.

Whatever it is, you need something or someone keeping track of trends. Put all those monitorables you're not alarming on to good use and get ahead of the ball for a change. Figure out three months before you run out of disk-space that you're going to need to add more. Notice that the latest hotfix has increased RAM usage across the web cluster by 17%. These are not know-right-now things; they're the kind of thing that is alarming, but on a scale of weeks, not minutes. You need that kind of alerting too!
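
The "100% RAM in 39.4 days" style of estimate is just a straight line fit to historical samples. A back-of-the-envelope sketch, with made-up numbers:

    def days_until_full(samples: list[tuple[float, float]], capacity: float = 100.0):
        """Fit a least-squares line to (day, usage%) points and estimate how many
        days remain until usage hits capacity. Returns None if usage isn't growing."""
        n = len(samples)
        mean_x = sum(x for x, _ in samples) / n
        mean_y = sum(y for _, y in samples) / n
        slope = (sum((x - mean_x) * (y - mean_y) for x, y in samples)
                 / sum((x - mean_x) ** 2 for x, _ in samples))
        if slope <= 0:
            return None
        intercept = mean_y - slope * mean_x
        last_day = max(x for x, _ in samples)
        return (capacity - intercept) / slope - last_day

    # Fictional weekly RAM-usage samples (day number, percent used) for a cluster:
    usage = [(0, 61.0), (7, 64.5), (14, 67.8), (21, 71.2), (28, 74.9)]
    print(days_until_full(usage))   # roughly 51 days of headroom at this growth rate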


It's time to cull your twitter alert stream.

Shifting priorities


I came out as genderqueer 11 months ago.

I started a new job 6 months ago. I haven't talked about it much, but I put a lot of angst into how I'd present at interviews while I was between jobs. The one thing you do not want to look like at an interview is a weirdo. I elected to grit my teeth and go for the suit with tie, because I can do that, and for a few other reasons I'm not entirely proud of. It worked, though: I got a job with a nice raise over what I used to make, and it seems like the kind of place where I might have actual advancement potential! I haven't had that in... 11 years. Cool.

However, a new job means a new chance to make first impressions, and it was time to do it right. Also, this job has a written dress code (professional, not just office-casual), which required a major wardrobe upgrade from me. I'd been living in t-shirt-casual land for 11 years; my office-casual wardrobe was w-a-y out of date and heavily pruned after two cross-country moves. I needed an update, so what would I get?

There was less angst this time

Orders of complexity


When automating a business process, be it figuring out when user meta-data needs to be eliminated or how to set up a certain type of server, there are certain orders of complexity you face:

  1. Do THAT.
  2. If THIS then DO THAT.
  3. If THIS then DO THAT, except WHEN.
  4. If THIS then DO THAT, except WHEN, but do it anyway IF.
  5. If THIS then DO THAT, except WHEN, but do it anyway IF so long as THAT isn't true.
  6. If THIS then DO THAT, except WHEN, but do it anyway IF so long as THAT isn't true or THIS is true.

An example of case 6:

IF a user is terminated, disable all of their accounts immediately and archive their data within 14 days; except if a manager puts a hold on it, but delete it anyway after 31 days, so long as the manager isn't a C-level or doing so would cause political problems at which point just don't bother deleting.

Try automating that.
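
Here is roughly what that case-6 rule looks like once it's written down as code. Every field name and constant below is a stand-in for whatever your HR and identity systems actually track, which is part of the problem: some of the inputs ("political problems"?) have no clean data source at all.

    from datetime import date, timedelta

    def handle_terminated_user(user: dict, today: date) -> str:
        """The case-6 retention rule from above, as straight-line code."""
        if not user["terminated"]:
            return "nothing to do"

        disable_accounts(user)                          # immediately, no exceptions

        hold = user.get("manager_hold")
        if hold is None:
            if today >= user["termination_date"] + timedelta(days=14):
                return "archive data"
            return "wait (inside the 14-day archive window)"

        # A manager put a hold on it...
        if hold["manager_is_c_level"] or hold["political_problems"]:
            return "keep data indefinitely"             # just don't bother deleting
        if today >= user["termination_date"] + timedelta(days=31):
            return "delete data (hold expired)"
        return "wait (hold active until day 31)"

    def disable_accounts(user: dict) -> None:
        print(f"disabling accounts for {user['name']}")  # placeholder for the real thing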

This is the big problem with automating business processes: we humans glory in our exception creation and handling processes, and we're damned bad at being consistent about it. The same is true of managing a fleet of servers; if it's fully manual, the rules about what goes where are similarly complex. That 87-point checklist is a good sign of it.

When attempting to push out an Identity Management System project or a Configuration Management project, a lot of the hard work goes into simplifying the complexity in the existing rules. This is not a strictly technical problem; it's a people problem.

You have six groups of developers deploying Tomcat apps. Yay, standards! But each group has its own requirements for how it wants Tomcat and the supporting JVM massaged.

The people problem here is figuring out how many of those differences are just mythology ("this blog on the internet said doing it this way would be bad, and had charts. We don't like bad.") and how many have technical reasons behind them ("we run out of private-bytes if we do it that way"). That's a lot of negotiation, gentle easing into new processes to smooth workflow, and a lot of technical handholding to reassure everyone that this really is a better way. Even after all that work you may still end up with Puppet classes like this:

  • tomcat6::tc_repos
  • tomcat6::helpdeskportal
  • tomcat6::SNARC
  • tomcat7::bbc_crawler
  • tomcat7::HR_APPS
  • tomcat7::buildSystem

Fighting entropy is hard, hard work. Technically hard, and socially hard. In this new devops era of programmable-everything, that entropy has to be encoded, cross-checked, regression-tested, and maintained somehow. Entropy will win in the end, but you can at least kick the can down the road far enough that it'll have a harder time stealing your weekend.

If you're lucky enough to face a greenfield environment deployment of some kind (maybe you're figuring out how applications and access will happen in a public cloud), you can at least put rules and procedures in place at the start to help constrain the organic growth of exceptions. We... aren't always that lucky; sometimes we have to cram the entropy demon back into its jail the hard way.

But if you do have the entropy demon in jail? Yay! It's still going to be a constant fight to keep the business rules encoded in the automation simple. Keep up the fight. You don't want the Eschaton bootstrapping on your watch.

A taxonomy of IT users


Over the years I've seen a small collection of fake-names crop up in the sysadmin space. Here is a list:

BOFH
An oldie: a sysadmin who has gone over to the dark side.

Fred
Originally coined by Laura Chappell, Fred is the User From Hell. Or, The Power User who Isn't. Fred knows everything, or rather, thinks they do. They're wrong, but don't know it, and it makes your life all too interesting. Fred may be a manager, a peer, or a frequent-flier in the ticket queue.

Leeroy
Originally from a famous Warcraft video, this is the peer who just deploys stuff because it's cool. They... haven't learned (the hard way) how this can go wrong, so they aren't naturally suspicious. This could be the rose-colored glasses of youth and exuberance, or it could be a trusting nature. They'll learn.

Brent
Coined by The Phoenix Project, Brent is the person that ends up with their hands in everything one way or the other. They may be a single-point-of-knowledge, the only person who knows anything about topic X, or just the person that gets handed the weird stuff because, well, "Brent probably knows". A lot of us are a Brent, and it sure as heck makes getting long vacations approved difficult. There may be more than one of them, depending on topics.


I used to be a Leeroy, then I learned better.

I've been a Brent (oddball-stuff troubleshooter variety) at my current and last three jobs.

Right now people have figured out that I know how to use Wireshark to discover oddball problems, so I'm having to do a lot of packet analysis lately to rule them out. This isn't something I can cross-train on very well, but I'm going to have to find a way; people's eyes tend to glaze over when you get into TCP RFCs, and it's easier to make me do it than to learn it for themselves.

I did this at a previous job, so here are a few tips that will make it easier on everyone.

Plan at least 4, preferably 6, weeks out.

This gives your watch-standers the chance to arrange their lives around the schedule. At 6 weeks out, they'll be rearranging their lives around the schedule rather than rearranging the watch-schedule around their lives. For the person managing the schedule, this means fewer weekend swaps, and people are more likely to just know who is on call.

Send out calendar events for the rotations.

This is more of a weekend-watch thing, but putting calendar events in their calendars will further cement that they're obligated for that period. Also, it's a nice reminder in email (or whatever) that their shift has been scheduled.

Have the call list posted somewhere mobile-friendly.

Many times the watch-stander is merely the first responder; it's their job to figure out which domain the problem sits in (app/database/storage/hypervisor/facilities/etc.) and call the person who can actually fix the thingy. Having the call-list easily accessible from mobile is a really nice thing. This can be a Google Doc, or an app like PagerDuty. An Excel spreadsheet on SharePoint, not so much.

Have the duties of the watch-stander clearly defined.

This seems obvious, but... it isn't. There are some questions you need answers to; otherwise you're going to experience sadness:

  • How fast must they respond to automated alerts?
  • Do they need to always answer the phone, or is voice-mail acceptable so long as the response is within a window?
  • How fast must they respond to emails?

The answers to these questions tell the watch-stander how much of a life they can fit in around the schedule. A movie is probably Right Out, but nipping out to the grocery store for a few things... maybe. Do they need to turn on bluetooth while driving, or can they wait until they stop (or just not drive at all)? How much 'response' can happen on a phone will greatly affect the quality-of-life questions.

What kind of sadness can you expect?

Missed alerts mostly. Without clearly defined response guidelines your watch-standers are going to sleep through their phones, miss emails, and otherwise fail to meet performance expectations. If you write those expectations down, they're far more likely to stick to them!

If you're doing automatic alert assignment, have an escalation policy.

You need a backstop in case the watch-stander sleeps through something. The backstop tier should never get called, and when they do it's an Event, the kind people try to avoid, because something failed. Knowing that someone will notice if an automated alert gets ignored makes people more likely to respond in time.

If you're doing a 7-day watch, swap shifts on something other than Monday or Friday.

It depends on locality, but for US locations the Monday Holiday Law means that Mondays are occasionally holidays, and you don't want to do a watch-swap on a non-business day. In the same vein, many organizations have a rule that when an observed holiday (New Year's, for example) lands on a Saturday, the day off is the Friday before.

At the same time, there is a US holiday that camps on top of a Thursday (see the next item). Tuesday or Wednesday are good choices.

If you're doing a weekend watch, have a policy in place for handling long weekends.

The 4-day Thanksgiving holiday in the US is a great example, as that duty shift is double what a normal one would be. Decide whether you're going to create two shifts for it or allow one person to cover the whole thing, and decide well in advance. For some organizations the Friday after Thanksgiving is a major production day, so this may be moot ;).

Having watched recent events unfold, I'm beginning to wonder what effect employment contracts are having on how companies and their employees respond to catastrophic reputation-loss events. A certain well-known open-source company is undergoing this right now, which is why I'm thinking about it. Because they're big enough to have had lawyers go over their employment agreements for more than just intellectual-property clauses, I'm guessing those agreements have picked up a few other goodies along the way.

The Setup

  1. $Company does something.
  2. $Activists say, "Hey, that's bullshit."
  3. $Supporters say, "Dude, not cool."
  4. $Defenders say, "Hey, no biggie, eh?"

Steps 2-4 can happen in 30 minutes these days. At this point the news is still spreading, but now the interesting things start to happen. As the $Defenders and the $Supporters+$Activists start hammering on each other in social media, the ranks of both camps grow, at some point a subset of $Employees chimes in, and after a while maybe a $Company.Officer actually gives an official statement. By now the shit-storm is well and truly engaged.

Free Speech Means Freedom From Arrest (but not binding contracts between private parties)

Bloggers like me have known for over a decade now that mouthing off about one's employer is a great way to get fired. Some companies actually have clauses in their employment contracts that read, in effect:

You will only talk about the $Company in glowing terms. Or else.

The language is actually written more like, "under no circumstances will you do or say anything that will reflect negatively on the company," but the short version works for now. This is called a non-disparagement clause, and it is perfectly legal. What's more, it's common practice to use severance agreements to bind outgoing employees to those same clauses (if they weren't already bound by the employment agreement) in perpetuity, to ensure that the now-ex-employee doesn't mouth off about their old employer; less of a risk for voluntary departures, more of one for involuntary ones.

Your free speech has a price. Maybe it's $10K. Or $20K. $30K? $30K and 4 months health-insurance coverage to carry you to your next position? Okay, $75K, 5 months, and 10K shares of preferred stock. Have a nice life.

Shit-storm Meteorology

So you're in the $Activist+$Supporter camp, and $Company is being strangely silent on the topic of whatever bonehead thing they did. The only people from the company talking about it are firmly in the $Defender camp, which only cements your opinion that they're just not getting it and are hopelessly out of touch.

What if you're a $Supporter who is also an $Employee? If you have a non-disparagement agreement in your contract, voicing that opinion is to risk your job and future employability. Unless you're also a $Company.Officer, speaking up is a very bad idea no matter how loudly the $Activists are crying for redress (in fact, speaking up even if you're a $Defender is a bad idea, but it's less likely to pothole your career path). The Cyclone of Suck accelerates.

Stopping the Cyclone

It is possible to avoid the cyclone, or at least minimize it. It requires a fast response from a $Company.Officer in a way that even the $Activists can recognize as meaningful. This is a hard step to take, since it usually requires admitting fault (and thus liability), which is why the first statement is almost always something like...

There there, we're not evil. We promise. We do good things too.

...and is lambasted by the $Activists as not addressing the problem. This is likely to accelerate the cyclone, not spin it down.

Another way to slow it down requires hard choices by $Supporters who are also $Employees: voluntarily quitting over whatever happened, refusing a severance agreement (and thus accepting a period of no paycheck, or even no unemployment benefits), and saying why they left. It works better if more than one person makes this grand flounce.


This is just a theory of mine about how "never trash-talk your employer" clauses intersect with online debate. When I see people getting ever louder in indignation that some company or organization is remaining silent on some contentious topic, I do wonder if that's because the very people who could give the desired response have been preemptively legally gagged.

The number one piece of password advice is:

Only memorize a single complex password, use a password manager for everything else.

Gone is the time when you could plan on memorizing complex strings of characters using shift keys, letter substitution, and all of that. The threats surrounding passwords, and the sheer number of things that require them, mean that human fragility is security's greatest enemy. The use of prosthetic memory is now required.

It could be a notebook you keep with you everywhere you go.
It could be a text file on a USB stick you carry around.
It could be a text file you keep in Dropbox and reference on all of your devices.
It could be an actual password manager like 1Password or LastPass that installs in all of your browsers.

There are certain accounts that act as keys to other accounts. The first thing you need to protect like Fort Knox is the email account that receives activation messages for everything else you use, since that vector can be used to gain access to those other accounts through "Forgotten Password" links.

[Image: ForgottenEmail.png]

The second thing you need to protect like Fort Knox is the identity services used by other sites so they don't have to bother with user-account management; that would be all those "Log in with Twitter/Facebook/Google/Yahoo/Wordpress" buttons you see everywhere.

[Image: LoginEverywhere.png]

The problem with prosthetic memory is that, to beat out memorization, it needs to be everywhere you ever need to log into anything. Your laptop, phone, and tablet can all use the same manager, but the same isn't true when you go to a friend's house and get on their living-room machine to log into Hulu Plus real quick, since you have an account, they don't, and they have the awesome AV setup.

It's a hard problem. Your brain is always there; it's hard to beat that for convenience. But it's time to offload that particular bit of memorization to something else; your digital life and reputation depend on it.
