Sysadmins and risk-management

This crossed my timeline today:

This is a risk-management statement that contains all of a sysadmin's cynical-bastard outlook on IT infrastructure.

Disappointed because all of their suggestions for making the system more resilient to failure are shot down by management. Or only some of them are, which amounts to the same thing, since it still leaves disasters uncovered. Commence drinking heavily to compensate.

Frantically busy because they're trying to mitigate all the failure-modes their own damned self using not enough resources, all the while dealing with continual change as the mission of the infrastructure shifts over time.

A good #sysadmin always expects the worst.

Yes, we do. Because all too often, we're the only risk-management professionals a system has. We better understand the risks to the system than anyone else. A sysadmin who plans for failure is one who isn't first on the block when a beheading is called for by the outage-enraged user-base.

However, there are a few failure-modes in this setup that many, many sysadmins fall foul of.

Perfection is the standard.

And no system is perfect.

Humans are shit at gut-level risk-assessment, part 1: If you've had friends eaten by a lion, you see lions everywhere.

This abstract threat has been made all too real, and now lions. Lions everywhere. For sysadmins it's things like multi-disk RAID failures, UPS batteries blowing up, and restoration failures because an application changed its behavior and the existing backup solution was no longer adequate to restore state.

Sysadmins become sensitized to failure. Those once-in-ten-years failures, like datacenter transfer-switch failures or Amazon region-outages, seem immediate and real. I knew a sysadmin who was paralyzed by fear of a multi-disk RAID failure in their infrastructure. They used big disks, which weren't 100% 'enterprise' grade. Recoveries from a single-disk failure were long as a result. Too long. A disk going bad during the recovery was a near certainty from their point of view, never mind that the disks in question were less than 3 years old, and the RAID system they were using had bad-block detection as a background process. That window of outage was too damned long.
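To put a number on that particular lion, here is a rough back-of-the-envelope sketch. Everything in it is an assumption for illustration: the function name, the 3% annual failure rate, the 48-hour rebuild, and the 7 surviving disks are made up, not taken from that sysadmin's array. It also assumes disks fail independently at a constant rate, which real disks from the same batch under rebuild load may not do, and it ignores unrecoverable read errors entirely.

```python
import math

def second_failure_probability(remaining_disks, annual_failure_rate, rebuild_hours):
    """Chance that at least one surviving disk fails before the rebuild finishes.

    Assumes independent disks with a constant (exponential) failure rate.
    """
    hours_per_year = 24 * 365
    # Turn the annual failure rate into an hourly hazard rate.
    hourly_rate = -math.log(1.0 - annual_failure_rate) / hours_per_year
    # Any one of the remaining disks failing inside the window counts.
    return 1.0 - math.exp(-hourly_rate * remaining_disks * rebuild_hours)

# Hypothetical array: 7 surviving disks, 3% annual failure rate, 48-hour rebuild.
p = second_failure_probability(remaining_disks=7, annual_failure_rate=0.03, rebuild_hours=48)
print(f"Chance of a second failure during rebuild: {p:.2%}")
```

With those made-up inputs the answer lands well under one percent. The real number could be meaningfully worse once correlated failures are counted, but it is a long way from "near certainty", and that gap is what gut-level risk-assessment misses.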

Humans are shit at gut-level risk-assessment, part 2: Leeroy Jenkins sometimes gets the jackpot, so maybe you'll get that lucky...

This is why people think they can win mega-million lotteries or beat the roulette wheel at a casino. Because sometimes, you have to take a risk for a big payoff.

To sysadmins who have had friends eaten by lions, this way of thinking is completely alien. This is the developer who suggests swapping out the quite functional MySQL databases for Postgres. Or the peer sysadmin who really wants central IT to move away from forklift SAN-based disk-arrays in favor of a bunch of commodity hardware, FreeBSD, and ZFS.

Mm hm. No.

Leeroy Jenkins management and lion-eaten sysadmins make for really unhappy sysadmins.

When it isn't a dev or a peer sysadmin asking, but a manager...

Sysadmin team: It may be a better solution. But do you know how many lions are lurking in the transition process??

Management team: It's a better platform. Do it anyway.

Cue heavy drinking as everyone prepares to lose a friend to lions.


This is why I suggest rewording that statement:

A good #sysadmin always expects the worst.
A great #sysadmin doesn't let that rule their whole outlook.

A great sysadmin has awareness of business risk, not just IT risks. A sysadmin who has been scarred by lions and sees large felines lurking everywhere will be completely miserable in an early or mid-stage startup. In an early stage startup, the big risk on everyone's mind is running out of money and losing their jobs; so that once-in-three-years disaster we feel so acutely is not the big problem it seems. Yeah, it can happen and it could shutter the company if it does happen; but the money it would take to remediate that problem would be better spent expanding market share enough that we can assume we'll still be in business 2 years from now. A failure-obsessed sysadmin will not have job satisfaction in such a workplace.

One who has awareness of business risk will wait until the funding runway is long enough that pitching redundancy improvements will actually defend the business. This is a hard skill to learn, especially for people who've been pigeon-holed as worker-units their entire career. I find that asking myself one question helps:

How likely is it that this company will still be here in 2 years? 5? 7? 10?

If the answer to that is anything less than 'definitely', then there are failures that you can accept into your infrastructure.
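That question can be turned into rough arithmetic. The sketch below is only an illustration of the reasoning: the function name, the survival odds, the outage cost, and the mitigation cost are all hypothetical numbers, and real budgeting involves far more than one multiplication.

```python
def expected_loss_avoided(annual_probability, outage_cost, survival_by_year):
    """Expected outage cost a mitigation avoids, counting only the years
    the company is still around to suffer the outage."""
    return sum(annual_probability * outage_cost * p_alive
               for p_alive in survival_by_year)

# Hypothetical startup: a once-in-ten-years failure costing 500,000,
# with 80%, 50%, and 30% odds of the company existing in years 1, 2, and 3.
avoided = expected_loss_avoided(0.1, 500_000, [0.8, 0.5, 0.3])
mitigation_cost = 120_000  # made-up price of the redundancy project

print(f"Expected loss avoided: {avoided:,.0f}")
print("Worth pitching now" if avoided > mitigation_cost else "Accept the risk for now")
```

With these invented figures the mitigation costs more than the loss it is expected to prevent, so the risk is one to accept for now. Rerun it with 'definitely' odds, a 1.0 for every year, and the same project pays for itself.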