Recently in backup Category

Sysadmins and risk-management

This crossed my timeline today:

This is a risk-management statement that contains all of a sysadmin's cynical-bastard outlook on IT infrastructure.

Disappointed because all of their suggestions for making the system more resilient to failure are shot down by management. Or only some of them are, which might as well be all of them, since it still leaves disasters uncovered. Commence drinking heavily to compensate.

Frantically busy because they're trying to mitigate all the failure-modes their own damned self using not enough resources, all the while dealing with continual change as the mission of the infrastructure shifts over time.

A good #sysadmin always expects the worst.

Yes, we do. Because all too often, we're the only risk-management professionals a system has. We understand the risks to the system better than anyone else. A sysadmin who plans for failure is one who isn't first on the chopping block when the outage-enraged user-base calls for a beheading.

However, there are a few failure-modes in this setup that many, many sysadmins fall foul of.

Perfection is the standard.

And no system is perfect.

Humans are shit at gut-level risk-assessment, part 1: If you've had friends eaten by a lion, you see lions everywhere.

This abstract threat has been made all too real, and now lions. Lions everywhere. For sysadmins it's things like multi-disk RAID failures, UPS batteries blowing up, and restoration failures because an application changed its behavior and the existing backup solution was no longer adequate to restore state.

Sysadmins become sensitized to failure. Those once-in-ten-years failures, like datacenter transfer-switch failures or Amazon region-outages, seem immediate and real. I knew a sysadmin who was paralyzed in fear over a multi-disk RAID failure in their infrastructure. They used big disks, which weren't 100% 'enterprise' grade. Recoveries from a single-disk failure were long as a result. Too long. A disk going bad during the recovery was a near certainty from their point of view, never mind that the disks in question were less than 3 years old, and the RAID system they were using had bad-block detection as a background process. That window of outage was too damned long.

Humans are shit at gut-level risk-assessment, part 2: Leeroy Jenkins sometimes gets the jackpot, so maybe you'll get that lucky...

This is why people think they can win mega-million lotteries or beat the casino at roulette. Because sometimes, you have to take a risk for a big payoff.

To sysadmins who have had friends eaten by lions, this way of thinking is completely alien. This is the developer who suggests swapping out the quite functional MySQL databases for Postgres. Or the peer sysadmin who really wants central IT to move away from forklift SAN-based disk-arrays for a bunch of commodity hardware, FreeBSD, and ZFS.

Mm hm. No.

Leeroy Jenkins management and lion-eaten sysadmins make for really unhappy sysadmins.

When it isn't a dev or a peer sysadmin asking, but a manager...

Sysadmin team: It may be a better solution. But do you know how many lions are lurking in the transition process??

Management team: It's a better platform. Do it anyway.

Cue heavy drinking as everyone prepares to lose a friend to lions.


This is why I suggest rewording that statement:

A good #sysadmin always expects the worst.
A great #sysadmin doesn't let that rule their whole outlook.

A great sysadmin has awareness of business risk, not just IT risks. A sysadmin who has been scarred by lions and sees large felines lurking everywhere will be completely miserable in an early or mid-stage startup. In an early stage startup, the big risk on everyone's mind is running out of money and losing their jobs; so that once-in-three-years disaster we feel so acutely is not the big problem it seems. Yeah, it can happen, and it could shutter the company if it does; but the money for remediating that problem would be better spent expanding market share enough that we can assume we'll still be in business 2 years from now. A failure-obsessed sysadmin will not find job satisfaction in such a workplace.

One who has awareness of business risk will wait until the funding runway is long enough that pitching redundancy improvements will actually defend the business. This is a hard skill to learn, especially for people who've been pigeon-holed as worker-units their entire career. I find that asking myself one question helps:

How likely is it that this company will still be here in 2 years? 5? 7? 10?

If the answer to that is anything less than 'definitely', then there are failures that you can accept into your infrastructure.

As I look around the industry with an eye towards further employment, I've noticed a difference of philosophy between startups and the more established players. One easy way to see this difference is on their job postings.

  • If it says RHEL and VMWare on it, they believe in support contracts.
  • If it says CentOS and OpenStack on it, they believe in community support.

For the same reason that tech startups almost never use Windows if they can get away with it, they steer clear of other technologies that come with license costs or mandatory support contracts. Why pay the extra support cost when you can get the same service by hiring extremely smart people and using products with a large peer support community? Startups run lean, and all that extra cost is... cost.

And yet some companies find that they prefer to run with that extra cost. Some, like StackExchange, don't mind the extra licensing costs of their platform (Windows) because they're experts in it and can make it do exactly what they want it to do with a minimum of friction, which means the Minimum Viable Product gets kicked out the door sooner. A quicker MVP means quicker profitability, and that can pay for the added base-cost right there.

Other companies treat support contracts like insurance: something you carry just in case, as a hedge against disaster. Once you grow to a certain size, business continuity insurance investments start making a lot more sense. Running for the brass ring of market dominance without a net makes sense, but once you've grabbed it, keeping it needs investment. Backup vendors love to quote statistics on the percentage of businesses that fail after a major data-loss incident (it's a high percentage), and once you have a business worth protecting it's good to start protecting it.

This is part of why I'm finding that long-established companies tend to use technologies that come with support. Once you've dominated your sector, keeping that dominance means a contract to have technology experts on call 24/7 from the people who wrote it.

"We may not have to call RedHat very often, but when we do they know it'll be a weird one."


So what happens when startups turn into market dominators? All that no-support Open Source stuff is still there...

They start investing in business continuity, though the form may differ from company to company.

  • Some may make the leap from CentOS to RHEL.
  • Some may contract for 3rd party support for their OSS technologies (such as with 10gen for MongoDB).
  • Some may implement more robust backup solutions.
  • Some may extend their existing high-availability systems to handle large-scale local failures (like datacenter or availability-zone outages).
  • Some may acquire actual Business Continuity Insurance.

Investors may drive adoption of some BC investment, or may actively discourage it. I don't know, I haven't been in those board meetings and can argue both ways on it.

Which one do I prefer?

Honestly, I can work for either style. Lean OSS means a steep learning curve and a strong incentive to become a deep-dive troubleshooter of the platform, which I like to be. Insured means someone has my back if I can't figure it out myself, and I'll learn from watching them solve the problem. I'm easy that way.

Anyone taking DevOps to heart should read about Normal Accidents. The book is about failure modes in nuclear power plants: highly automated, extremely instrumented things that still manage to fail in spite of everything we do. The lessons here carry well into the highly automated environments we try to build in our distributed systems.

There are a couple of key learnings to take from this book and theory:

  • Root cause can be something seemingly completely unrelated to the actual problem.
  • Contributing causes can sneak in and turn what would be a well-handled event into something that gets you bad press.
  • Monitoring instrumentation failures can be sneaky contributing causes.
  • Single-failure events are easily handled, and may be invisible.
  • Multiple-failure events are much harder to handle.
  • Multiple-failure events can take months to show up if the individual failures happened over the course of months and were invisible.

The book had a failure mode much like this one:

After analysis, it was known that the flow direction of a specific coolant pipe was a critical item. If backflow occurred, hot fluid could enter areas not designed for handling it. As a result, a system was put in place to monitor flow direction, and automation put in place to close a valve on the pipe if backflow was detected.

After analyzing the entire system after a major event, it was discovered that the flow-sensor had correctly identified backflow, and had activated the valve close automation. However, it was also discovered that the valve had frozen open due to corrosion several months prior to the event. Additionally, the actuator had broken when the solenoid moved to close the valve. As a result, the valve was reported closed, and showed as such on the Operator panel, when in fact it was open.

  • The valve had been subjected to manual examination 9 months before the event, and was due to be checked again in 3 more months. However, it had failed between checks.
  • The actuator system was checked monthly and had passed every check. The actuator breakage happened during one of these monthly checks.
  • The sensor on the actuator was monitoring power draw for the actuator. If the valve was frozen, the actuator should notice an above-normal current draw. However, as the actuator arm was disconnected from the valve it experienced a below-normal current draw and did not detect this as an alarm condition.
  • The breaking of the actuator arm was noted in the maintenance report during the monthly check as a "brief flicker of the lamp" and put down as a 'blip'. The arm failed before the current meter triggered its event. As the system passed later tests, the event was disregarded.
  • The backflow sensor actually installed was not directional. It alarmed on zero-flow, not negative-flow.

Remediations:

  • Instrument the valve itself for open/close state.
  • Introduce new logic so that if the backflow sensor continues to detect backflow, raise alarms.
  • Replace the backflow sensor with a directional one as originally called for.
  • Add a new flow sensor behind the valve.
  • Change the alerting on the actuator sensor to alarm on too-low current draw as well as too-high.
  • Increase the frequency of visual inspection of the physical plant.

That valve being open caused Fun Times To Be Had. If that valve system had been operating correctly, the fault that caused the backflow would have been isolated as the system designers intended and the overall damage contained. However, this contributing cause, one that happened months before the triggering event, turned a minor problem into a major one.
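
The remediation list boils down to one principle: never trust a single reported state when you can cross-check it against an independent measurement. Here is a minimal sketch of that logic in Python; the sensor names, units, and thresholds are invented for illustration and are not from the book.

    # Hypothetical cross-checks for a valve that is supposed to be closed,
    # in the spirit of the remediations above.

    def check_valve(commanded, position, flow_lpm, actuator_amps):
        """Return a list of alarm strings; an empty list means all readings agree."""
        alarms = []

        # Compare what the automation *thinks* it did against the new
        # open/close instrumentation on the valve itself.
        if commanded == "closed" and position != "closed":
            alarms.append("valve commanded closed but position sensor reads %s" % position)

        # If the valve really closed, flow through the pipe should drop to
        # zero. Persistent flow after the close command means it did not.
        if commanded == "closed" and abs(flow_lpm) > 0.5:
            alarms.append("flow still present (%.1f L/min) after close command" % flow_lpm)

        # Alarm on too-low draw as well as too-high: a disconnected actuator
        # arm draws *less* current than normal, which is exactly what got missed.
        normal_amps = (2.0, 6.0)   # made-up normal operating band
        if not (normal_amps[0] <= actuator_amps <= normal_amps[1]):
            alarms.append("actuator current %.1f A outside normal band" % actuator_amps)

        return alarms

    # The failure described above: reported closed, actually open, arm disconnected.
    print(check_valve("closed", position="open", flow_lpm=-12.0, actuator_amps=0.3))

Any one of those checks firing would have turned a months-old invisible failure into a work order instead of a contributing cause.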

So, why did that reactor release radioactive materials into the environment? Well, it's complicated...

And yet, after reading the post-mortem report you look at what actually failed and think, 'and these are the jokers running our nuclear power plants? We're lucky we're not all glowing in the dark!'

We get the same kind of fault-trees in massively automated distributed systems. Take this entirely fictional, but oh-so-plausible failure cascade:

ExampleCorp was notified by their datacenter provider of the need for emergency power maintenance in their primary datacenter. ExampleCorp (EC) operated a backup datacenter and had implemented a hot failover method, tested twice a year, for moving production to the backup facility. EC elected to perform a hot failover to the backup facility prior to the power work in their primary facility.

Shortly after the failover completed the backup facility crashed hard. Automation attempted to fail back to the primary facility, but technicians at the primary facility had already begun, but not yet completed, safe-shutdown procedures. As a result, the fail-back was interrupted part way through, and production stopped cold.

Service recovery happened at the primary site after power maintenance completed. However, the cold-start script was out of date by over a year, so restoration was hampered by differences that came up during the startup process.

Analysis after the fact isolated several causes of the extensive downtime:

  • In the time since the last hot-failover test, EC had deployed a new three-node management cluster for their network switch configuration and software management system, one three-node cluster for each site.
  • The EC-built DNS synchronization script used to keep the backup and primary sites in sync was transaction-oriented. A network fault 5 weeks before the event meant the transactions related to the DNS update for the cluster deployment were dropped and not noticed.
  • The old three-node clusters were kept online "just in case".
  • The differences in cluster software versions between the two sites was displayed in EC's monitoring panel, but was not alarmed, and disregarded as a 'glitch' by Operations. Interviews show that Ops staff are aware that the monitoring system will sometimes hold onto stale data if it isn't part of an alarm.
  • At the time of the cluster migration Operations was testing a new switch firmware image. The image on the old cluster was determined to have a critical loading bug, which required attention from the switch vendor.
  • Two weeks prior to the event EC performed an update of switch firmware using new code that passed validation. The new firmware was replicated to all cluster members in both sites using automation based on the IP addresses of the cluster members. The old cluster members were not updated.
  • The automation driving the switch firmware update relied on the non-synchronized DNS entries, and reported no problems applying updates. The primary site got the known-good firmware, the backup site got the known-bad firmware.
  • The network load from the hot failover triggered the fault in the backup site's switch firmware, causing switches to reboot every 5 minutes.
  • Recovery logic in the application attempted to work around the massive network faults and ended up duplicating some database transactions, and losing others. Some corrupted data was transferred to the primary site before it was fully shut down.
  • Lack of technical personnel physically at the backup site hampered recovery from the backup site and extended the outage.
  • Out of date documentation hampered efforts to restart services from a cold stop.
  • The inconsistent state of the databases further delayed recovery.

That is a terrible-horrible-no-good-very-bad-day, yes indeed. However, it shows what I'm talking about here. Several small errors crept in to make what was supposed to be a perfectly handleable fault something that caused many hours of downtime. This fault would have been discovered during the next routine test, but that hadn't happened yet.
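
Several of those contributing causes have the same shape as the nuke-plant story: two systems disagreed about reality and nobody alarmed on the difference. Below is a hedged sketch of the kind of reconciliation check that would have caught the firmware split. The site names, version strings, and the idea of an 'expected' firmware inventory are invented for illustration; this is not how EC, or any real product, necessarily does it.

    # Hypothetical reconciliation check: instead of trusting that an update
    # transaction was applied, periodically compare the state each site
    # *reports* against the state you *intended*, and alarm on any
    # difference, including stale or missing data.

    EXPECTED_FIRMWARE = "12.4.7-validated"   # made-up version string

    def audit_firmware(site_reports):
        """site_reports: {site: {switch_name: reported_version_or_None}}"""
        problems = []
        for site, switches in site_reports.items():
            for name, version in switches.items():
                if version is None:
                    problems.append("%s/%s: no report (stale data?)" % (site, name))
                elif version != EXPECTED_FIRMWARE:
                    problems.append("%s/%s: running %s, expected %s"
                                    % (site, name, version, EXPECTED_FIRMWARE))
        return problems

    # The EC scenario: primary got the good image, backup silently kept the bad one.
    reports = {
        "primary": {"sw-core-1": "12.4.7-validated", "sw-core-2": "12.4.7-validated"},
        "backup":  {"sw-core-1": "12.4.1-knownbad",  "sw-core-2": None},
    }
    for problem in audit_firmware(reports):
        print("ALARM:", problem)

The specific check matters less than the habit: compare desired state against reported state on a schedule, and treat 'no data' as a failure rather than a glitch.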

Just like the nuke-plant failure, reading this list makes you go "what kind of cowboy outfit allows this kind of thing to happen?"

Or maybe, if it has happened to you, "Oh crimeny, I've so been there. Here's hoping I retire before it happens again."

It happens to us all. Netflix reduces this through the Chaos Monkey, using it to visibly trigger these small failures before they can cascade into big ones. And yet even they fall over when a really big failure happens naturally.

What can you do?

  • Accept that the multiple-failure combinatorics problem is infinite and you won't be able to capture every fail case.
  • Build your system to be as disaster resilient as possible.
  • Test your remediations, and do so regularly.
  • Validate your instrumentation is returning good results, and do so regularly.
  • Cross-check where possible.
  • Investigate glitches, and keep doing it after it gets tediously boring.
  • Cause small failures and force your system to respond to them.

These are all known best-practices, and yet people are lazy, or can't get sufficient management buy-in to do it (a 'minimum viable product' is likely excessively vulnerable to this kind of thing). We do what we can, snark at those who visibly can't, and hope our turn doesn't come up.
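
For the last item on that list you don't need Netflix's tooling to get started. Here is a minimal, hypothetical sketch of the idea: during working hours, kill one deliberately chosen, non-critical service instance and verify that the system noticed and recovered. The unit names and health-check URL are placeholders, not a recommendation for any particular environment.

    # Small-scale fault injection, loosely in the spirit of Chaos Monkey.
    # Kill one service instance, then confirm the system absorbed the loss.
    import random
    import subprocess
    import time
    import urllib.request

    CANDIDATE_UNITS = ["example-worker.service", "example-cache.service"]  # hypothetical
    HEALTH_URL = "http://localhost:8080/healthz"                           # hypothetical

    def healthy():
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
                return resp.status == 200
        except OSError:
            return False

    def inject_failure():
        victim = random.choice(CANDIDATE_UNITS)
        print("killing", victim)
        # SIGKILL the main process; the service manager's restart policy
        # is what we are actually testing here.
        subprocess.run(["systemctl", "kill", "--signal=SIGKILL", victim], check=True)

        time.sleep(60)  # give supervision and monitoring time to react
        if not healthy():
            # This is the whole point: a small failure the system could not absorb.
            raise SystemExit("service did not recover from the loss of " + victim)
        print("recovered; record the result and do it again tomorrow")

    if __name__ == "__main__":
        inject_failure()

The value isn't the script, it's the routine: small failures get triggered on purpose, in daylight, while people are watching, instead of at 3am by accident.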

Natural inconvenience

Because really, 'natural disaster' is not what we're dealing with.

Yes, I felt the earthquake. It was me and the one other ex-west-coastie in the office who realized what it was first. It was quite unmistakable in this unreinforced brick building. The ground just picked us up and dropped us a few times, the kind of motion a building like this just wasn't designed to handle. Some existing damage to the building may have gotten worse. There was a lot of brick dust thanks to the wood rafters dancing around. That's it.

At home we had two casualties. A drinking glass shook itself off a shelf and took a dive, breaking. And a mason jar had been leaning against a cabinet door, so when I opened it last night it fell at me, requiring me to parry it into a metal popcorn bowl (loud!) and leaving a large bruise on my left middle finger.

And we have a hurricane heading our way.

I'm not worried about the hurricane. On its current track, if we DO get a dead-center hit the eye will have tracked across 300-400 miles of land and will be significantly weakened ("tropical depression"). Far more likely is a glancing blow where it spins off the coast. Irene is a big storm so we would just get some of the western rain-bands. From what I hear, this'll be a lot like the big Winter rain-storms the PacNW gets, so nothing I haven't lived through before.

Data Protector deduplication

When last I looked at HP Data Protector deduplication, I was not impressed. The client-side requirements were a resource-hungry joke, and were seriously compromised by Microsoft failover clusters.

I found a use-case for it last week. We have a server in a remote office across a slow WAN link. Backing that up has always been a trial, but I had the right resources to try and use dedupe and at least get the data off-site.

Sometime between then and now (v6.11) HP changed things.

The 'enhincrdb' directory I railed against is empty. Having just finished a full and an incremental backup, I see that the amount of data transferred over the wire is exactly the same for full and incremental backups, but the amount of data stashed away is markedly different. Apparently the processing to figure out what needs to be in the backup has been moved from the client to the backup server, which makes this useless over slow links.

It means that enhanced incremental backups will take just as long as the fulls do, and we don't have time for that on our larger servers. We're still going to stick with an old-fashioned Full/Incremental rotation.
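
To see why the location of the dedup processing matters so much on a thin pipe, here is a toy model, emphatically not how Data Protector is implemented internally: with source-side dedup the client ships only chunks the store hasn't already seen, while with target-side dedup everything crosses the wire and the savings only happen on the storage end.

    # Toy comparison of wire traffic for source-side vs. target-side dedup.
    # Chunks are just strings here; "seen" is whatever the dedup store holds.

    def source_side_bytes(chunks, seen):
        # Client hashes locally and ships only chunks the store doesn't have.
        return sum(len(c) for c in chunks if c not in seen)

    def target_side_bytes(chunks, seen):
        # Client ships everything; the server discards duplicates after the fact.
        return sum(len(c) for c in chunks)

    full_backup = ["blockA" * 1000, "blockB" * 1000, "blockC" * 1000]
    store = set(full_backup)                            # the full already landed
    incremental = full_backup[:2] + ["blockD" * 1000]   # one changed block

    print("source-side incremental over the wire:", source_side_bytes(incremental, store))
    print("target-side incremental over the wire:", target_side_bytes(incremental, store))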

It's an improvement in that it doesn't require such horrible client-side resources. However, this implementation still has enough quirks that I just do not see the use-case.

Shopping for a datacenter

A nice check-list for things you want to have in your new datacenter was posted today on ServerFault. Some good things in there, and all in one list!

However, there is one thing that is not quite right.

"It should have Halon fire suppression, not sprinklers."
Actually, it should have both FM200 (or equivalent) AND sprinklers, at least according to most fire-codes.

Sprinklers in the datacenter? Yep. One of the last things I did at my old job (11/2003 to be exact) was help move into a newly built datacenter, and I had quite a shock when I saw sprinkler heads over the server racks. I asked my boss about that (she had been neck deep in the building of it), and she replied that yes, it was indeed fire-code, and that it was new since the last time she helped build a datacenter in the 80's.

Building and fire codes are nigh impossible to get at in their entirety online; it seems they're put behind pay-walls, so I can't link directly to them. Given which governmental agency was going to be putting machines in that room, I consider that fire-code ruling to be highly trustworthy. FM200 protects all that very expensive gear in the datacenter. The sprinklers protect the room and building.

Those sprinklers should be dry-pipe, pre-action sprinklers. No sense tempting fate more than required.

Tape is dead, Long live Disk!

Except if you're using HP Data Protector.

Much as I'd like to jump on the backup-to-disk de-dup bandwagon, I can't. Can't afford it. It all comes down to the cost-per-GB of storage in the backup system.

With tape, Data Protector licenses on the following items:

  • Per tape-drive over 2
  • Per tape library with a capacity between 50 and 250 slots
  • Per tape library that exceeds 250 slots
  • Per media pool with more than some-big-number of tapes

With disk, DP licenses on the following items:

  • Per TB in the backup-to-disk system

Obviously, the Disk side is much easier to license. In our environment we had something like 500 SDLT320 tapes, and our library had 6 drives and 45 slots. We only had to license the 4 extra tape drives.

Then our library started crapping out, and we outgrew it anyway. Prime time to figure out what the future holds for our backup environment. TO DISK!

HOLY CRAP that's expensive.

HP licenses their B2D space by the Terabyte. After you do the math it comes down to about $5/GB. Without using a de-duplication technology, you can easily make 10 copies or more of every bit of data subject to backup. Which means that for every 1 GB of data in the primary storage, 10GB of data is in the B2D system, and that'll set us back a whopping $50/GB. So... about the de-duplication system...

Too bad it doesn't work for non-file data, and kinda sorta explicitly doesn't work for clustered systems. Since 70% or so of our backup data is sourced from clustered file-servers or is non-file data (Exchange, SQL backups), this means the gains from HP's de-dup technology are pretty minor. Looks like we're stuck doing standard backups at $50/GB (or more).

So, about that 'dead' tape technology! We've already shelled out for the tape-drive licenses so that's a sunk cost. The library we want doesn't have enough slots to force us to get that license. All that's left is the media costs. Math math math, and the amortized cost of the entire library and media set comes to about $0.25/GB. Niiice. Factor in the magnification factor, and each 1 GB of backup will cost $2.50/GB, a far, far cry from $50/GB.
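
The arithmetic behind those two numbers is simple enough to show. A back-of-the-envelope sketch using the rough figures from this post; the 10x multiplier is the "10 copies or more of every bit of data" estimate above, not a vendor-published number.

    # Cost per GB of *primary* data protected, using this post's rough figures.
    RETENTION_MULTIPLIER = 10          # backup GB stored per primary GB

    b2d_license_per_gb = 5.00          # ~$5/GB of licensed backup-to-disk space
    tape_amortized_per_gb = 0.25       # ~$0.25/GB for library plus LTO4 media

    print("B2D:  $%.2f per primary GB" % (b2d_license_per_gb * RETENTION_MULTIPLIER))
    print("Tape: $%.2f per primary GB" % (tape_amortized_per_gb * RETENTION_MULTIPLIER))
    # B2D:  $50.00 per primary GB
    # Tape: $2.50 per primary GB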

We still have SOME backup to disk space. This is needed since these LTO4 drives are HUNGRY critters, and the only way to feed them fast enough to prevent shoe-shining is to back everything up to disk, and then copy the jobs to tape directly from disk. So long as we have a week's worth of free-space, we're good. This is a sunk cost too, happily.

So. To-disk backups may be the greatest thing since the invention of the tape-changing robot, but our software isn't letting us take advantage of it. 

New backup hardware

Friday represented the first production use of our new LTO4-based tape library. This replaced the old SDLT320-based Scalar 100 we've had for entirely too long. The simple fact that all of the media and drives are BRAND NEW should make our completion rate go very close to 100%. This excites me.

Friday we did a backup of our main file-serving cluster and the Blackboard content volume in a single job that streamed to a single tape drive.

Total data backed up: 6.41TB
Total time: 1475 minutes
Speed: 4669 MB/Min, or 77 MB/s

Still not flank speed for LTO4 (that's closer to 120 MB/s) but still markedly faster than the SDLT stuff we had been doing. The similar backup on the Scalar 100 took around 36 hours (2160 minutes) instead of the 24ish hours this one took, and it used 4 tape drives to do it.
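
For anyone checking the math, the throughput figure is just total data over elapsed time. A quick sketch follows; the reported numbers presumably come from the backup session report, and the small difference from this calculation is rounding in the 6.41TB total and in which unit convention gets used.

    # Rough throughput check for the backup above, treating TB as binary units.
    total_tb = 6.41
    minutes = 1475

    mb = total_tb * 1024 * 1024               # TB -> MB, binary convention
    print("%.0f MB/min" % (mb / minutes))     # ~4557 MB/min
    print("%.0f MB/s" % (mb / minutes / 60))  # ~76 MB/s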

Ahhhh, modern technology, how I've desired you.

*pets it*

Now to resist taking a fire ax to the old library. We have to surplus it through official channels, and they won't take it if it has been "obviously defaced". Ah well.

Spent money

It has been a week plus since we spent a lot of money and the question has been raised, what are we doing with that storage? Exactly?

It isn't fancy storage. In fact, it is the cheapest performant storage we could budget for. It's not SATA, but it is 7.2K RPM SAS. And there are 35TB of it. It's a server, with direct attached storage. Not a dedicated storage unit. Not fibre channel. An off the shelf server, a high quality RAID card, a bunch of storage shelves, and a pair of network ports.

A final decision hasn't been made yet for how we're presenting this storage to consumers, but iSCSI of some kind is the 90% likely choice. Whether that's Linux (a.k.a. the free option) or something else (a.k.a. the pay option) remains to be seen. The whole point of this storage is to be cheap per GB.

We're also adding a pair of Fibre Channel drive enclosures to our EVA4400 to provide true high-speed, low-latency service at a much reduced cost versus the EVA6100. Yes, FC drives are EOL in a very short while, but the EVA4400 doesn't support SAS (yet). This is the kind of thing our ESX cluster is likely to expand into when the time comes.

And a new tape library. It's an HP StorageWorks MSL4048. It is LTO4, fibre-attached, and has lots of slots. The native capacity of this guy is 37.5TB which is a whole lot larger than the 7.9TB of our current SDLT320 unit. It only has two drives for now which will limit flexibility somewhat, but it is upgradeable to four drives later when the money tree starts producing again. If we really need to, we can stack another MSL4048 on top of it for even more storage. Because it is drive-limited, we kind of have to stage all backups to disk and then copy from disk to tape; we won't be doing any backups directly to tape.

When it'll get here is anyone's guess. Purchasing is currently digging out of an avalanche of last-minute orders just like ours, so they're w-a-y backed up down there.

Spending money

Today we spent more money in one day than I've ever seen done here. Why? Well-substantiated rumor had it that the Governor had a spending freeze directive on her desk. Unlike last year's freeze, this one would be the sort passed down during the 2001-02 recession; nothing gets spent without OFM approval. Veterans of that era noted that such approval took a really long time, and only sometimes came. Office scuttle-butt was mum on whether or not consumable purchases like backup tapes would be covered.

We cut purchase orders today and rushed them through Purchasing. A Purchasing department that was immensely snowed under, as can well be expected. I think final signatures get signed tomorrow.

What are we getting? Three big things:

  1. A new LTO4 tape library. I try not to gush lovingly at the thought, but keep in mind I've been dealing with SDLT320 and old tapes. I'm trying not to let all that space go to my head. 2 drives, 40-50 slots, fibre attached. Made of love. No gushing, no gushing...
  2. Fast, cheap storage. Our EVA6100 is just too expensive to keep feeding. So we're getting 8TB of 15K fast storage. We needs it, precious.
  3. Really cheap storage. Since the storage area networking options all came in above our stated price-point, we're ending up with direct-attached. Depending on how we slice it, between 30-35TB of it. Probably software iSCSI and all the faults inherent in that setup. We still need to dicker over software.

But... that's all we're getting for the next 15 months at least. Now when vendors cold call me I can say quite truthfully, "No money, talk to me in July 2011."

The last thing we have pending is an email archiving system. We already know what we want, but we're waiting on determination of whether or not we can spend that already ear-marked money.

Unfortunately, I'll be finding out a week from Monday. I'll be out of the office all next week. Bad timing for it, but can't be avoided.