When perfection is the standard

The disaster recovery infrastructure is an area where perfection is the standard, and anything less is a fault that needs fixing. It shares this distinction with fields like air traffic control and sports officiating. In any such area, a failure of any kind makes people wince. There are ways to manage around faults, but there really shouldn't be faults in the first place.

In ATC there are constant cross-checks and procedures to ensure that true life-safety faults happen only after a series of faults. In sports officiating, 'instant replay' rules help officials see what actually happened from angles other than their own, all as a way to improve the results. In DR, any time a backup or replication process fails, it leaves an opening through which major data loss can occur. Each of these has its unavoidable "Oh *****" moments, which leads to frustration when they happen too often.

At my old job we had taken some paperwork steps towards documenting DR failures. We didn't have anything like a business-continuity process, but we did have tape backup. When backups failed, there was a form that needed to be filled out and filed, explaining why the fault happened and what could be done to keep it from happening again. I filled out a lot of those forms.

Yeah, perfection is the standard for backups. We haven't come even remotely close to perfection for many, many months. Some of it is simple technology faults, like DataProtector and NetWare needing tweaking to talk to each other well, or over-used tape drives giving up the ghost and requiring replacement. Some of it is people faults, like forgetting to change out the tapes on Friday so all the weekend fulls fail due to a lack of scratch media. Some of it is management process faults, like discovering the sole tape library fell off support and no one noticed. Some of it is marketplace faults, like discovering the sole tape library will be end-of-lifed by the vendor in 10 months. Some of these haven't happened yet, but they are areas that can fail.

If the stimulus fairy visits us, backup infrastructure is top of the list for spending.