Monday, December 28, 2009

Bad tapes

It seems that HP Data Protector and BackupExec 10 have different opinions on what constitutes a bad tape. BackupExec seems to survive them better. This means that as we cycle old media into the new Data Protector environment we're getting the occasional bad tape. We've been averaging 3 bad tapes per 40 tape rotation.

While that may not sound like a lot, it really is. Our very large backups are extremely vulnerable to bad tapes, since all it takes is one bad tape to kill an entire backup session. When you're doing a backup of 1.3TB of data, you don't want those backups to fail.

Take that 1.3TB backup. We're backing up to SDLT320 media, so we're averaging somewhere between 180GB and 220GB a tape depending on what kinds of files are being backed up. So that's 7-8 tapes for this one backup. How likely is it that this 7 to 8 tape backup will include at least one of the 3 bad tapes?

When the first tape is picked the chance is 3 in 40 (7.5%).
When the second tape is picked, assuming the first tape was good, the chance is 3 in 39 (7.69%).
When the third tape is picked, presuming the first two were good, the chance is 3 in 38 (7.89%).
When the 7th tape is picked, presuming the first six were good, the chance has increased to 3 in 34 (8.82%).

8.82% doesn't sound like much. However, the risk accumulates with every tape picked. Simply adding up the per-tape chances overstates it, since each of those chances only applies if all the earlier tapes were good. The easier way to get the true probability is to compute the chance that all 7 tapes are good and subtract that from 1:

1 - (37/40)*(36/39)*(35/38)*(34/37)*(33/36)*(32/35)*(31/34) = 1 - 0.5522 = 0.4478 or 44.78%

So with 3 bad tapes in a given 40 tape set, the chance of this one 7 tape backup having at least one of them in the tape set is nearly 45%. For an 8 tape backup the probability increases to 49.80%.

The real-world probability is a different number, since these backups are taken concurrently with other backups. So when the 7th tape gets picked, the number of available tapes is much less than 34, and the number of bad tapes still waiting to be found may not be 3. Also, these backups are multiplexed, so the true tape set may be as high as 9 tapes for this backup if that one backup target is slow in sending data to the backup server.

So the real-world probability is not 44.78%; it changes on a week-to-week basis. However, 44.78% (or about 50% for 8 tapes) is a good baseline. Some weeks it'll be a lot more. Others, such as weeks where the bad tapes are found by other processes and the target server is streaming fast, less.
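As a cross-check on that baseline, here is a minimal Python sketch that computes the exact chance of drawing at least one bad tape, alongside the running sum of the per-pick chances. The sum comes out higher because each per-pick chance only applies when all earlier picks were good. The function names are made up for illustration, and the sketch assumes tapes are picked uniformly at random from the 40 with nothing else pulling from the pool, which is not how a real week goes.

```python
from fractions import Fraction
from math import comb


def p_at_least_one_bad(total, bad, picked):
    """Exact chance that `picked` tapes, drawn without replacement from
    `total` tapes of which `bad` are bad, include at least one bad tape."""
    good = total - bad
    # Chance that every picked tape is good, then take the complement.
    # comb() returns 0 when picked > good, which gives a probability of 1.
    return 1 - Fraction(comb(good, picked), comb(total, picked))


def summed_per_pick(total, bad, picked):
    """Running sum of the per-pick chances (3/40 + 3/39 + ...).
    This overcounts, so treat it as an upper bound, not a probability."""
    return sum(Fraction(bad, total - i) for i in range(picked))


# Assumed scenario from the post: 40 tapes in rotation, 3 of them bad.
for n in (7, 8, 9):
    exact = float(p_at_least_one_bad(40, 3, n))
    summed = float(summed_per_pick(40, 3, n))
    print(f"{n} tapes: exact {exact:.2%}, summed {summed:.2%}")
```

For 40 tapes with 3 bad, this gives an exact figure of 44.78% for a 7 tape backup and 49.80% for an 8 tape backup, against summed figures of 56.92% and 66.01%. Changing the first two arguments is a quick way to see how the odds move when other backups have already used up some of the pool.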

We have a couple more weeks until we've cycled through all of our short-retention media. At that point our error rate should drop a lot. Until then, it's like dodging artillery shells.
