Bad tapes

It seems that HP Data Protector and BackupExec 10 have different opinions on what constitutes a bad tape. BackupExec seems to survive them better. This means that as we cycle old media into the new Data Protector environment we're getting the occasional bad tape. We've been averaging 3 bad tapes per 40 tape rotation.

While that may not sound like a lot, it really is. Our very large backups are extremely vulnerable to bad tapes, since all it takes is one bad tape to kill an entire backup session. When you're doing a backup of 1.3TB of data, you don't want those backups to fail.

Take that 1.3TB backup. We're backing up to SDLT320 media, so we're averaging somewhere between 180GB and 220GB a tape depending on what kinds of files are being backed up. So that's 7-8 tapes for this one backup. How likely is it that this 7 to 8 tape backup will include at least one of the 3 bad tapes?

When the first tape is picked the chance is 3 in 40 (7.5%).
When the second tape is picked, assuming the first tape was good, the chance is 3 in 39 (7.69%).
When the third tape is picked, presuming the first two were good, the chance is 3 in 38 (7.89%).
When the 7th tape is picked, presuming the first six were good, the chance has increased to 3 in 34 (8.82%).

8.82% doesn't sound like much. However, the risk compounds with every tape picked. Simply adding the per-pick chances overstates it, since each of those chances assumes every earlier pick was good. The chance of hitting at least one bad tape is one minus the chance that all seven picks are good:

1 - (37/40)(36/39)(35/38)(34/37)(33/36)(32/35)(31/34) = 0.44777 or 44.78%

So with 3 bad tapes in a given 40 tape set, the chance of this one 7 tape backup having at least one of them in the tape set is nearly 45%. For an 8 tape backup the probability increases to 49.80%.

The true probability is a different number, since these backups are taken concurrently with other backups. So when the 7th tape gets picked, the number of available tapes is much less than 34, and the number of bad tapes still waiting to be found may not be 3. Also, these backups are multiplexed, so the true tape set may be as high as 9 tapes for this backup if that one backup target is slow in sending data to the backup server.

So the true probability is not 44.78%; it changes on a week to week basis. However, 44.78% (or about 50% for 8 tapes) is a good baseline. Some weeks it'll be a lot more. Others, such as weeks where the bad tapes are found by other processes and the target server is streaming fast, less.
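
For anyone who wants to plug in their own rotation, here is a rough Python sketch of the baseline calculation. It assumes tapes are picked at random without replacement from a single pool, which (as noted above) isn't quite what happens when other backups are running at the same time. The function name chance_of_bad_tape is just something made up for this example.

```python
from fractions import Fraction

def chance_of_bad_tape(total, bad, picked):
    """Chance that at least one of `picked` tapes is bad, drawing
    without replacement from `total` tapes of which `bad` are bad."""
    p_all_good = Fraction(1)
    for k in range(picked):
        # Good tapes left over all tapes left, given k good picks so far.
        p_all_good *= Fraction(total - bad - k, total - k)
    return 1 - p_all_good

# Baseline for a 40-tape rotation with 3 bad tapes.
for n in (7, 8, 9):
    print(f"{n} tapes: {float(chance_of_bad_tape(40, 3, n)):.2%}")
```

Using exact fractions avoids any rounding drift until the final conversion to a percentage. The 9 tape case is there because multiplexing can stretch the tape set that far.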

We have a couple more weeks until we've cycled through all of our short-retention media. At that point our error rate should drop a lot. Until then, it's like dodging artillery shells.

4 Comments

Have you considered using a RAIT type setup? It seems specifically designed for your situation. It's essentially RAID-5 (or 3, depending on the implementation) except with tapes.

I've seen that technology around, but neither HP Data Protector nor our tape library supports RAIT setups. Otherwise, it'd be exactly what we'd need to survive this first onslaught of bad media.

Or perhaps time to swap out to a different format. Granted you are only postponing the issue until your data grows again, but there are other benefits: speed, fewer media required, etc. LTO4 performs rather nicely in our environment. Can clock up 800GB to 1.6TB (at full 2:1) and will happily do it in a few hours - provided your source is fast enough.

Oh, we need a new tape library. That much we already know. Due to the budget crisis, prying loose funds to pay for one is going to take a LOT of work on the part of the managers above me. The absolute earliest we could get funds would be July. Which won't help the problem I'll have in the next two weeks. Us techs really really really wanted to go LTO when we bought this library. However, the LTO cost came in about $1200 over the absolute budget ceiling for a replacement unit, so we had to go SDLT320 if we were going to get anything at all.