Multi-disk failures: follow-up

By far the biggest criticisms of that piece are the following two ideas.

That's what the background scan process is for. It comes across a bad sector, it reallocates the block. That gets rid of the bad block w-a-y early so you don't ever actually get this problem.

And

That never happens with ZFS. It checksums blocks so it'll even recover the lost data as it's reallocating it.

Which are both very true. That's exactly what those background scanning processes are for, to catch this exact kind of bit-rot before it gets bad enough to trigger the multi-disk failure case I illustrated. Those background processes are important.
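
To make that concrete, here is a minimal sketch of what one pass of such a background scrub does. The disk object and every method on it (read_sector, checksum_ok, rebuild_from_redundancy, reallocate) are hypothetical stand-ins for illustration, not any real firmware or ZFS interface:

    # Minimal sketch of one background scrub pass. The `disk` object and
    # all of its methods are hypothetical stand-ins, not a real API.
    def scrub_pass(disk, sector_count):
        for lba in range(sector_count):
            data = disk.read_sector(lba)        # may retry internally
            if data is None:
                # Unreadable: rebuild the contents from parity/mirror
                # copies, then write them to a fresh (reallocated) sector.
                disk.reallocate(lba, disk.rebuild_from_redundancy(lba))
            elif not disk.checksum_ok(lba, data):
                # Readable but wrong: the case block checksums (ZFS-style)
                # catch. Silent corruption gets repaired from redundancy too.
                disk.reallocate(lba, disk.rebuild_from_redundancy(lba))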

Even so, they also have their own failure modes.

  • Some only run when externally initiated I/O is quiet, which never happens for some arrays.
  • Some run constantly, but at low I/O priority. So for very big storage systems, each GB of space may only get scanned once a month, if that often.
  • Some run just fine, thank you; they're just built wrong.
    • They only mark a sector as bad if it completely fails to read; sectors that read just fine after the 1st or 2nd retry are passed.
    • They use an ERROR_COUNTER with thresholds set too high.
    • Successful retry-reads don't increment ERROR_COUNTER.
    • Scanning I/O doesn't use the same error-recovery heuristics as Recovery I/O. If Recovery I/O rereads a sector 16 times before declaring defeat, but Scanning only tries 3 times, you can hit an ERROR_COUNTER overflow during a RAID Recovery you didn't expect (see the sketch after this list).
  • Some are only run on-demand (ZFS), and, well, never get run. Or get run rarely because they're expensive.
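
To illustrate that retry-heuristics point (and the non-incrementing counter two bullets up), here is a toy model in Python. Every number in it is invented: 3 vs 16 retries, the error threshold, the fraction of marginal sectors, and their per-attempt success rate. It sketches the mechanism, not any particular array's behavior:

    import random

    RETRIES_SCRUB    = 3     # background scan gives up quickly
    RETRIES_RECOVERY = 16    # rebuild I/O tries much harder
    ERROR_THRESHOLD  = 500   # counter value at which the disk gets failed

    def scrub_errors(sectors):
        # Background scan: only a sector that fails every retry is counted;
        # sectors that eventually read fine don't touch the counter at all.
        errors = 0
        for p in sectors:
            if not any(random.random() < p for _ in range(RETRIES_SCRUB)):
                errors += 1
        return errors

    def recovery_errors(sectors):
        # RAID recovery: retries much harder, and every failed attempt
        # bumps ERROR_COUNTER on the way to an eventual success.
        errors = 0
        for p in sectors:
            for _ in range(RETRIES_RECOVERY):
                if random.random() < p:
                    break
                errors += 1
        return errors

    random.seed(1)
    # 100,000 sectors, 1% of them "marginal": each read attempt only
    # succeeds 40% of the time, but a few retries usually get through.
    sectors = [0.4 if random.random() < 0.01 else 0.9999
               for _ in range(100_000)]

    for label, errors in (("scrub", scrub_errors(sectors)),
                          ("recovery", recovery_errors(sectors))):
        print(f"{label:8s}: ERROR_COUNTER={errors:5d}"
              f"  disk failed={errors > ERROR_THRESHOLD}")

The scrub looks clean, but the moment a rebuild starts hammering those same marginal sectors with recovery-grade retries, the counter blows past the threshold and a "healthy" disk gets failed mid-rebuild.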

I mentioned that I'd seen this kind of fault recently. I have. My storage systems use exactly these background scanning processes, and it still happened to me.

Those background scanning processes are not perfect, even ZFS's. It's a balance between the ultimate paranoia of "if there is any error, ever, fail it!" and the prudence of "rebuilds are expensive, so only do them when we need to." Where your storage systems fall on that continuum is something you need to be aware of.

Disks age! Bad blocks tend to come in groups, so if each block is only getting scanned every few weeks, or worse every other month, a bad spot can take a disk out well before the scanning process detects it. This is the kind of problem that a system with 100 disks faces; back when it was a 24-disk system things worked fine, but as it grew and I/O loads increased, those original 24 disks aren't scanned as often as they should be.
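
As a back-of-the-envelope illustration (the capacities and throughputs below are invented; plug in your own numbers), the scan interval is just capacity divided by whatever bandwidth the scrub actually gets after production I/O, and it stretches badly as the scrub gets starved:

    def full_pass_days(disk_tb, scrub_mb_per_sec):
        # Days for a low-priority scrub to touch every block on one disk,
        # given the bandwidth left over after production I/O.
        return disk_tb * 1e12 / (scrub_mb_per_sec * 1e6) / 86_400

    # Lightly loaded array: the scrub gets a healthy slice of each disk.
    print(full_pass_days(disk_tb=2, scrub_mb_per_sec=20))    # ~1.2 days
    # Big, busy array: the scrub is squeezed to a trickle per disk.
    print(full_pass_days(disk_tb=2, scrub_mb_per_sec=0.4))   # ~58 days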


As I said at the end of the piece, this only touches on one way you can get multi-disk failures. There are others, definitely.