Recently in sata Category

Multi-disk failures: follow-up

By far the biggest criticisms of that piece are the following two ideas.

That's what the background scan process is for. It comes across a bad sector, it reallocates the block. That gets rid of the bad block w-a-y early so you don't ever actually get this problem.

And

That never happens with ZFS. It checksums blocks so it'll even recover the lost data as it's reallocating it.

Which are both very true. That's exactly what those background scanning processes are for, to catch this exact kind of bit-rot before it gets bad enough to trigger the multi-disk failure case I illustrated. Those background processes are important.

Even so, they also have their own failure modes.

  • Some only run when externally initiated I/O is quiet, which never happens for some arrays.
  • Some run constantly, but at low I/O priority. So for very big storage systems, each GB of space may only get scanned once a month if that often.
  • Some run just fine, thank you; they're just built wrong.
    • They only mark a sector as bad if it completely fails to read; sectors that read just fine after the 1st or 2nd retry are passed.
    • They use an ERROR_COUNTER with thresholds set too high.
    • Successful retry-reads don't increment ERROR_COUNTER.
    • Scanning I/O doesn't use the same error-recovery heuristics as Recovery I/O. If Recovery I/O rereads a sector 16 times before declaring defeat, but Scanning only tries 3 times, you can hit an ERROR_COUNTER overflow during a RAID Recovery you didn't expect.
  • Some are only run on-demand (ZFS), and, well, never are. Or are run rarely because it's expensive.
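To make the "built wrong" cases a bit more concrete, here is a minimal sketch of how a scanner and a recovery pass can disagree about the same disk. Everything in it is an assumption for illustration: the function names, the threshold of 8, and especially the idea that Recovery I/O counts every failed attempt while the scanner counts only complete failures. It is not any vendor's actual firmware logic.

```python
from typing import Sequence

# Hypothetical model, not any vendor's firmware logic.
# Each sector is described by how many read attempts it needs before a
# read succeeds (1 = healthy, 3 = succeeds on the third try).


def scan_error_count(attempts_needed: Sequence[int], max_tries: int = 3) -> int:
    """Scanner policy: only a sector that fails every try bumps the counter.

    Successful retry-reads do not increment ERROR_COUNTER.
    """
    return sum(1 for n in attempts_needed if n > max_tries)


def recovery_error_count(attempts_needed: Sequence[int], max_tries: int = 16) -> int:
    """Assumed recovery policy: every failed attempt bumps the counter."""
    return sum(min(n - 1, max_tries) for n in attempts_needed)


ERROR_THRESHOLD = 8  # assumed value, chosen only for illustration

# A mostly healthy disk with a handful of marginal sectors.
marginal_disk = [1] * 1000 + [2, 2, 3, 3, 3, 2]

scan_errors = scan_error_count(marginal_disk)
recovery_errors = recovery_error_count(marginal_disk)

print(f"scanner sees {scan_errors} errors, recovery sees {recovery_errors}")
print("disk failed during recovery:", recovery_errors >= ERROR_THRESHOLD)
```

Under these assumed policies the scanner passes the disk with a clean counter, while the recovery pass racks up enough failed attempts on the same sectors to cross the threshold.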

I had mentioned I had seen this kind of fault recently. I have. My storage systems use just these background scanning processes, and it still happened to me.

Those background scanning processes are not perfect, even ZFS's. It's a balance between the ultimate paranoia of "if there is any error ever, fail it!" and the prudence of "rebuilds are expensive, so only do them when we need to." Where your storage systems fall on that continuum is something you need to be aware of.

Disks age! Bad blocks tend to come in groups, so if each block is only getting scanned every few weeks, or worse every other month, a bad spot can take a disk out well before the scanning process detects it. This is the kind of problem that a system with 100 disks faces; back when it was a 24-disk system things worked fine, but as it grew and I/O loads increased, those original 24 disks aren't scanned as often as they should be.
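A back-of-the-envelope sketch of that growth problem, with made-up numbers: 1TB disks and a single shared background-scan budget of 20MB/s that does not grow with the array. Neither figure comes from my systems; they are assumptions picked to show the shape of the problem.

```python
SECONDS_PER_DAY = 86_400


def full_scan_days(disk_count: int,
                   disk_bytes: float = 1e12,          # assumed 1TB disks
                   scan_bytes_per_sec: float = 20e6   # assumed shared 20MB/s budget
                   ) -> float:
    """Days for the background scanner to touch every block once."""
    total_bytes = disk_count * disk_bytes
    return total_bytes / scan_bytes_per_sec / SECONDS_PER_DAY


for disks in (24, 100):
    print(f"{disks} disks: one full pass every {full_scan_days(disks):.1f} days")
```

With those assumptions the 24-disk system gets a full pass about every two weeks, and the 100-disk system gets one roughly every two months.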


As I said at the end of the piece, this only touches on one way you can get multi-disk failures. There are others, definitely.

An old theme made new

Yesterday on Slashdot was a link to an article that sounds a lot like one I published two years ago tomorrow. The main point in the article is that, given the unrecoverable-read-error rate of your standard SATA drive (1 in 10^14 bits, or 12.5TB) and the ever-increasing sizes of SATA drives, RAID 5 arrays can get to 12.5TB pretty quickly. Heck, high-end home media servers chock full of HD content can get there very fast.

While it doesn't say this in the specs page for that new Seagate drive, if you look on page 18 of the accompanying manual you can see the "Nonrecoverable read error" rate of the same 10^14 as I talked about two years ago. So, no improvement in reliability. However.... For their enterprise-class "Savvio" drives, they list a "Nonrecoverable Read Error" rate of 10^16 (1 in 1.25PB), which is better than the 10^15 (125TB) they were doing two years ago on their FC disks. So clearly, enterprise users are juuuust fine for large RAID5 arrays.
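For anyone who wants to check the conversions, here is a small sketch that turns a "1 error in 10^N bits" spec into the amount of data read per expected error. The function names are mine, and it uses decimal terabytes (10^12 bytes) the way drive vendors do.

```python
def bytes_per_unrecoverable_error(exponent: int) -> int:
    """Bytes read, on average, per nonrecoverable read error at 1 in 10^exponent bits."""
    return 10**exponent // 8


def as_terabytes(byte_count: int) -> float:
    """Decimal terabytes (10^12 bytes), matching drive-vendor units."""
    return byte_count / 10**12


for exponent in (14, 15, 16):
    tb = as_terabytes(bytes_per_unrecoverable_error(exponent))
    print(f"1 in 10^{exponent} bits is about one error per {tb:g}TB read")
```

That gives 12.5TB, 125TB and 1250TB (1.25PB), matching the figures above.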

As I said before, the people who are going to be bitten by this will be home media servers. Also, whiteboxed homebrew servers for small/medium businesses will be at risk. So those of you who have to justify buying the really expensive disks, when there are el-cheepo 1.5TB drives out there? You can use this!

That darned 32-bit limit

Today I learned that the disk-space counters NetWare provides in SNMP are signed integers. These stats are published in a table at OID .1.3.6.1.4.1.23.2.28.2.14.1.3. Having just expanded our FacShare volume past 2TB, we saw it go to negative space according to the monitors. It's a simple integer overflow, since apparently Novell is using a signed integer for a number that can never legitimately be negative.

I've pointed this out in an enhancement request. This being NetWare, they may not choose to fix it if it is more than a two-line fix. We'll see.

This also means that volumes over 4TB cannot be effectively monitored with SNMP, since even an unsigned 32-bit counter would wrap at that point. Since NSS can have up to 8TB volumes on NetWare, this could potentially be a problem. We're not there yet.
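Here is a minimal sketch of the wraparound. It assumes the counter reports space in kilobytes, which is my inference from the 2TB and 4TB break points and not something I've confirmed in the MIB; the helper names are hypothetical.

```python
KIB_PER_TIB = 2**30  # assumes the SNMP counter is in kilobytes (an inference)


def as_int32(value: int) -> int:
    """Interpret the low 32 bits of value as a signed two's-complement integer."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value & 0x8000_0000 else value


def as_uint32(value: int) -> int:
    """Interpret the low 32 bits of value as an unsigned integer."""
    return value & 0xFFFFFFFF


volume_kib = int(2.5 * KIB_PER_TIB)   # a 2.5TB volume
print(as_int32(volume_kib))           # negative: what the monitor reports
print(as_uint32(volume_kib))          # the real size, recoverable below 4TB

big_volume_kib = 5 * KIB_PER_TIB      # a 5TB volume
print(as_uint32(big_volume_kib) // KIB_PER_TIB)  # looks like 1TB: not recoverable
```

Between 2TB and 4TB a monitoring script can work around the problem by reinterpreting the value as unsigned. Past 4TB the high bits are simply gone, so there is nothing to recover.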

Anandtech ran an article recently about enterprise storage. In it they go over SATA vs. SCSI vs. SAS. Most of it I already knew, but towards the back was a kernel of information that I hadn't caught before.

We know that generally speaking SATA drives can't quite keep up with the same kind of workloads that SCSI can. Differences in the manufacturing process, quality control, and the like. I don't fully understand it, which irks me, but there it is. One of those areas is something called 'nonrecoverable read error' rate.

Take a look at this Seagate drive. It's almost the last thing on the spec page. The Nonrecoverable Read Error rate is 1 bit in 10^14 bits, or 1 bad read in 12.5TB. Mainline SCSI and FC drives are rated better, at 1 bit in 10^15 or 10^16. On average, every 12.5TB of data transferred from the SATA drive includes a corrupted bit.

We don't see this as a problem in most enterprise situations because they all run in some form of redundant array setup. RAID5 drivers, usually in the RAID controller, see the bad bit and go to the parity data to fill in the real value. RAID1 drivers go to the mirror. No biggie. The problem comes with RAID5 rebuilds, when the entire array is read in order to regenerate the missing data. If you have 14 500GB drives in your RAID5 array, that means during a rebuild you transfer around 7TB of data. If a bad bit shows up during the rebuild process (7TB against one error per 12.5TB works out to roughly a 56% chance), game over. That's a from-tape rebuild.
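The rebuild arithmetic can be sketched as follows. The 56% figure is the simple ratio of data read to data-per-error, which is really the expected number of errors. If you instead assume every bit fails independently at the spec rate (an assumption; real drives don't fail that neatly), the chance of hitting at least one error comes out somewhat lower. The function names are mine.

```python
import math


def expected_read_errors(bytes_read: float, bits_per_error: float = 1e14) -> float:
    """Simple ratio: bits read divided by bits per nonrecoverable error."""
    return bytes_read * 8 / bits_per_error


def chance_of_read_error(bytes_read: float, bits_per_error: float = 1e14) -> float:
    """P(at least one error), assuming independent per-bit errors at the spec rate."""
    bits = bytes_read * 8
    # 1 - (1 - 1/bits_per_error) ** bits, computed without losing precision
    return -math.expm1(bits * math.log1p(-1 / bits_per_error))


rebuild_bytes = 14 * 500e9  # 14 drives of 500GB, about 7TB read during a rebuild

print(f"expected errors:    {expected_read_errors(rebuild_bytes):.2f}")
print(f"chance of >= 1 hit: {chance_of_read_error(rebuild_bytes):.0%}")
```

Either way you slice it, the odds of a clean rebuild on an array that size are close to a coin flip.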

This is why systems such as RAID6 are showing up. That's a double parity system, so rebuilding one bad disk does not risk the whole array if a nonrecoverable read error occurs. You lose two disks to parity, but you can still have a 30-disk array without much risk.

One more reason why SATA isn't quite ready for realtime data applications. Nearline, yes, but not realtime. This'll play hob with our ideas for our BCC cluster.
