Disk failure rates

My boss forwarded us this article:

Opinion: Real-world disk failure rates offer surprises.

Apparently a pair of studies of large disk populations have been released. Each covers over 100,000 disk drives and looks at real-world failure rates. What they show is that the MTBF figures reported by drive manufacturers don't hold up in practice, along with several other surprises.

Remember the long-standing sysadmin wisdom that you get a few drive failures within the first few months, then not many, then more as the drives age? Bunk. The real failure curve over time doesn't look like that at all.

The study also shows that the real-world MTBF of SATA drives is no different from that of SCSI drives, and SCSI is no different from Fibre Channel. They also see failure rates increasing significantly after 3 years of age, not the 5 years that the datasheet MTBF numbers would suggest.

Another finding is that SMART errors do correlate with a much greater chance of failure in the near term, but such drives still have a solid chance of running for another year without a hitch. That said, many failures are not presaged by SMART errors at all. Customers with massive RAID systems (think RAID 6) may not care much about SMART warnings, since internal redundancy renders such predictive failures moot. On the other hand, home users really should replace drives after the first SMART error.
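If you want to see whether a drive is reporting SMART trouble yourself, here's a minimal sketch that shells out to smartctl from smartmontools. It only checks the drive's overall health flag, and the device path /dev/sda is just a placeholder; treat it as an illustration, not a monitoring tool.

```python
import subprocess

def smart_health_ok(device="/dev/sda"):
    """Return True if the drive reports an overall SMART health of PASSED.

    Needs smartmontools installed and usually root privileges.
    """
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True,
        text=True,
    )
    # ATA drives print a line like:
    #   SMART overall-health self-assessment test result: PASSED
    # This is only the overall flag; `smartctl -A` lists the individual
    # attributes (reallocated sectors, etc.) behind it.
    return "PASSED" in result.stdout

if __name__ == "__main__":
    dev = "/dev/sda"  # placeholder; point this at your actual drive
    status = "looks OK" if smart_health_ok(dev) else "time to think about a replacement"
    print(f"{dev}: {status}")
```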

Another interesting finding is that environmental temperature barely affects drive failure rates, within reason. Run too cold, under 17C (63F), and failure rates increase. Run too hot, and "hot" here is really hot, and failure rates climb as well. Other systems go kablam at high temperatures anyway, so disk failures are not the top thing to worry about if you have a "heat event" in your datacenter.

As for failure rates, the study uses what it calls the Annualized Replacement Rate (ARR): the likelihood of any given disk being replaced in a given year of its nominal 5-year lifespan. The observed ARR came to about 3%, whereas the datasheet numbers would put it under 1%. The observed ARR varies markedly from site to site, and the study did not theorize as to why. As an anecdote, one dataset had drives that were 7 years old at the end of the study, and that population had an ARR of 24%.
Observation 1: Variance between datasheet MTTF and disk replacement rates in the field was larger than we expected. The weighted average ARR was 3.4 times larger than 0.88%, corresponding to a datasheet MTTF of 1,000,000 hours.

Observation 2: For older systems (5-8 years of age), data sheet MTTFs underestimated replacement rates by as much as a factor of 30.

Observation 3: Even during the first few years of a system's lifetime (< 3 years), when wear-out is not expected to be a significant factor, datasheet MTTFs still underestimated replacement rates in the field.

Observation 4: In our data sets, the replacement rates of SATA disks are not worse than the replacement rates of SCSI or FC disks. This may indicate that disk-independent factors, such as operating conditions, usage and environmental factors, affect replacement rates more than component specific factors. However, the only evidence we have of a bad batch of disks was found in a collection of SATA disks experiencing high media error rates. We have too little data on bad batches to estimate the relative frequency of bad batches by type of disk, although there is plenty of anecdotal evidence that bad batches are not unique to SATA disks.
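The 0.88% figure in Observation 1 is just the datasheet MTTF converted into an expected annual failure rate, under the usual constant-failure-rate reading of MTTF. A quick back-of-the-envelope sketch with the study's numbers (the function names here are mine, purely for illustration):

```python
HOURS_PER_YEAR = 24 * 365  # 8,760

def nominal_afr(mttf_hours):
    """Annual failure rate implied by a datasheet MTTF (as a fraction)."""
    return HOURS_PER_YEAR / mttf_hours

def implied_mttf(arr):
    """Effective MTTF (hours) implied by an observed annualized replacement rate."""
    return HOURS_PER_YEAR / arr

# A datasheet MTTF of 1,000,000 hours works out to ~0.88% per year,
# the baseline figure quoted in Observation 1.
print(f"datasheet AFR: {nominal_afr(1_000_000):.2%}")

# The observed ~3% ARR, run backwards, implies an effective MTTF
# of roughly 292,000 hours.
print(f"MTTF implied by 3% ARR: {implied_mttf(0.03):,.0f} hours")
```

Run the same conversion on the 24% ARR anecdote above and the implied effective MTTF drops to the tens of thousands of hours, which gives a feel for how far the field numbers can drift from the datasheet.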
Very interesting stuff! Studies like these could lead to new ways of labeling drives.