A real-world example of SSD endurance

Last year I created a commodity-hardware-based storage system. It was cheap, and I had confidence in the software that made it work. Still do, in fact.

I built the thing for massive scaling because that's just plain smart. We haven't hit the massive-scaling part of the growth curve yet, but it's a lot closer now than it was last year. So I thought big.

The base filesystem I chose is XFS, since I have experience with it, and it's designed from the bolts out for big. ZFS wasn't an option for a couple of reasons, and BTRFS wasn't mature enough for me to bet the business on it. One of the quirks of XFS is that it can bottleneck on journal and superblock writes, so I had to ensure that wouldn't get in the way.

Easy!

Put the XFS journal on an external device based on an SSD! Blazing fast writes, won't get in the way. Awesome.
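
Setting that up is a two-liner; a minimal sketch, with hypothetical device names (data partition on /dev/sda1, journal partition on the SSD at /dev/sdb1, mount point /srv/data):

$ # build the filesystem with its log on the external SSD partition
$ sudo mkfs.xfs -l logdev=/dev/sdb1 /dev/sda1
$ # the log device has to be named again at mount time
$ sudo mount -o logdev=/dev/sdb1 /dev/sda1 /srv/data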

But how to ensure that SSD could survive what was in essence a pure-write workload?

The decision cascade kind of went like this:

  1. Huh, pure-write is hard on SSDs. We'll hit endurance problems pretty fast that way.
  2. If endurance is a problem, I should look at SLC-based SSDs, not MLC, what with SLC cells being able to endure hundreds of thousands to millions of writes.
  3. A filesystem journal is weeny; I don't need a massive device for that. A 50GB one would be more than enough, and I could even put the OS partition on it. Yay affordability!

Made total sense!

Now 10ish months later, why don't I take a look at the actual performance data as reported by SMART attributes. The attribute I'm most interested in is "Program/Erase Cycles", which records the number of full-disk writes that have happened to the device.
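
Pulling the attribute table is a one-liner; a sketch, assuming /dev/sda is one of these SSDs (the exact attribute name and ID are vendor-specific, so the grep pattern here is a guess that happens to catch several common spellings):

$ sudo smartctl -A /dev/sda | grep -i -E 'erase|cycle|wear'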

In the 10ish months since these devices went into service, the average P/E Cycle count is a whopping 726. That's more than one a day, but less than two; call it 1.3 full-device writes a day.

Er, hm. Clearly I'm not dealing with as many writes as I thought I would be. In fact, I'm about two orders of magnitude under my engineering maximum, since these drives are rated at 50PB of total write endurance. Doing the math, 50PB across a 50GB device is 1,000,000 full-device writes, which means each cell would have to be written 547 times a day for 5 years to reach that.

SLC was w-a-y overkill for this application, much cheaper MLC would have been fine.

That said, between one and two full-writes a day is still a good amount of activity. A dumb MLC device with only 3,000 P/E cycles per cell and no fancy controller magic to minimize P/E cycles would hit that line in... well, math-time:

  • 3,000 P/E cycles until death.
  • 1.3 full-writes per day.
  • 3,000 / 1.3 ≈ 2,308 days ≈ 6.3 years.

Huh.

Of course, write amplification is going to be a factor here, but figuring that out is something I'll cover later.
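
The lifetime math is easy to re-run under different assumptions, though. A sketch; the write-amplification factor of 2 here is a placeholder I picked for illustration, not a measurement:

$ awk 'BEGIN { pe=3000; fwpd=1.3; wa=2; days=pe/(fwpd*wa); printf "%.0f days = %.1f years\n", days, days/365 }'
1154 days = 3.2 years

Even with a pessimistic write-amplification guess, a dumb device has a few years of life in it at current utilization.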

That dumb MLC device would last 6.3 years at current utilization, but would die a lot sooner once we hit massive scaling and its corresponding increase in utilization. So dumb MLC isn't an option. What about the enterprisey MLC offerings out there? What do they offer for endurance these days?

Vendor   Line         Endurance
STEC     s840         10x/day for 5 years (18,250)
Seagate  600 Pro SSD  1080TB (2,764)
Seagate  1200 SSD     7300TB (18,688)
Samsung  SM843        1PB (8,738)
Samsung  SM843T       2.4PB (20,971)

The number in parens is the calculated write endurance, in P/E cycles, of the cells in question.

Clearly, the Seagate 600 Pro SSD isn't up to the task, but everything else would be. Of course, this can be gamed. If I got a 600 Pro SSD that was 4x larger than I need, I'd have enough spare cells around that the effective endurance matches or exceeds that of the Samsung SM843 (thanks to load-spreading; more unallocated blocks means more spare area to spread writes across).
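
The spare-area math is first-order simple (and ignores whatever cleverness the controller adds): 4x the cells for the same write volume means roughly 4x the effective full-device writes.

$ echo '2764 * 4' | bc
11056

11,056 effective cycles clears the SM843's 8,738 with room to spare.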

When we get to the point of making more nodes to scale out this system, I won't be reaching for the SLC devices again. MLC is good enough!

1 Comment

You should be able to directly ask the SSD how far along it is in its lifetime with respect to P/E cycles. You just need a recent-ish version of smartmontools:


$ sudo smartctl -l ssd /dev/sda
smartctl 6.1 2013-01-13 r3746 [x86_64-linux-3.2.0-45-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

Device Statistics (GP Log 0x04)
Page Offset Size         Value  Description
  7  =====  =                =  == Solid State Device Statistics (rev 1) ==
  7  0x008  1                3~ Percentage Used Endurance Indicator
                                |_ ~ normalized value