Recently in storage Category

The new era of big storage...


...is full of flash. And that changes things.

Not a surprise at all to anyone paying attention, but there it is. Flash is changing things in many ways:

  • Hybrid SSD+HD drives are now out there on the market, bringing storage tiering to the consumer space.
  • SSD is now kind of a standard for Laptops, or should be. The cheap option still has HD on it, but... SSD man. Just do it.
  • One SSD can fully saturate a 6Gb SATA or SAS link. This changes things:
    • A channel with 12 of those things is going to seriously under-utilize the individual drives (rough math in the sketch after this list).
    • There is no way a RAID setup (hardware, software, or ZFS) can keep up with parity calculations and still keep the drives performant, so parity RAID of any stripe is a bad choice.
    • A system with a hundred of these things on it, channeled appropriately of course, won't have enough system-bus speed to keep them fed.
  • Large scale enterprise systems are increasingly using a SSD tier for either caching or top-level tiering (not all solutions are created equal).
    • ZFS L2ARC + Log
  • They're now coming in PCIe flavors so you don't even have to bother with an HBA.
    • Don't have to worry about that SAS speed-limit anymore.
    • Do have to worry about how many PCIe slots you've got.
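
For the curious, here's the back-of-the-envelope math behind that channel-sharing point, as a quick Python sketch. The 500 MB/s per-SSD figure and the encoding overhead are assumptions on my part, not benchmarks of any particular drive.

```python
# Back-of-the-envelope math for "one SSD can saturate a 6Gb link".
# The per-SSD throughput is an assumed figure for a typical SATA-3 SSD.

LINK_GBPS = 6.0            # SATA-3 / SAS-2 line rate, gigabits per second
ENCODING_EFFICIENCY = 0.8  # 8b/10b encoding: 80% of the line rate carries data
SSD_MBPS = 500.0           # assumed sequential throughput of one SSD, MB/s

link_mbps = LINK_GBPS * 1000 * ENCODING_EFFICIENCY / 8  # ~600 MB/s usable

print(f"Usable link bandwidth: {link_mbps:.0f} MB/s")
print(f"One SSD at {SSD_MBPS:.0f} MB/s uses {SSD_MBPS / link_mbps:.0%} of the link")

for drives in (1, 12):
    per_drive = min(SSD_MBPS, link_mbps / drives)
    print(f"{drives:2d} drives on one channel: ~{per_drive:.0f} MB/s each "
          f"({per_drive / SSD_MBPS:.0%} of what each drive can do)")
```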

Way back in elder days, when Windows NT was a scrappy newcomer challenging the industry-dominant incumbent and said incumbent was making a mint selling certifications, I got one of those certifications to be a player in the job market (it actually helped). In studying for that certification I was exposed to a concept I had never seen before:

The Hierarchical Storage Management System.

NetWare had hooks for it. In short, it does for files what Storage Tiering does for blocks. Pretty easy concept, but it required some tricky engineering when the bottom layer of the HSM tree was a tape library[1]. All scaled-out (note, not distributed[2]) storage these days is going to end up using some kind of HSM-like system. At the very tippy-top you'll get your SSDs. They may even be in the next layer down as well. Spinning rust (disks) will likely form the tier that used to belong to spooling rust (tape), but they'll still be there.
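
If the file-level tiering idea is fuzzy, here's a toy sketch of it in Python. The tier names, the age thresholds, and the purely access-age-based policy are all invented for illustration; real HSM engines are considerably smarter than this.

```python
# A toy sketch of the file-level tiering idea above: files get demoted down
# the tier list as they go unread. Tier names, ages, and the policy are all
# made up for illustration.
import os
import time

# (tier name, keep a file here while its last access is within this many days)
TIERS = [
    ("ssd",           30),    # hot data
    ("spinning-rust", 365),   # warm data
    ("archive",       None),  # bottom tier; nothing gets demoted further
]

def tier_for(path: str, now: float = None) -> str:
    """Pick a tier for a file based purely on its last-access age."""
    now = now or time.time()
    age_days = (now - os.stat(path).st_atime) / 86400
    for name, max_age in TIERS:
        if max_age is None or age_days <= max_age:
            return name
    return TIERS[-1][0]

if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:
        print(f"{tier_for(path):>13}  {path}")
```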

And that tier? It can RAID5 all it wants. They may be 5-disk sets, but it'll have umpty different R5 sets to stripe across, so it's all good. The famous R5 write-penalty won't be a big issue, since this tier is only written to when the higher tier is demoting data. It's not like the HSM systems of yore where data had to be promoted to the top tier before it could even be read; we can read directly from the slow/crappy stuff now![3]
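
For reference, the write-penalty arithmetic, again as a rough Python sketch. The per-disk IOPS figure is an assumed number for a nearline drive, not a measurement.

```python
# Rough numbers behind the "R5 write-penalty won't be a big issue" claim.
# The per-disk IOPS figure is an assumption for a 7.2K nearline drive.

DISKS_PER_SET = 5           # the "5-disk sets" mentioned above (4 data + 1 parity)
IOPS_PER_DISK = 150         # assumed random IOPS for one nearline drive

raw_iops = DISKS_PER_SET * IOPS_PER_DISK

# A small random write costs 4 I/Os: read old data, read old parity,
# write new data, write new parity.
random_write_iops = raw_iops / 4

# A full-stripe write (what a demotion-only tier mostly sees) has all the
# data in hand, so parity is computed without the read-modify-write cycle.
full_stripe_data_fraction = (DISKS_PER_SET - 1) / DISKS_PER_SET

print(f"Raw IOPS across the set:           {raw_iops:.0f}")
print(f"Small random writes (4x penalty):  {random_write_iops:.0f} IOPS")
print(f"Full-stripe writes: {full_stripe_data_fraction:.0%} of raw throughput carries data")
```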

All-flash solutions will exist, and heck, are already on the market. Not the best choice for bulk-storage, which is why they're frequently paired with big deduplication engines, but for things like, say, being the Centralized Storage Array for a large VM (sorry, "private cloud") deployment featuring hundreds/thousands of nearly identical VMs... they pay off.

Spinning disks will stick around the way spooling tape has stuck around. Farther and farther from the primary storage role, but still very much used.


[1]: Yes, these systems really did have a tape drive as part of a random-access storage system. If you needed a file off of tape, you waited. Things were slower back then, OK? And let us not speak of what happened when Google Desktop showed up and tried to index 15 years worth of archival data, and did so on 200 end-user workstations within a month.

[2]: Distributed storage is another animal. The flash presence there is less convincing, but it'll probably happen anyway.

[3]: Remember that bit about Google Desktop? Well... "How did we go from 60% used to 95% used on the home-directory volumes in a week? OUR USERS HAVEN'T BEEN THAT USERY!!!" That's what happened. All those brought-from-archive files now landed on the precious, precious hard-drives. Pain teaches, and we figured out how to access the lower tiers.

Also, I'm on twitter now. Thanks for reading.

Last year I created a storage system based on commodity hardware. It was cheap, and I had confidence in the software that made it work. Still do, in fact.

I built the thing for massive scaling because that's just plain smart. We haven't hit the massive scaling part of the growth curve yet, but it's a lot closer now than it was last year. So I thought big.

The base filesystem I chose is XFS, since I have experience with it, and it's designed from the bolts out for big. ZFS wasn't an option for a couple of reasons, and BTRFS wasn't mature enough for me to bet the business on it. One of the quirks of XFS is that it can bottleneck on journal and superblock writes, so I had to ensure that wouldn't get in the way.

Easy!

Put the XFS journal on an external device based on a SSD! Blazing fast writes, won't get in the way. Awesome.
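
For the record, the setup looks roughly like this, sketched in Python so it can be dry-run first. The device names and mountpoint are hypothetical placeholders, not the ones from my build.

```python
# A dry-run sketch of building an XFS filesystem with its journal on an
# external SSD. Device names and mountpoint below are hypothetical.
import subprocess

DATA_DEV = "/dev/sdb"       # hypothetical: the big bulk-data device
LOG_DEV = "/dev/sdc1"       # hypothetical: small SSD partition for the journal
MOUNTPOINT = "/srv/bigfs"   # hypothetical
DRY_RUN = True              # flip to False to actually run the commands

commands = [
    # External log: XFS puts the journal on LOG_DEV instead of inline.
    ["mkfs.xfs", "-l", f"logdev={LOG_DEV}", DATA_DEV],
    # The log device has to be named again at mount time.
    ["mount", "-o", f"logdev={LOG_DEV}", DATA_DEV, MOUNTPOINT],
]

for cmd in commands:
    print(" ".join(cmd))
    if not DRY_RUN:
        subprocess.run(cmd, check=True)
```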

But how to ensure that SSD could survive what was in essence a pure-write workload?

Multi-disk failures: follow-up


By far the biggest criticisms of that piece are the following two ideas.

That's what the background scan process is for. It comes across a bad sector, it reallocates the block. That gets rid of the bad block w-a-y early so you don't ever actually get this problem.

And

That never happens with ZFS. It checksums blocks so it'll even recover the lost data as it's reallocating it.

Which are both very true. That's exactly what those background scanning processes are for, to catch this exact kind of bit-rot before it gets bad enough to trigger the multi-disk failure case I illustrated. Those background processes are important.

Even so, they also have their own failure modes.

  • Some only run when externally initiated I/O is quiet, which never happens for some arrays.
  • Some run constantly, but at low I/O priority. So for very big storage systems, each GB of space may only get scanned once a month if that often.
  • Some run just fine, thank you; they're just built wrong.
    • They only mark a sector as bad if it completely fails to read; sectors that read just fine after the 1st or 2nd retry are passed.
    • They use an ERROR_COUNTER with thresholds set too high.
    • Successful retry-reads don't increment ERROR_COUNTER.
    • Scanning I/O doesn't use the same error-recovery heuristics as Recovery I/O. If Recovery I/O rereads a sector 16 times before declaring defeat, but Scanning only tries 3 times, you can hit an ERROR_COUNTER overflow during a RAID Recovery you didn't expect.
  • Some are only run on-demand (ZFS), and, well, never are. Or are run rarely because it's expensive.

I had mentioned I had seen this kind of fault recently. I have. My storage systems use just these background scanning processes, and it still happened to me.

Those background scanning processes are not perfect, even ZFS's. It's a balance between the ultimate paranoia of "if there is any error ever, fail it!" and the prudence of "rebuilds are expensive, so only do them when we need to." Where your storage systems fall on that continuum is something you need to be aware of.

Disks age! Bad blocks tend to come in groups, so if each block is only getting scanned every few weeks, or worse every other month, a bad spot can take a disk out well before the scanning process detects it. This is the kind of problem that a system with 100 disks faces; back when it was a 24-disk system things worked fine, but as it grew and I/O loads increased, those original 24 disks aren't scanned as often as they should be.
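
To put a number on how slow those scans can get, a quick Python model. The disk size and the background I/O budget are assumed figures, not measurements from my arrays.

```python
# How often does a given block actually get scanned? A rough model with
# assumed numbers: a 2TB disk and a 1 MB/s low-priority scrub budget.

DISK_TB = 2.0              # per-disk capacity
SCRUB_BUDGET_MBPS = 1.0    # background-scan I/O budget per disk, MB/s

seconds_per_pass = (DISK_TB * 1e12) / (SCRUB_BUDGET_MBPS * 1e6)
days_per_pass = seconds_per_pass / 86400

print(f"At {SCRUB_BUDGET_MBPS} MB/s of background I/O per disk,")
print(f"one full pass over a {DISK_TB:.0f}TB disk takes ~{days_per_pass:.0f} days.")
# Halve the budget because foreground I/O grew, and the pass time doubles;
# a marginal sector can sit undetected for the better part of a month either way.
```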


As I said at the end of the piece, this only touches on one way you can get multi-disk failures. There are others, definitely.

How multi-disk failures happen


Having seen this failure mode happen a couple times now, it's time to share. Yes, Virgil, multi-disk failures DO happen during RAID rebuilds. I have pictures, so it MUST be true!

First, let's take a group of disks.
[Image: 00-disks.png]


Eight 2TB drives: a 7-disk RAID5 set, with hot-spare! 10.92TB of usable space! Not going to fill that in a hurry.
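
For anyone wondering where the 10.92TB figure comes from, the arithmetic in Python; the small difference in the last digit is just the usual TB-versus-TiB accounting.

```python
# Where the 10.92TB figure comes from: eight 2TB drives, one held out as a
# hot-spare, seven in the RAID5 set, and one disk's worth lost to parity.
DRIVES = 8
HOT_SPARES = 1
DRIVE_TB = 2                            # marketing terabytes: 2 * 10**12 bytes

data_drives = DRIVES - HOT_SPARES - 1   # RAID5 gives up one disk to parity
usable_bytes = data_drives * DRIVE_TB * 10**12
usable_tib = usable_bytes / 2**40       # what the array manager reports

print(f"Usable: {usable_tib:.2f} 'TB' (really TiB)")   # ~10.91
```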

On this array we have defined several Volumes.
[Image: 01-vols.png]
15 of them, in fact, one of which is actually two volumes merged together at the OS level (vols 2 & 3). That happens.

It just so happens that the particular use-case for this array is somewhat sequential. Most of the data stored on this bad boy is actually archival. The vast majority of I/O is performed against the newest volume, with the older ones just sitting there for reference. Right now, with Vol 15 being the newest, Vol 1 hasn't had anything done to it in a couple of years.

That said, time is not kind to hard-drives.

Tape!

Tape isn't going away, much like mainframes never actually went away. However, its utility is evolving somewhat.

The emergence of the Linear Tape File System is quite interesting. It's an open tape format (nothing new there) that looks to have broad acceptance (the new part). Especially since the LTO governing body has adopted it, and LTO is the de-facto standard for tape in the industry right now.

Open standards make the long-term archive problem easier to tackle, since the implementation details are widely understood and are more likely to either still be in use or have industry expertise available to make it work should an old archive need to be read. They also allow interoperability; a "tier-5" storage tier consisting of tape could allow duplicates of the tape media to be housed in an archive built by another vendor.

In my current line of work, a data-tape using LTFS would be a much cheaper carrier for a couriered TB of data than a hard-drive would. We haven't seen this yet, but it remains a decided possibility.

Understandably, the video content industry is a big fan of this kind of thing since their files are really big and they need to keep them around for years. The same technology could be used to build a computer-image jukebox for a homebrew University computer-lab imaging system.

Looking around, it seems people are already using this with CloneZilla. Heh.

It's hard to get excited about tape, but I'll settle for 'interested'.

"How do I make my own Dropbox without using Dropbox" is a question we get a lot on ServerFault.

And judging by the Dropbox Alternatives question, the answer is pretty clear.

iFolder.

Yes, that Novell thingy.

I've used the commercial version, but the open-source version does most of what the paid one does. I suspect the end-to-end encryption option is not included, possibly due to licensing concerns. But the whole, "I have this one directory on multiple machines that exists on all of 'em, and files just go to all of them and I don't have to think about it," thing is totally iFolder.

The best part is that it has native clients for both Windows and Mac, so no futzing around with Cygwin or other GNU compatibility layers.

An older problem

I deal with some large file-systems. Because of what we do, we get shipped archives with a lot of data in them. Hundreds of gigs sometimes. These are data provided by clients for processing, which we then do. Processing sometimes doubles, or even triples or more, the file-count in these filesystems depending on what our clients want done with their data.

One 10GB Outlook archive file can contain a huge number of emails. If a client desires these to be turned into .TIFF files for legal processes, that one 10GB .pst file can turn into hundreds of thousands of files, if not millions.
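
For a sense of scale, the rough math in Python. The average message size and pages-per-message figures are pure assumptions; real mailboxes vary wildly.

```python
# Rough math behind "one 10GB .pst can turn into hundreds of thousands of
# files". Average message size and pages-per-message are assumptions.
PST_GB = 10
AVG_MESSAGE_KB = 75        # assumed: body, headers, the occasional attachment
AVG_PAGES_PER_MESSAGE = 3  # assumed: each page becomes its own .TIFF

messages = PST_GB * 1024 * 1024 / AVG_MESSAGE_KB
tiffs = messages * AVG_PAGES_PER_MESSAGE

print(f"~{messages:,.0f} messages -> ~{tiffs:,.0f} TIFF files")
```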

I've had cause to change some permissions at the top of some of these very large filesystems. By large, I mean larger than the big FacShare volume at WWU in terms of file-counts. As this is a Windows NTFS volume, it has to walk the entire file-system to apply permission changes made at the top.

This isn't the exact problem I'm fixing, but it's much like what happens at companies where permissions are granted to specific users instead of to groups: that one user goes elsewhere, suddenly all the rights are broken, and it takes a day and a half to get the rights update processed (and heaven help you if it stops half-way for some reason).

Big file-systems take a long time to update rights inheritance. This has been a fact of life on Windows since the NT days. Nothing new here.
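
To see why it takes so long, here's a toy Python script that just counts the objects an inheritance change at the top would have to touch. It doesn't modify any security descriptors; it only measures the size of the job.

```python
# Why the permissions change takes so long: the change at the top has to be
# written onto every file and folder underneath it, so the work scales with
# the file count. This toy walk only counts that work.
import os
import sys

def propagation_workload(top: str) -> int:
    """Count the objects an inheritance change at `top` would have to touch."""
    touched = 0
    for _dirpath, _dirnames, filenames in os.walk(top):
        touched += 1 + len(filenames)   # the directory itself, plus its files
    return touched

if __name__ == "__main__":
    top = sys.argv[1] if len(sys.argv) > 1 else "."
    print(f"{propagation_workload(top):,} objects would need their ACL rewritten under {top}")
```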

But... it doesn't have to be this way. I explain under the cut.
HP has been transitioning away from the cciss Linux kernel-driver for a while now, but there hasn't been much information about what it all means. On the name alone the module needed a rename (one possible expansion of cciss: Compaq Command Interface for SCSI-3 Support), and it's a driver that has been in the Linux ecosystem a really long time (since at least the 2.2 kernel era). A lot has changed in the kernel since then.

HP has finally released a PDF describing the whole cciss vs. hpsa thing.

Read it here: http://h20000.www2.hp.com/bc/docs/support/SupportManual/c02677069/c02677069.pdf

The key differences:
  • HPSA is a SCSI driver, not a block-driver like CCISS
  • This means the device nodes move from /dev/cciss/* to standard /dev/sd* names
  • Device node numbers (major/minor) will change
  • Kernel names can shift when controllers are added, so the disk you knew as /dev/sda may not stay /dev/sda; use udev names (partition ID, disk-ID, that kind of thing) to avoid pain. See the sketch after this list.
  • For newer kernels (2.6.36+) cciss and hpsa can load at the same time if the system contains hardware that needs both drivers.
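
As a small aid for that udev-names advice, here's a standard-library Python snippet that maps the unstable kernel names to their persistent /dev/disk/by-id links. The by-id layout is standard udev on Linux; the script itself is just a convenience sketch of mine.

```python
# Map the unstable kernel names (/dev/sda, /dev/sdb, ...) to the persistent
# /dev/disk/by-id links udev creates, which survive controller reordering.
# Standard library only; run it on the box in question.
import os

BY_ID = "/dev/disk/by-id"

def persistent_names() -> dict:
    """Return {kernel device: [persistent by-id names pointing at it]}."""
    mapping = {}
    for name in sorted(os.listdir(BY_ID)):
        target = os.path.realpath(os.path.join(BY_ID, name))
        mapping.setdefault(target, []).append(name)
    return mapping

if __name__ == "__main__":
    for dev, names in sorted(persistent_names().items()):
        print(dev)
        for name in names:
            print(f"    {BY_ID}/{name}")
```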

Is network now faster than disk?

Way back in college, when I was earning my Computer Science degree, the latencies of computer storage were taught like so:

  1. On-CPU registers
  2. CPU L1/L2 cache (this was before L3 existed)
  3. Main Memory
  4. Disk
  5. Network
This question came up today, so I thought I'd explore it.

The answer is complicated. The advent of Storage Area Networking was made possible because a mass of shared disk is faster, even over a network, than a few local disks. Nearly all of our I/O operations here at WWU are over a fibre-channel fabric, which is disk-over-the-network no matter how you dice it. With iSCSI and FC over Ethernet this domain is getting even busier.

That said, there are some constraints. "Network" in this case is still subject to distance limitations. A storage array 40km from the processing node will still see higher storage latency than the same type of over-the-network I/O 100m away. Our accesses are fast enough these days that the speed-of-light round-trip time for 40km is measurable versus 100m.

A very key difference here is that the 'network' component is handled by the operating system and not application code. For SAN, an application requests certain portions of a file; the OS translates that into block requests, which are then translated into storage-bus requests. The application doesn't know that the request was served over a network.

For application development the above tiers of storage are generally well represented.

  1. Registers: unless the programming is in assembly, most programmers just trust the compiler and OS to handle these right.
  2. L1/2/3 cache: as above, although well-tuned code can maximize the benefit this storage tier can provide.
  3. Main memory: this is directly handled through code. One might argue that, at a low level, memory handling constitutes a majority of what code does.
  4. Disk: represented by file-access or sometimes file-as-memory API calls. These tend to be discrete calls, unlike main-memory access.
  5. Network: yet another completely separate call structure, which means using it requires explicit programming.
Storage Area Networking is parked in step 4 up there. Network can include things like making NFS connections and then using file-level calls to access data, or actual Layer 7 stuff like passing SQL over the network.

For massively scaled out applications, the network has even crept into step 3 thanks to things like memcached and single-system-image frameworks.
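
As a sketch of what that looks like from application code: a cache-aside lookup that treats memcached as an extra, network-attached memory tier. The memcached address, the pymemcache client, and the function names are assumptions for the example, not anything from a particular deployment.

```python
# Cache-aside lookup treating memcached as a network-attached memory tier.
# Assumes a memcached server on localhost:11211 and the third-party
# `pymemcache` client; the record-loading function is a stand-in.
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

def load_from_storage(user_id: str) -> bytes:
    # Stand-in for the slower, disk-backed tier (step 4 above).
    return f"record for {user_id}".encode()

def fetch_user_record(user_id: str) -> bytes:
    """Check the memory tier (over the network) first, then fall back to disk."""
    key = f"user:{user_id}"
    value = cache.get(key)
    if value is None:
        value = load_from_storage(user_id)
        cache.set(key, value, expire=300)  # park it in the memory tier briefly
    return value

if __name__ == "__main__":
    print(fetch_user_record("42"))
```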

Network is now competitive with disk, though so far the best use-cases let the OS handle the network part instead of the application doing it.

Rogue file-servers

Being the person who manages our centralized file-server, I also have to deal with storage requests. The requests get directed to a layer or two higher than me, but I'm the one who has to make it so, or add new capacity when the time comes. People never have enough storage, and when they ask for more, sticker-shock often means they decide they can't have it.

It's a bad situation. End-users have a hard time realizing that the $0.07/GB hard-drive they can get from NewEgg has no bearing on what storage costs for us. My cheap-ass storage tier is about $1.50/GB, and that's not including backup infrastructure costs. So when we present a bill that's much more than they're expecting, the temptation to buy one of those 3.5TB 7.2K RPM SATA drives from NewEgg and slap it in a PC-turned-fileserver is high.

Fortunately(?) due to the decentralized nature of the University environment, what usually happens is that users go to their college IT department and ask for storage there. For individual colleges that have their own IT people, this works for them. I know of major storage concentrations that I have absolutely nothing to do with in the Libraries and the College of Science and Technology, and a smaller but still significant amount in Huxley. CST may have as much storage under management as I do, but I can't tell from here.

Which is to say, we generally don't have to worry about the rogue file-server problem. That problem? That's what happens when you have a central storage system that can't meet demand, and no recourse for end-users to fix it some other way.

And I'd hate to be the sysadmin who has to come down on that person like a ton of bricks. I'd do it, and I wouldn't like it, because I also hate failing to meet my users' needs that flagrantly, but I'd still do it. Having users do that kind of end-run leads to pain everywhere in time.
