Recently in storage Category

I've talked about this before, and I'm sure I'll do it again. We do need to reduce some of the excessive packaging on the things we get. I can completely understand the need to swaddle a $57,000 storage controller in enough packaging to survive a 3 meter drop. What I don't understand is shipping the 24 hard drives that go with that storage controller in individual boxes. It wouldn't take much engineering to come up with a 6-pack foam holder for hard-drives. It would seriously reduce bulk, which makes it easier and cheaper to ship, and there is less material used in the whole process. But I guess that extra SKU is too much effort.

Today I turned this:
HP-BoxesA.jpg

Into this:

HP-BoxesB.jpg

The big box at the top of the stack contained 24 individual hard-drive boxes. Each box had:
  • 1 hard-drive.
  • 1 anti-static bag requiring a knife to open.
  • 2 foam end-pieces to hold the drive in place in the box.
  • 1 piece of paper of some kind, white.
  • 1 cardboard box, requiring a knife to open.
When I was done slotting all of those in, I had a large pile of cardboard boxes, a big jumble of green foam bits, a slippery pile of anti-static bags, and a neat pile of paper. The paper and cardboard can easily be recycled. The anti-static bags and foam bits... not so much. Although, the foam bits were marked type 4 plastic (LDPE), which means they were possibly made from recyclable materials, right?

Right?

I'd still like to use less of it.

TCP problems

My testing for a cheap NAS solution has progressed to the option that costs the most money, Windows 2008 running KernSafe's iStorage. As it happens, it works really well when the iSCSI initiator is Windows but Linux clients don't really want to talk to it. Windows: 30-50 MB/s. Linux: 3-5 MB/s. Biiiig difference there.

Looking at packets I'm noticing a similar pattern on the wire to one I'd seen before. Back when I was troubleshooting exactly why NetWare backups to DataProtector were horrible I came across this problem. It seems that TCP windowing is fundamentally broken between Server 2008 and NetWare, which leads to really bad throughput, which in turn is very bad for half-TB backups. The receiving server seemed to feel the need to ACK after every two packets, which really slowed things down. And that's what the Linux clients are doing for iSCSI to Server 2008.

It has to be something affecting basic TCP services but not complex protocols. Using smbclient to upload a 4GB DVD iso runs at 50MB/s but the iSCSI throughput on the same client is a piddly 3-5MB/s. I'm sure some kind of tuning on either side might be able to jar things loose, heaven knows Linux 2.6.31 is a heck of a lot more current on TCP settings than NetWare 6.5 SP8 is. I just haven't found it yet.
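To put rough numbers on why that ACK pattern could hurt: if the sender only keeps two segments in flight and stalls until the ACK comes back, throughput is capped at bytes-in-flight divided by round-trip time. Here's a back-of-envelope sketch in Python. The 1460-byte segment size and the round-trip times are assumptions for illustration, not values pulled from my packet captures, and the function name is made up.

```python
def window_limited_throughput(segments_in_flight, rtt_seconds, mss_bytes=1460):
    """Upper bound on throughput (bytes/s) when the sender stalls on each ACK."""
    return segments_in_flight * mss_bytes / rtt_seconds


# Assumed round-trip times; substitute values from your own packet capture.
for rtt_ms in (0.25, 0.5, 1.0):
    bps = window_limited_throughput(2, rtt_ms / 1000)
    print(f"RTT {rtt_ms} ms: at most {bps / 1e6:.1f} MB/s with 2 segments in flight")
```

With round trips somewhere between half a millisecond and a millisecond, two segments in flight lands in the same single-digit MB/s neighborhood as the iSCSI numbers above. That doesn't prove anything, but it's at least consistent with the ACK pattern being the culprit.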

Conversely, Server 2008 as the initiator talking to a Linux iSCSI host works at line speed pretty much. I'm testing this for completeness's sake. We need something that can serve up to 30TB via both iSCSI and SMB. My findings aren't fully complete yet, but in general:
  • OpenFiler: GREAT iSCSI host, completely blows for SMB in our environment.
  • OpenSolaris: Great iSCSI host, just can't convince the kernel-mode CIFS to join our domain. Also, worst-of-breed random I/O performance.
  • OpenFiler + Windows: OpenFiler for iSCSI, Windows (mounting an iSCSI share) for SMB. Should work GREAT. Current best bet for the future.
  • OpenSolaris + Windows: As previous option, but I/O problems make it less attractive.
  • Windows + KernSafe: GREAT SMB performance, solid iSCSI for Windows hosts. Linux hosts will take lots of tuning (perhaps, or it could be intractable).
There is a misconception about solid-state drives that's rather pernicious. Some people have grabbed onto the paranoia surrounding the SSDs of several years ago and have hung onto that as gospel truth. What am I referring to? The fundamental truth that our current flash-drive technology has an upper limit on the number of writes per memory cell, coupled with a lack of faith in ingenuity.

Once Upon a Time, it was commonly bandied about that SSD memory cells only had 100,000 writes in them before going bad. Cross that with past painful experience with HD bad sectors and you scared off a whole generation of storage administrators. These scars seem to linger.

The main problem cited is/was hot-spotting on the drive itself. Certain blocks get written to a LOT (the journal, the freespace bitmap, certain critical inodes, etc) and once those wear out, well... you have a brick.

This perception has some basis in truth, most especially in the el-cheapo SSD drives of several years ago, but not any more. The enterprise class solid-state drive has not had this problem for a very long time. The exact technical details have been covered quite a lot in the media and Anandtech has had several good articles on it.

Part of the problem here is the misconception that a storage block as seen by the operating system corresponds with a single block on the storage device itself. This hasn't been the case since the 1980s, when SCSI drives introduced sector reallocation as a way to handle bad sectors. Back in the day, and heck right now for rotational media, each hard drive keeps a stash of reallocation sectors that act as substitutes for actual bad sectors. When a reallocation happens most operating systems will throw alarms about pre-fail, but the data is still intact. What's more, the operating system doesn't necessarily know which sectors got reallocated. What looks like a contiguous block on the file allocation table actually has a sector significantly apart from the rest, which can impact performance of that file.

Solid state disks take this to another level. SSD vendors know darned well that flash ages, so they allocate a much larger chunk of storage for this exact reallocation scheme. The enterprise SSD drives out there have a larger percentage of this reserve space than consumer-grade SSD drives. As blocks wear out, they're substituted in real time from the reallocation block, and since solid-state-drives don't cause I/O latency increases when accessing non-contiguous blocks you'll never know the difference.

The other thing SSDs do is something called wear-leveling. The exact methods vary by manufacturer, but they all do it. The chipset on the drive itself makes sure that no cell gets pounded with writes more than others. For instance, when handling an 'overwrite' operation it'll write to a new block and mark the old block as free. The physical block corresponding to a logical block can change on a daily basis thanks to this. Blocks that get written to constantly, that darned journal again, will be constantly on the move.
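Here's a toy model of the idea in Python. It is emphatically not any vendor's algorithm: the `ToyFTL` class and its "pick the least-worn free block" policy are made up for illustration, and real controllers are far more sophisticated. It just shows how hammering one logical block ends up spread across every physical block.

```python
class ToyFTL:
    """Toy flash translation layer: maps logical blocks to physical blocks."""

    def __init__(self, physical_blocks):
        self.wear = [0] * physical_blocks       # writes absorbed per physical block
        self.mapping = {}                       # logical block -> physical block
        self.free = set(range(physical_blocks))

    def write(self, logical_block):
        # Put the new copy on the least-worn free physical block.
        target = min(self.free, key=lambda b: (self.wear[b], b))
        self.free.remove(target)
        self.wear[target] += 1
        # The block holding the previous copy goes back into the free pool.
        old = self.mapping.get(logical_block)
        if old is not None:
            self.free.add(old)
        self.mapping[logical_block] = target


ftl = ToyFTL(physical_blocks=8)
for _ in range(8000):
    ftl.write(0)  # hammer a single "journal" block

print(ftl.wear)
```

Even though every write went to logical block 0, no physical block took more than its share of the abuse.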

The really high end SSD drives have a super-capacitor built into them and onboard cache. The chipset moves the high-write blocks to that cache to further reduce write-wear. The super-cap is there in case of sudden power loss where it'll commit blocks in the cache into flash. When you're paying over $2K for 512GB of space, this is the kind of thing you're buying.

All of these techniques combine to ensure your shiny new SSD will NOT wear itself out after only 100K writes. Depending on your workload, these drives can happily last three years or more. Obviously, if the workload is 100% writes they won't last as long, but you generally don't want SSD for 100% write loads anyway; you use SSDs for the blazing fast reads.
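The arithmetic behind that is simple enough to sketch. Assuming ideal wear leveling, the write budget for one hot logical block is roughly the per-cell endurance multiplied by the number of physical blocks the controller can rotate it through. The 100-block rotation pool below is a made-up figure for illustration, and the function name is hypothetical; real drives vary a lot.

```python
def logical_write_budget(per_cell_endurance, blocks_in_rotation):
    """Approximate writes one hot logical block can absorb under ideal wear leveling."""
    return per_cell_endurance * blocks_in_rotation


# Assumed figures: the old 100K-writes-per-cell number and a 100-block rotation pool.
budget = logical_write_budget(per_cell_endurance=100_000, blocks_in_rotation=100)
print(f"{budget:,} writes")  # prints "10,000,000 writes"
```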

For modern SSD drives:
  • You do NOT need special SSD-aware filesystems. Generally, these are only for stupid SSD drives like a RAID array of MicroSD cards.
  • For most common workloads you do NOT need to worry about write-minimization.
  • They can handle in the millions to tens of millions of write operations per logical block (yes, it'll consume multiple physical blocks over its lifetime for that, but that's how this works).
It's time to move on.

OpenSolaris

I've been checking out OpenSolaris for a NAS possibility, and it's pretty nifty. A different dialect than I'm used to, but still nifty.

Unfortunately, it seems to have a nasty problem in file I/O. Here are some metrics (40GB file, with 32K and 64K record-sizes).

OpenFiler                                 random  random
              KB  reclen   write    read    read   write
        41943040      32  296238  118598   15682   62388
        41943040      64  297141  118861   23731   86620

OpenSolaris                               random  random
              KB  reclen   write    read    read   write
        41943040      32  259170 1179515    8458    7461
        41943040      64  244747 1133916   13894   13001
Identical hardware, but different operating systems. I've figured out that the stellar Read performance is due to the zfs 'recordsize' being 128k. When I drop it down to 4k, similar to the block-size of XFS in OpenFiler, the Read performance is very similar. What I don't get is what's causing the large difference in random I/O. Random-write is exceedingly bad. With the recordsize dropped to 4K on ZFS the random-read gets even worse; I haven't stuck with it long enough to see what it does to random-write.
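To make the size of the random I/O gap concrete, here's a quick Python sketch that works out the ratios from the two tables above. The numbers are copied straight from those tables; I'm assuming they're KB/s, and the dictionary layout and function name are just for illustration.

```python
# Throughput figures copied from the two tables above (assumed to be KB/s).
openfiler = {
    32: {"random_read": 15682, "random_write": 62388},
    64: {"random_read": 23731, "random_write": 86620},
}
opensolaris = {
    32: {"random_read": 8458, "random_write": 7461},
    64: {"random_read": 13894, "random_write": 13001},
}


def slowdown(reclen, metric):
    """How many times slower OpenSolaris is than OpenFiler for one metric."""
    return openfiler[reclen][metric] / opensolaris[reclen][metric]


for reclen in (32, 64):
    for metric in ("random_read", "random_write"):
        print(f"{reclen}K {metric}: {slowdown(reclen, metric):.1f}x slower")
```

Random-read comes out a bit under 2x slower on OpenSolaris, while random-write is somewhere between 6x and 9x slower depending on record size.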

Poking into iostat shows that both OpenFiler and OpenSolaris are striping I/O across the four logical disks available to them. I know the storage side is able to pump the I/O, as witnessed by the random-write speed on OpenFiler. The chosen file-size is larger than local RAM so local caching effects are minimized.

As I mentioned back in the know-your-IO article series, random-read is the best analog of the type of I/O pattern your backup process follows when backing up large disorganized piles of files. Cache/pre-fetch will help with this to some extent, but the above numbers give a fair idea as to the lower bound of speed. OpenSolaris is w-a-y too slow. At least, how I've got it configured, which is largely out-of-the-box.

Unfortunately, I don't know if this bottleneck is a driver issue (HP's fault) or an OS issue. I don't know enough of the internals of ZFS to hazard a guess.

More than OpenFiler

I've received better requirements than I had before, and OpenFiler by itself doesn't meet them. The requirements are, roughly:
  • Must support both file-based and block-based storage serving.
  • Must have some kind of non-hierarchical backup capability.
  • Able to create a mirror copy of the storage in a remote location.
This distills down to:
  • Must support both iSCSI and SMB serving.
  • Must have snapshots, or some other copy-on-write technology.
  • DRBD or some other replication technology.
Since OpenFiler's SMB integration just doesn't work in our environment, I can't use just that. Also, Samba's habit of requiring an smb-daemon reset to add shares makes it annoying to work with. We can't risk pissing off the Access database users (not to mention PST users) who'd be most peeved when they have to do DB recovery on their files after a reset. Nothing a little change-management can't fix, but our users are already used to instant gratification.

Another option, less free, is to use a combination of Windows Server 2008 and KernSafe iStorage. It has the features we need, and the entire environment is still cheaper per GB than the other storage options we already have.

A second potential is the combination of OpenFiler in pure iSCSI mode and then a Windows Server 2008 instance in the ESX cluster to front-end iSCSI storage for SMB sharing. This has its problems as well, as filers are memory hungry, and we're currently bandwidth-constrained in the ESX cluster (this is changing, but we're still a month or two out from fixing that). Once you amortize utilized resources for this ESX-based filer you get a price that's pretty close to the KernSafe/Windows combo if not a bit more expensive.

I'm open to other ideas, but in the meantime KernSafe's free option has enough of the right features that I can at least test the thing.
I've been playing around with OpenFiler the last week. It seems to fit our need for a free-to-us software package that allows us to serve both CIFS and iSCSI from the same host, in an easy to manage package. I haven't done much serious testing with it, but I have done enough to get a feel for how it works.

One thing is pretty clear: if we domain this thing, certain UI elements become unusable due to timeouts building the page. Because we have so many groups in our AD tree, and because it has to list every single group in the system in one big pick-list on the Share Permission screen, that page takes a very, very long time to load. Long enough that it won't show the network-based permissions dialog at the bottom of the page, which is critical for enabling CIFS sharing in the first place. Unless I can find a tweak somewhere, that's a pretty serious road-block for CIFSiness.

iSCSI, on the other hand, just flies like a dream. I haven't had a chance to try out real complexity with it, since I lack enough servers with GigE NICs capable of an MTU larger than 1500 bytes that can be used for testing, so I can't say how robust it is. But I can say that I can saturate the GigE NIC in the OpenFiler box.

This does suggest a solution, though. We'll need another (%!#$!) server, upon which we'll install a Winders of some flavor and use an iSCSI presentation for the storage. Or, if we feel like we need more hand-holding in our lives, a Linux box of some flavor and hand-roll the Samba config needed.

This thing can also do NFS, but we have limited demand for that. The same goes for 1980s-style FTP. There is also a WebDAV option, but I shiver at the notion of turning that on; the WebDAV setup in our existing file-cluster has already caused enough hair loss, thank you.

It can also do snapshots. Since this thing is Linux based, these are LVM-level snapshots. That could be useful.

File-systems are restricted to Ext3 and XFS, which is good to a point. These are not the filesystems you want for multi-million-file shares. However, if all you want is bulk storage for disk images, they're just peachy. Or a departmental share space (hundreds of thousands of files). Neither of these is terribly great at handling the "bajillion files in one directory" problem, but we have few of those as it is.

But if we can't figure out a way to make the CIFS sharing useful, file-system choice is mostly moot.

Anyway, more testing!
Today I ran into another post that goes into a practical example of diagnosing I/O problems on a Linux host. It includes actual math, unlike what I did earlier.

http://www.cmdln.org/2010/04/22/analyzing-io-performance-in-linux/

The author also included a series of links at the bottom of the post for 'further reading' about storage issues, including a series of articles much like the one I just got done with, but with more of a virtualization point of view than I had.
Except if you're using HP Data Protector.

Much as I'd like to jump on the backup-to-disk de-dup bandwagon, I can't. Can't afford it. It all comes down to the cost-per-GB of storage in the backup system.

With tape, Data Protector licenses on the following items:
  • Per tape-drive over 2
  • Per tape library with a capacity between 50 and 250 slots
  • Per tape library that exceeds 250 slots
  • Per media pool with more than some-big-number of tapes
With disk, DP licenses on the following items:
  • Per TB in the backup-to-disk system
Obviously, the Disk side is much easier to license. In our environment we had something like 500 SDLT320 tapes, and our library had 6 drives and 45 slots. We only had to license the 4 extra tape drives.

Then our library started crapping out, and we outgrew it anyway. Prime time to figure out what the future holds for our backup environment. TO DISK!

HOLY CRAP that's expensive.

HP licenses their B2D space by the Terabyte. After you do the math it comes down to about $5/GB. Without using a de-duplication technology, you can easily make 10 copies or more of every bit of data subject to backup. Which means that for every 1 GB of data in the primary storage, 10GB of data is in the B2D system, and that'll set us back a whopping $50/GB. So... about the de-duplication system...

Too bad it doesn't work for non-file data, and kinda sorta explicitly doesn't work for clustered systems. Since 70% or so of our backup data is sourced from clustered file-servers or is non-file data (Exchange, SQL backups), this means the gains from HP's de-dup technology are pretty minor. Looks like we're stuck doing standard backups at $50/GB (or more).

So, about that 'dead' tape technology! We've already shelled out for the tape-drive licenses so that's a sunk cost. The library we want doesn't have enough slots to force us to get that license. All that's left is the media costs. Math math math, and the amortized cost of the entire library and media set comes to about $0.25/GB. Niiice. Factor in the magnification factor, and each 1 GB of primary data will cost $2.50 to back up, a far, far cry from $50.
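For the skeptical, the math really is that short. A quick Python sketch using the figures from this post: $5/GB for B2D licensing, $0.25/GB amortized for tape, and roughly 10 copies of every primary GB when there's no de-dup. The function and constant names are just for illustration.

```python
COPIES_PER_PRIMARY_GB = 10  # magnification factor without de-duplication


def cost_per_primary_gb(cost_per_backup_gb, copies=COPIES_PER_PRIMARY_GB):
    """Backup cost for one GB of primary data, given the cost per GB stored."""
    return cost_per_backup_gb * copies


disk = cost_per_primary_gb(5.00)   # B2D licensing, about $5/GB
tape = cost_per_primary_gb(0.25)   # amortized library plus media, about $0.25/GB
print(f"disk: ${disk:.2f}, tape: ${tape:.2f}, ratio: {disk / tape:.0f}x")
```

That's a 20x difference per GB of primary data, before even counting the disk itself.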

We still have SOME backup to disk space. This is needed since these LTO4 drives are HUNGRY critters, and the only way to feed them fast enough to prevent shoe-shining is to back everything up to disk, and then copy the jobs to tape directly from disk. So long as we have a week's worth of free-space, we're good. This is a sunk cost too, happily.

So. To-disk backups may be the greatest thing since the invention of the tape-changing robot, but our software isn't letting us take advantage of it. 

New backup hardware

Friday represented the first production use of our new LTO4-based tape library. This replaced the old SDLT320 based Scalar 100 we've had for entirely too long. The simple fact that all of the media and drives are BRAND NEW should make our completion rate go very close to 100%. This excites me.

Friday we did a backup of our main file-serving cluster and the Blackboard content volume in a single job that streamed to a single tape drive.

Total data backed up: 6.41TB
Total time: 1475 minutes
Speed: 4669 MB/Min, or 77 MB/s

Still not flank speed for LTO4 (that's closer to 120 MB/s) but still markedly faster than the SDLT stuff we had been doing. The similar backup on the Scalar 100 took around 36 hours (2160 minutes) instead of the 24ish hours this one took, and it used 4 tape drives to do it.
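For reference, here's the conversion and comparison as a small Python sketch, using only the numbers reported above. The 120 MB/s figure is the rough LTO4 streaming rate I quoted, not a measured value.

```python
LTO4_NATIVE_MB_S = 120  # rough LTO4 streaming rate cited above

new_mb_per_min = 4669   # reported speed of the new job
new_minutes = 1475      # new library, 1 drive
old_minutes = 2160      # Scalar 100, 4 drives

new_mb_s = new_mb_per_min / 60
print(f"{new_mb_s:.1f} MB/s, {new_mb_s / LTO4_NATIVE_MB_S:.0%} of LTO4 streaming speed")
print(f"old job took {old_minutes / new_minutes:.2f}x as long, on 4 drives instead of 1")
```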

Ahhhh, modern technology, how I've desired you.

*pets it*

Now to resist taking a fire ax to the old library. We have to surplus it through official channels, and they won't take it if it has been "obviously defaced". Ah well.
That last series of articles might suggest I've been doing storage administration for a while. And I have. But every so often I run across an article that just reminds me that I'm still in the shallow end.

Like this article from The Register, going over Quantum's new mega-library, the i6000. I have a buddy who has an i2000 and I've petted it. Lovingly. *sigh* This new baby can store 8PB. Petabytes, baaybee. LTO5. Mmmm. Sexy.

Storage is a major concern just now. One of the main reasons that there are still IT stacks on campus that aren't centralized is storage. We have researchers, generally in the College of Science and Technology, that use departmental, rather than central, resources for storing their data. Departmental means servers, so CST represents the biggest non-ITS concentration of IT at WWU. They don't have any shared storage arrays over there, so they make do with large direct-attach-storage servers. A quick back-of-envelope calculation says that they have about as much storage in DAS as we have on our fastest SAN-attached storage array. Combine that with the chronic storage shortages central IT has had for the past, oh, 15 years and you have an entrenched set of servers over there.

If they were to join us in the borg ITS my area just might crack 100TB in disk space. Ooo. An i2000 with LTO4 still would be overkill for a storage network that large. And the i2000 can expand to several cabinets.

Yeeeah. WWU is still strictly small time when it comes to storage. In a lot of ways I'm a Stand Alone Storage Administrator.

About this Archive

This page is an archive of entries from June 2010 listed from newest to oldest.
