Recently in msa Category

Storage that makes you think

| 2 Comments
Anandtech has a nice article up right now that compares SAS, SATA, and SSD drives in a database environment. Go read it. I'll wait.

While the bulk of the article is about how much the SSD drives blow the pants off of rotational magnetic media, the charts show how SAS performs versus SATA. As they said at the end of the article:
Our testing also shows that choosing the "cheaper but more SATA spindles" strategy only makes sense for applications that perform mostly sequential accesses. Once random access comes into play, you need two to three times more SATA drives - and there are limits to how far you can improve performance by adding spindles.
Which matches my experience. SATA is great for sequential loads, but is bottom of the pack when it comes to random I/O. In a real world example, take this MSA1500CS we have. It has SATA drives in it

If you have a single disk group with 14 1TB drives in it, this gives a theoretical maximum capacity of 12.7TB (that storage industry TB vs OS TB problem again). Since you can only have LUNs as large as 2TB due to the 32-bit signed integer problem, this would mean this disk group would have to be carved into 7 LUNs. So how do you go about getting maximum performance from this set up?

You'll have to configure a logical volume on your server such that each LUN appends to the logical volume in order, and then make sure your I/O writes (or reads) sequentially across the logical volume. Since all 7 LUNs are on the same physical disks, any out-of-order arrangement of LUN on that spanned logical volume would result in semi-random I/O and throughput would drop. Striping the logical volume just ensures that every other access requires a significant drive-arm move, and would seriously drop throughput. It is for this reason that HP doesn't recommend using SATA drives in 'online' applications.

Another thing in the article that piqued my interest is there on page 11. This is where they did a test of various data/log volume combinations between SAS and SSD. The conclusion they draw is interesting, but I want to talk about it:
Transactional logs are written in a sequential and synchronous manner. Since SAS disks are capable of delivering very respectable sequential data rates, it is not surprising that replacing the SAS "log disks" with SSDs does not boost performance at all.
This is true, to a point. If you have only one transaction log, this is very true. If you put multiple transaction logs on the same disk, though, SSD becomes the much better choice. They did not try this configuration. I would have liked to have seen a test like this one:
  • Three Data volumes running on SAS drives
  • One Log volume running on an SSD with all three database logs on it
I'm willing to bet that the performance of the above would match, if not exceed, running three separate log volumes running on SAS.

The most transactional database in my area is probably Exchange. If we were able to move the Logs to SSD's, we very possibly could improve performance of those databases significantly. I can't prove it, but I suspect we may have some performance issues in that database.

And finally, it does raise the question of file-system journals. If I were to go out and buy a high quality 16GB SSD for my work-rig, I could use that as an external journal for my SATA-based filesystems. As it is an SSD, running multiple journals on it should be no biggie. Plus, offloading the journal-writes should make the I/O on the SATA drives just a bit more sequential and should improve speeds. But would it even be perceptible? I just don't know.

MSA performance in the new config

Today I reconfigured the MSA1500 to run in Active/Active mode. While there, I also rearranged our disk arrays. We have 41, 500GB, 7.2K RPM drives in there. I created two, 20 disk Arrays, and filled each array with Raid 0+1 LUNs. This yielded 9TB of useful space. That extra drive will stay extra until we get an odd number of new drives.

Yes, a profligate waste of space but at least it'll be fast. It also had the added advantage of not needing to stripe in like Raid5 or Raid6 would have. This alone saved us close to two weeks flow time to get it back into service.

Another benefit to not using a parity RAID is that the MSA is no longer controller-CPU bound for I/O speeds. Right now I have a pair of writes, each effectively going to a separate controller, and the combined I/O is on the order of 100Mbs while controller CPU loads are under 80%. Also, more importantly, Average Command Latency is still in the 20-30ms range.

The limiting factor here appears to be how fast the controllers can commit I/O to the physical drives, rather than how fast the controllers can do parity-calcs. CPU not being saturated suggests this, but a "show perf physical" on the CLI shows the queue depth on individual drives:
Queue depth chart
The drives with a zero are associated with LUNs being served by the other controller, and thus not listed here. But a high queue depth is a good sign of I/O saturation on the actual drives themselves. This is encouraging to me, since it means we're finally, finally, after two years, getting the performance we need out of this device. We had to go to an active/active config with a non-parity RAID to do it, but we got it.

Moving storage around

| 1 Comment
The EVA6100 went in just fine with that one hitch I mentioned, and now comes all the work we need to do now that we have actual space again. We're still arguing over how much space to add to which volumes, but once we decide all but Blackboard will be very easy to add.

Blackboard needs more space on both the SQL server and the Content server, and as the Content server is clustered it'll require an outage to manage the increase. And it'll be a long outage, as 300GB of weensy files takes a LONG time to copy. The SQL server uses plain old Basic partitions, so I don't think we can expand that partition, so we may have to do another full LUN copy which will require an outage. That has yet to be scheduled, but needs to happen before we get through much of the quarter.

Over on the EVA4400 side, I'm evacuating data off of the MSA1500cs onto the 4400. Once I'm done with that, I'm going to be:
  1. Rebuilding all of the Disk Arrays.
  2. Creating LUNs expressly for Backup-to-Disk functionality.
  3. Flashing the Active/Active firmware on to it, the 7.00 firmware rev.
  4. Get the two Backup servers installed with the right MPIO widgetry to take advantage of active/active on the MSA>
But first we need the DataProtector licensing updates to beat its way through the forest of paperwork and get ordered. Otherwise, we can't use more than 5TB of disk, and that's WAY wimpy. I need at LEAST 20, and preferably 40TB. Once that licensing is in place, we can finally decommission the out-of-license BackupExec server and use the 6 slot tape library with DataProtector instead. This should significantly increase how much data we can throw at backup devices during our backup window.

What has yet to be fully determined is exactly how we're going to use the 4400 in this scheme. I expect to get between 15-20TB of space out of the MSA once I'm done with it, and we have around 20TB on the 4400 for backup. Which is why I'd really like that 40TB license please.

Going Active/Active should do really good things for how fast the MSA can throw data at disk. As I've proven before the MSA is significantly CPU bound for I/O to parity LUNs (Raid5 and Raid6), so having another CPU in the loop should increase write throughput significantly. We couldn't do Active/Active before since you can only do Active/Active in a homogeneous OS environment, and we had Windows and NetWare pointed at the MSA (plus one non-production Linux box).

In the mean time, I watch progress bars. TB of data takes a long time to copy if you're not doing it at the block level. Which I can't.

That darned budget

| 1 Comment
This is where I whine about not having enough money.

It has been a common complaint amongst my co-workers that WWU wants enterprise level service for a SOHO budget. Especially for the Win/Novell environments. Our Solaris stuff is tied in closely to our ERP product, SCT Banner, and that gets big budget every 5 years to replace. We really need the same kind of thing for the Win/Novell side of the house, such as this disk-array replacement project we're doing right now.

The new EVAs are being paid for by Student Tech Fee, and not out of a general budget request. This is not how these devices should be funded, since the scope of this array is much wider than just student-related features. Unfortunately, STF is the only way we could get them funded, and we desperately need the new arrays. Without the new arrays, student service would be significantly impacted over the next fiscal year.

The problem is that the EVA3000 contains between 40-45% directly student-related storage. The other 55-60% is Fac/Staff storage. And yet, the EVA3000 was paid for by STF funds in 2003. Huh.

The summer of 2007 saw a Banner Upgrade Project, when the servers that support SCT Banner were upgraded. This was a quarter million dollar project and it happens every 5 years. They also got a disk-array upgrade to a pair of StorageTek (SUN, remember) arrays, DR replicated between our building and the DR site in Bond Hall. I believe they're using Solaris-level replication rather than Array-level replication.

The disk-array upgrade we're doing now got through the President's office just before the boom went down on big expensive purchases. It languished in the Purchasing department due to summer-vacation related under-staffing. I hate to think how late it would have gone had it been subjected to the added paperwork we now have to go through for any purchase over $1000. Under no circumstances could we have done it before Fall quarter. Which would have been bad, since we were too short to deal with the expected growth of storage for Fall quarter.

Now that we're going deep into the land of VMWare ESX, centralized storage-arrays are line of business. Without the STF funded arrays, we'd be stuck with "Departmental" and "Entry-level" arrays such as the much maligned MSA1500, or building our own iSCSI SAN from component parts (a DL385, with 2x 4-channel SmartArray controller cards, 8x MSA70 drive enclosures, running NetWare or Linux as an iSCSI target, with bonded GigE ports for throughput). Which would blow chunks. As it is, we're still stuck using SATA drives for certain 'online' uses, such as a pair of volumes on our NetWare cluster that are low usage but big consumers of space. Such systems are not designed for the workloads we'd have to subject them to, and are very poor performers when doing things like LUN expansions.

The EVA is exactly what we need to do what we're already doing for high-availability computing, yet is always treated as an exceptional budget request when it comes time to do anything big with it. Since these things are hella expensive, the budgetary powers-that-be balk at approving them and like to defer them for a year or two. We asked for a replacement EVA in time for last year's academic year, but the general-budget request got denied. For this year we went, IIRC, both with general-fund and STF proposals. The general fund got denied, but STF approved it. This needs to change.

By October, every person between and Governor Gregoir will be new. My boss is retiring in October. My grandboss was replaced last year, my great grand boss also has been replaced in the last year, and the University President stepped down on September 1st. Perhaps the new people will have a broader perspective on things and might permit the budget priorities to be realigned to the point that our disk-arrays are classified as the critical line-of-business investments they are.

Disk-array migrations done right

We have two new HP EVA systems. An EVA4400 with FATA drives that we'll be putting into our DR datacenter in Bond Hall, and upgrading our EVA3000 into an EVA6100 + 2 new enclosures. The 4400 is a brand new device, so is sitting idle right now (officially). It will be replacing the MSA1500 we purchased two years ago, and will fulfill the duties the MSA should have been doing but is too stupid to do.

We've set up the 4400 already, and as part of that we had to upgrade our CommandView version from the 4.something it was with the EVA3000 to CommandView 8. As a side effect of this, we lost licensing for the 3000 but that's OK since we're replacing that this weekend. I'm assuming the license codes for the 6100 are in the boxes the 6100 parts are in. We'll find that out Friday night, eh?

One of the OMG NICE things that comes with the new CommandView is a 60 day license for both ContinuousAccess EVA and BusinessCopy EVA. ContinuousAccess is the EVA to EVA replication software, and is the only way to go for EVA to EVA migrations. We started replicating LUNs on the 6100 to the 4400 on Monday, and they just got done replicating this morning. This way, if the upgrade process craters and we lose everything, we have a full block-level replica on the 4400. So long as we get it all done by 10/26/2008, which we should do.

On a lark we priced out what purchasing both products would cost. About $90,000, and that's with our .edu discount. That's a bit over half the price of the HARDWARE, which we had to fight tooth and nail to get approved in the first place. So. Not getting it for production.

But the 60 day license is the only way to do EVA to EVA migrations. In 5 years when the 6100 falls off of maintenance and we have to forklift replace a new EVA in, it will be ContinuousAccess EVA (eval) that we'll use to replicate the LUNs over to the new hardware. Then on migration date we'll shut everything down ("quiesce I/O"), make sure all the LUN presentations on the new array look good, break the replication groups, and rezone the old array out. Done! Should be a 30 minute outage.

Without the eval license it'd be a backup-restore migration, and that'd take a week.

EVA4400 + FATA

| 5 Comments
Some edited excerpts of internal reports I've generated over the last (looks at watch) week. The referenced testing operations involve either a single stream of writes, or two streams of writes in various configurations:
Key points I've learned:
  • The I/O controllers in the 4400 are able to efficiently handle more data than a single host can throw at it.
  • The FATA drives introduce enough I/O bottlenecks that multiple disk-groups yield greater gains than a single big disk-group.
  • Restripe operations do not cause anywhere near the problems they did on the MSA1500.
  • The 4400 should not block-on-write the way the MSA did, so the NetWare cluster can have clustered volumes on it.
The "Same LUN" test showed that Write speeds are about half that of the single threaded test, which gives about equal total throughput to disk. The Read speeds are roughly comperable, giving a small net increase in total throughput from disk. Again, not sure why. The Random Read tests continue to perform very poorly, though total throughput in parallel is better than the single threaded test.

The "Different LUN, same disk-group," test showed similar results to the "Same LUN" test in that Write speeds were about half of single threaded yielding a total Write throughput that closely matches single-threaded. Read speeds saw a difference, with significant increases in Read throughput (about 25%). The Random Read test also saw significant increases in throughput, about 37%, but still is uncomfortably small at a net throughput of 11 MB/s.

The "Different LUN, different disk-group," test did show some I/O contention. For Write speeds, the two writers showed speeds that were 67% and 75% of the single-threaded speeds, yet showed a total throughput to disk of 174 MB/s. Compare that with the fasted single-threaded Write speed of 130 MB/s. Read performance was similar, with the two readers showing speeds that were 90% and 115% of the single-threaded performance. This gave an aggregate throughput of 133 MB/s, which is significantly faster than the 113 MB/s turned in by the fastest Reader test.

Adding disks to a disk-group appears to not significantly impact Write speeds, but significantly impact Read speeds. The Read speed dropped from 28 MB/s to 15 MB/s. Again, a backup-to-disk operation wouldn't notice this sort of activity. The Random Read test showed a similar reduction in performance. As Write speeds were not affected by restripe, the sort of cluster hard-locks we saw with the MSA1500 on the NetWare cluster will not occur with the EVA4400.

And finally, a word about controller CPU usage. In all of my testing I've yet to saturate a controller, even during restripe operations. It was the restripe ops that killed the MSA, and the EVA doesn't seem to block nearly as hard. Yes, read performance is dinged, but not nearly to the levels that the MSA does. This is because the EVA keeps its cache enabled during restripe-ops, unlike the MSA.
One thing I alluded to in the above is that Random Read performance is rather bad. And yes, it is. Unfortunately, I don't yet know if this is a feature of testing methodology or what, but it is worrysome enough that I'm figuring it into planning. The fastest random-read speed turned in for a 10GB file, 64KB nibbles, came to around 11 MB/s. This was on a 32-disk disk-group on a Raid5 vdisk. Random Read is the test that closest approximates file-server or database loads, so it is important.

HP has done an excellent job tuning the caches for the EVA4400, which makes Write performance exceed Read performance in most cases. Unfortunately, you can't do the same reordering optimization tricks for Read access that you can for Writes, so Random Read is something of a worst-case scenario for these sorts of disks. HP's own documentation says that FATA drives should not be used for 'online' access such as file-servers or transactional databases. And it turns out they really meant that!

That said, these drives sequential write performance is excellent, making them very good candidates for Backup-to-Disk loads so long as fragmentation is constrained. The EVA4400 is what we really wanted two years ago, instead of the MSA1500.

Still no word on whether we're upgrading the EVA3000 to a EVA6100 this weekend, or next weekend. We should know by end-of-business today.

EVA4400 testing

Right before I left Friday I started a test on the EVA4400 with a 100GB file. This is the same file-size I configured DataProtector to use for the backup-to-disk files, so it's a good test size.

Sequential Write speed: 79,065 KB/s
Sequential Read speed: 52,107 KB/s

That's a VERY good number. The Write speed above is about the same speed as I got on the MSA1500 when running against a Raid0 volume, and this is a Raid5 volume on the 4400. The 10GB file-size test I did before this one I also watched the EVA performance on the monitoring server, and controller CPU during that time was 15-20% max. Also, it really used both controllers (thanks to MPIO).

Random Write speed: 46,427 KB/s
Random Read speed: 3,721 KB/s

Now we see why HP strongly recommends against using FATA drives for random I/O. For a file server that's 80% read I/O, it would be a very poor choice. This particular random-read test is worst-case, since a 100GB file can't be cached in RAM so this represents pure array performance. File-level caching on the server itself would greatly improve performance. The same test with a 512MB file turns in a random read number of 1,633,538 KB/s which represents serving the whole test in cache-RAM on the testing station itself.

This does suggest a few other tests:
  • As above, but two 100MB files at the same time on the same LUN
  • As above, but two 100MB files at the same time on different LUNs in the same Disk Group
  • As above, but two 100MB files at the same time on different LUNs in different Disk Groups

Disk-space over time

I've mentioned before that I do SNMP-based queries against NetWare and drop the resulting disk-usage data into a database. The current incarnation of this database went in August of 2004, so I have just over 4 years of data in it now. You can see some real trends in how we manage data in the charts.

To show you what I'm talking about, I'm going to post a chart based on the student-home-directory data. We have three home-directory volumes for students, which run between 7000-8000 home directories on them. We load-balance by number of directories rather than least-size. The chart:

Chart showing student home directory disk space usage, carved up by quarter.

As you can see, I've marked up our quarters. Winter/Spring is one segment on this chart since Spring Break is hard to isolate on these scales. We JUST started Winter 2008, so the last dot on the chart is data from this week. If you squint in (or zoom in like I can) you can see that last dot is elevated from the dot before it, reflecting this week's classes.

There are several sudden jumps on the chart. Fall 2005. Spring 2005. Spring 2007 was a big one. Fall 2007 just as large. These reflect student delete processes. Once a student hasn't been registered for classes for a specified period of time (I don't know what it is off hand, but I think 2 terms) their account goes on the 'ineligible' list and gets purged. We do the purge once a quarter except for Summer. The Fall purge is generally the biggest in terms of numbers, but not always. Sometimes the number of students purged is so small it doesn't show on this chart.

We do get some growth over the summer, which is to be expected. The only time when classes are not in session is generally from the last half of August to the first half of September. Our printing volumes are also w-a-y down during that time.

Because the Winter purge is so tiny, Winter quarter tends to see the biggest net-gain in used disk-space. Fall quarter's net-gain sometimes comes out a wash due to the size of that purge. Yet if you look at the slopes of the lines for Fall, correcting for the purge of course, you see it matches Winter/Spring.

Somewhere in here, and I can't remember where, we increased the default student directory-quota from 200MB to 500MB. We've found Directory Quotas to be a much better method of managing student directory sizes than User Quotas. If I remember my architectures right, directory quotas are only possible because of how NSS is designed.

If you take a look at the "Last Modified Times" chart in the Volume Inventory for one of the student home-directory volumes you get another interesting picture:
Chart showing the Last Modified Times for one student volume.
We have a big whack of data aged 12 months or newer. That said, we have non-trivial amounts of data aged 12 months or older. This represents where we'd get big savings when we move to OES2 and can use Dynamic Storage Technology (formerly known as 'shadowvolumes'). Because these are students and students only stick around for so long, we don't have a lot of stuff in the "older than 2 years" column that is very present on the Faculty/Staff volumes.

Being the 'slow, cheap,' storage device is a role well suited to the MSA1500 that has been plaguing me. If for some reason we fail to scare up funding to replace our EVA3000 with another EVA less filled-to-capacity, this could buy a couple of years of life on the EVA3000. Unfortunately, we can't go to OES2 until Novell ships an edirectory enabled AFP server for Linux, currently scheduled for late 2008 at the earliest.

Anyway, here is some insight into some of our storage challenges! Hope it has been interesting.
I've spoken before about my latency problems on the MSA1500cs. Since my last update I've spoken with Novell at length. Their own back-line HP people were thinking firmware issues to, and recommended I open another case with HP support. And if HP again tries to lay the blame on NetWare, to point their techs at the NetWare backline tech. Who will then have a talk about why exactly it is that NetWare isn't the problem in this case.

This time when I opened the case I mentioned that we see performance problems on the backup-to-disk server, which is Windows. Which is true, when the problem occurs B2D speeds drop through the floor; last Friday a 525GB backup that normally completes in 6 hours took about 50 hours. Since I'm seeing problems on more than one operating system, clearly this is a problem with the storage device.

The first line tech agreed, and escalated. The 2nd line tech said (paraphrased):
I'm seeing a lot of parity RAID LUNs out there. This sort of RAID uses CPU on the MSA1000 controllers, so the results you're seeing are normal for this storage system.
Which, if true, puts the onus of putting up with a badly behaved I/O system onto NetWare again. The tech went on to recommend RAID1 for the LUNs that need high performance when doing array operations that disable the internal cache. Which, as far as I can figure, would work. We're not bottlenecking on I/O to the physical disks, the bottleneck is CPU on the MSA1000 controller that's active. Going RAID1 on the LUNs would keep speeds very fast even when doing array operations.

That may be where we have to go with this. Unfortunately, I don't think we have 16TB of disk-drives available to fully mirror the cluster. That'll be a significant expense. So, I think we have some rethinking to do regarding what we use this device for.

I/O starvation on NetWare, HP update

Last week I talked about a problem we're having with the HP MSA1500cs and our NetWare cluster. The problem is still there, of course. I've opened cases with both HP and Novell to handle this one. HP because I really thing that such command latencies are a defect, and Novell since they're having starvation issues with clusters.

This morning I got a voice-mail from HP, an update for our case. Greatly summarized:
The MSA team has determined that your device is working perfectly, and can find no defects. They've referred the case to the NetWare software team.
Or...
Working as designed. Fix your software. Talk to Novell.
Which I'm doing. Now to see if I can light a fire on the back-channels, or if we've just made HP admit that these sorts of command latencies are part of the design and need to be engineered around in software. Highly frustrating.

Especially since I don't think I've made back-line on the Novell case yet. They're involved, but I haven't been referred to a new support engineer yet.