SysAdmin1138 Expounds

Thursday, February 11, 2010

Spending money

Today we spent more money in one day than I've ever seen done here. Why? Well substantiated rumor had it that the Governor had a spending freeze directive on her desk. Unlike last year's freeze, this one would be the sort passed down during the 2001-02 recession; nothing gets spent without OFM approval. Veterans of that era noted that such approval took a really long time, and only sometimes came. Office scuttle-butt was mum on whether or not consumable purchases like backup tapes would be covered.

We cut purchase orders today and rushed them through Purchasing. A Purchasing who was immensely snowed under, as can be well expected. I think final signatures get signed tomorrow.

What are we getting? Three big things:

A new LTO4 tape library. I try not to gush lovingly at the thought, but keep in mind I've been dealing with SDLT320 and old tapes. I'm trying not to let all that space go to my head. 2 drives, 40-50 slots, fibre attached. ~~Made of love~~. No gushing, no gushing...
Fast, cheap storage. Our EVA6100 is just too expensive to keep feeding. So we're getting 8TB of 15K fast storage. We needs it, precious.
Really cheap storage. Since the storage area networking options all came in above our stated price-point, we're ending up with direct-attached. Depending on how we slice it, between 30-35TB of it. Probably software ISCSI and all the faults inherent in the setup. We still need to dicker over software.

But... that's all we're getting for the next 15 months at least. Now when vendors cold call me I can say quite truthfully, "No money, talk to me in July 2011."

The last thing we have is an email archiving system. We already know what we want, but we're waiting on determination of whether or not we can spend that already ear-marked money.

Unfortunately, I'll be finding out a week from Monday. I'll be out of the office all next week. Bad timing for it, but can't be avoided.

Labels: backup, budget, storage, sysadmin

<permalink> posted by riedesg : 2/11/2010 07:36:00 PM 0 comments

Tuesday, February 02, 2010

Budget plans

Washington State has a $2.6 Billion deficit for this year. In fact, the finance people point out that if something isn't done the WA treasury will run dry some time in September and we'll have to rely on short-term loans. As this is not good, the Legislature is attempting to come up with some way to fill the hole.

As far as WWU is concerned, we know we'll be passed some kind of cut. We don't know the size, nor do we know what other strings may be attached to the money we do get. So we're planning for various sizes of cuts.

One thing that is definitely getting bandied about is the idea of 'sweeping' unused funds at end-of-year in order to reduce the deficits. As anyone who has ever worked in a department subject to a budget knows, the idea of having your money taken away from you for being good with your money runs counter to every bureaucratic instinct. I have yet to meet the IT department that considers themselves fully funded. My old job did that; our Fiscal year ended 12/15, which meant that we bought a lot of stuff in October and November with the funds we'd otherwise have to give back (a.k.a. "Christmas in October"). Since WWU's fiscal year starts 7/1, this means that April and May will become 'use it or lose it' time.

Sweeping funds is a great way to reduce fiscal efficiency.

In the end, what this means is that the money tree is actually producing at the moment. We have a couple of crying needs that may actually get addressed this year. It's enough to completely fix our backup environment, OR do some other things. We still have to dicker over what exactly we'll fix. The backup environment needs to be made better at least somewhat, that much I know. We have a raft of servers that fall off of cheap maintenance in May (i.e. they turn 5). We have a need for storage that costs under $5/GB but is still fast enough for 'online' storage (i.e. not SATA). As always, the needs are many, and the resources few.

At least we HAVE resources at the moment. It's a bad sign when you have to commiserate with your end-users over not being able to do cool stuff, or tell researchers they can't do that particular research since we have no where to store their data. Baaaaaad. We haven't quite gotten there yet, but we can see it from where we are.

Labels: backup, budget, storage, sysadmin

<permalink> posted by riedesg : 2/02/2010 09:56:00 AM 0 comments

Monday, January 25, 2010

Storage tiers

Events have pushed us to give a serious look at cheaper storage solutions. What's got our attention most recently is HP's new LeftHand products. That's some nice looking kit, there. But there was an exchange there that really demonstrated how the storage market has changed in the last two years:

HP: What kind disk are you thinking of?
US: Oh, probably mid tier. 10K SAS would be good enough.
HP: Well, SAS only comes in 15K, and the next option down is 7.2K SATA. And really, the entire storage market is moving to SAS.

Note the lack of Fibre Channel drives. Those it seems are being depreciated. Two years ago the storage tier looked like this:

SATA
SAS/SCSI
FC

Now the top end has been replaced.

SATA
SAS
SSD

We don't have anything that requires SSD-levels of performance. Our VMWare stack could run quite happily on sufficient SAS drives.

Back in 2003 when we bought that EVA3000 for the new 6 node NetWare cluster, clustering required shared storage. In 2003, shared storage meant one of two things:

SCSI and SCSI disks, if using 2 nodes.
Fibre Channel and FC Disks if using more than 2 nodes.

With 6 nodes in the cluster, Fibre Channel was our only choice. So that's what we have. Here we are 6+ years later, and our I/O loads are very much mid-tier. We don't need HPC-level I/O ops. CPU on our EVA controllers rarely goes above 20%. Our I/O is significantly randomized, so SATA is no good. But we need a lot of it, so SSDs become prohibitive. Therefore SAS is what we should be using if we buy new.

Now if only we had some LTO drives to back it all up.

Labels: storage, sysadmin

<permalink> posted by riedesg : 1/25/2010 06:25:00 PM 0 comments

Wednesday, January 13, 2010

NTFS and fragmentation

I've known for a while that filesystem fragmentation can seriously affect NTFS performance. This isn't just run of the mill, "frag means using random access patterns for what should be sequential I/O," performance degradation. I'm seeing it on volumes backed by EVA arrays which purposely randomize I/O to the disk spindles. Clearly something in the meta-data handling degrades significantly when frag gets to a certain point.

Today I found out why that is.

http://blogs.technet.com/askcore/archive/2009/10/16/the-four-stages-of-ntfs-file-growth.aspx

As the number of fragments increase, the MFT table has to track more and more fragments. Once the number of fragments exceeds how much can be stored directly into the MFT, it starts adding indirection layers to track the file extents.

If you have a, say, 10GB file on your backup-to-disk system, and that file has 50,000 fragments, you are absolutely at the 'stage 4' listed in that blog post. Meta-data operations on that file, such as tracking down the next extent to read from if you're doing a restore or copy, will be correspondingly more expensive than a 10GB file with 4 fragments. At the same time, attempting to write a large file that requires such massive fragmentation in turn requires a LOT more meta-data operations than a big-write on an empty filesystem.

And this, boys and girls, is why you really really really want to avoid large fragmentation on your NTFS-based backup-to-disk directories. Really.

Labels: storage

<permalink> posted by riedesg : 1/13/2010 10:53:00 AM 0 comments

Tuesday, January 12, 2010

The costs of backup upgrades

Our tape library is showing its years, and it's time to start moving the mountain required to get it replaced with something. So this afternoon I spent some quality time with google, a spread-sheet, and some oldish quotes from HP. The question I was trying to answer is what's the optimal mix of backup to tape and backup to disk using HP Data Protector. The results were astounding.

Data Protector licenses backup-to-disk capacity by the amount of space consumed in the B2D directories. You have 15TB parked in your backup-to-disk archives, you pay for 15TB of space.

Data Protector has a few licenses for tape libraries. They have costs for each tape drive over 2, another license for libraries with between 61-250 slots, and another license for unlimited slots. There is no license for fibre-attached libraries like BackupExec and others do.

Data Protector does not license per backed up host, which is theoretically a cost savings.

When all is said and done, DP costs about $1.50 per GB in your backup to disk directories. In our case the price is a bit different since we've sunk some of those costs already, but they're pretty close to a buck fiddy per GB for Data Protector licensing alone. I haven't even gotten to physical storage costs yet, this is just licensing.

Going with an HP tape library (easy for me to spec, which is why I put it into the estimates), we can get an LTO4-based tape library that should meet our storage growth needs for the next 5 years. After adding in the needed DP licenses, the total cost per GB (uncompressed, mind) is on the order of $0.10 per GB. Holy buckets!

Calming down some, taking our current backup volume and apportioning the price of largest tape library I estimated over that backup volume and the price rises to $1.01/GB. Which means that as we grow our storage, the price-per-GB drops as less of the infrastructure is being apportioned to each GB. That's a rather shocking difference in price.

Clearly, HP really really wants you to use their de-duplication features for backup-to-disk. Unfortunately for HP, their de-duplication technology has some serious deficiencies when presented with our environment so we can't use it for our largest backup targets.

But to answer the question I started out with, what kind of mix should we have, the answer is pretty clear. As little backup-to-disk space as we can get away with. The stuff has some real benefits, as it allows us to stage backups to disk and then copy to tape during the day. But for long term storage, tape is by far the more cost-effective storage medium. By far.

Labels: backup, stats, storage

<permalink> posted by riedesg : 1/12/2010 03:30:00 PM 0 comments

Friday, September 18, 2009

The end of the line for RAID?

Regarding this: http://www.enterprisestorageforum.com/technology/features/article.php/3839636

He has a point. Storage sizes are increasing faster than reliability figures, and the combination is a very bad thing for parity RAID. Size by itself means that large RAID sets will take a long time to rebuild. I ran into this directly with the MSA1500 I was working with a while back, where it would take a week (7 whole days!) to rework a 7TB disk-array. The same firmware also very strongly recommended against RAID5 LUNs on more than 7TB of of disks due to the non-recoverable read error rate on the SATA drives being used. RAID6 increase the durability of parity RAID, but at the cost of increased overhead.

Unfortunately, there are no clear answers. What you need really depends on what you're using it for. For very high performance storage where random I/O latency during high speed transfers are your prime performance metric, lots of cheap-ass SATA drives on randomized RAID1 pairs will probably not be enough to keep up. Data-retention archives where sequential write speeds are your prime metric is more forgiving and can take a much different storage architecture, even though it may involved an order of magnitude more space than the first option here.

One comment deserves attention, though:

The fact is that 20 years ago, a large chunk of storage was a 300MB ESDI drive for $1500, but now a large drive is hard to find above $200.

Well, for large hard drives that may be true but for medium size drives I can show you many options that break the $200 barrier. 450GB FC drives? Over $200 by quite a lot. Anything SSD-Enterprise? Over $200 by a lot, and the 'large drive' segment is at an order of magnitude over that.

We're going to see some interesting storage architectures in the near future. That much is for sure.

Labels: storage

<permalink> posted by riedesg : 9/18/2009 12:27:00 PM 0 comments

Thursday, September 10, 2009

Lemonade

Two days ago (but it seems longer) the drive that holds my VM images started vomiting bad sectors. Even more unfortunately, one of the bad sectors took out the MFT clusters on my main Win XP management VM. So far that's the only data-loss, but it's a doozy. I said unto my manager, "Help, for I have no VM drive any more, and am woe." Meanwhile I evacuated what data I could. Being what passes for a Storage Administrator around here, finding the space was dead easy.

Yesterday bossman gave me a 500GB Western Digital drive and I got to work restoring service. This drive has Native Command Queueing, unlike the now-dead 320GB drive. I didn't expect that to make much of a difference, but it has. My Vista VMs (undamaged) run noticibly faster now. "iostat -x" shows await times markedly lower than they were before when running multiple VMs.

NCQ isn't the kind of feature that generally speeds up desktop performance, but in this case it does. Perhaps lots of VM's are a 'server' type load afterall.

Labels: storage, virtualization

<permalink> posted by riedesg : 9/10/2009 01:38:00 PM 0 comments

Friday, August 28, 2009

Fabric merges

When doing a fabric merge with Brocade gear, when they say that the Zone configuration needs to be exactly the same on both switches, they mean that. The merge process does no parsing, it just compares the zone config. If the metaphorical diff returns anything it doesn't merge. So if one zone has a swapped order of two nodes but is otherwise identical, it'll not merge.

Yes, this is very conservative. And I'm glad for it, since failure here would have brought down our ESX cluster and that's a very wince-worthy collection of highly visible services. But it took a lot of hacking to get the config on the switch I'm trying to merge into the fabric to be exactly right.

Labels: storage, sysadmin, virtualization

<permalink> posted by riedesg : 8/28/2009 10:37:00 AM 0 comments

Tuesday, August 11, 2009

Changing the CommandView SSL certificate

One of the increasingly annoying things that IT shops have to put up with is web based administration portals using self-signed SSL certificates. Browsers are increasingly making this setup annoying, and for a good reason. Which is why I try and get these pages signed with a real key if they allow me to.

HP's Command View EVA administration portal annoyingly overwrites the custom SSL files when it does an upgrade. So you'll have to do this every time you apply a patch or otherwise update your CV install.

Generate a SSL certificate with the correct data.
Extract the certificate into base-64 form (a.k.a. PEM format) in separate 'certificate' and 'private key' files.
On your command view server overwrite the %ProgramFiles%\Hewlett-Packard\sanworks\Element Manager for StorageWorks HSV\server.cert file with the 'certificate' file
Overwrite the %ProgramFiles%\Hewlett-Packard\sanworks\Element Manager for StorageWorks HSV\server.pkey file with the 'private key' file
Restart the CommandView service

At that point, CV should be using your generated certificates. Keep these copied somewhere else on the server so you can quickly copy them back in when you update Command View.

Labels: hp, security, storage, sysadmin

<permalink> posted by riedesg : 8/11/2009 02:52:00 PM 0 comments

Friday, July 24, 2009

Migrations

We'll be getting our hardware early next week. Still no ETA on the storage we need to manage things.

Also, with Server 2008R2 being released on the 14th, that allows us to pick which OS we want to run. Microsoft just made it under the bar for us to consider that rev for fall quarter. Server 2008 R2 has a lot of new stuff in it, such as even more PowerShell (yes, it really is all that and a bag of chips) stuff, and PowerShell 2.0! "Printer Driver Isolation," should help minimize spooler faults, something I fear already even though I'm not running Windows Printing in production yet. Built in MPIO gets improvements as well, something that is very nice to have.

And yet... if we don't get it in time, we may not do it for the file serving cluster. Especially if pcounter doesn't work with the rebuilt printing paths for some reason. Won't know until we get stuff.

But once we get a delivery date for the storage pieces, we can start talking actual timelines beyond, "we hope to get it all done by fall start." Like, which weekends we'll need to block out to babysit the data migrations.

Labels: microsoft, printing, storage

<permalink> posted by riedesg : 7/24/2009 01:52:00 PM 0 comments

Wednesday, June 24, 2009

Why ext4 is very interesting

The new ext4 file-system is one I'm very interested in, especially since the state of reiserfs in the Linux community's gestalt is so in flux. Unlike ext3, it does a lot of things nicer:

Large directories. It can now handle very large directories. One source says unlimited directory sizes, and another less reliable source says sizes up to 64,000 files. This is primo for, say, a mailer. GroupWise would be able to run on this, if one was doing that.
Extent-mapped files. More of a feature for file-systems that contain a lot of big files rather than tens of millions of 4k small files. This may be an optional feature. Even so, it replicates an XFS feature which makes ext4 a top contender for the MythTV and other linux-based PVR products. Even SD video can be 2GB/hour, HD can be a lot larger.
Delayed allocation. Somewhat controversial, but if you have very high faith in your power environment, it can make for some nice performance gains. It allows the system to hold off on block allocation until a process either finishes writing, or triggers a dirty-cache limit, which further allows a decrease in potential file-frag and helps optimize disk I/O to some extent. Also, for highly transient files, such as mail queue files, it may allow some files to never actually hit disk between write and erase, which further increases performance. On the down side, in case of sudden power loss, delayed writes are simply lost completely.
Persistent pre-allocation. This is a feature XFS has that allows a process to block out a defined size of disk without actually writing to it. For a PVR application, the app could pre-allocate a 1GB hunk of disk at the start of recording and be assured that it would be as contiguous as possible. Certain bittorrent clients 'preallocate' space for downloads, though some fake it by writing 4.4GB of zeros and overwriting the zeros with the true data as it comes down. This would allow such clients to truly pre-allocate space in a much more performant way.
Online defrag. On traditional rotational media, and especially when there isn't a big chunk of storage virtualization between the server and the disk such as that provided by a modern storage array, this is much nicer. This is something else that XFS had, and ext is now picking up. Going back to the PVR again, if that recording pre-allocated 1GB of space and the recording is 2 hours long, at 2GB/hour it'll need a total of 4 GB. In theory each 1GB chunk could be on a completely diffrent part of the drive. An online defrag allows that file to be made contiguous without file-system downtime.

SSD: A bad idea for SSD media, where you want to conserve your writes, and doesn't suffer a performance penalty for frag. Hopefully, the efstools used to make the filesystem will be smart enough to detect SSDs and set features appropriately.
Storage Arrays: Not a good idea for extensive disk arrays like the HP EVA line. These arrays virtualize storage I/O to such an extent that file fragmentation doesn't cause the same performance problems as would a simple 6-disk RAID5 array would experience.

Higher resolution time-stamps. This doesn't yet have much support anywhere, but it allows timestamps with nanosecond resolution. This may sound like a strange thing to put into a 'nifty features' list, but this does make me go "ooo!".

Also, as I mentioned before, it looks like openSUSE 11.2 will make ext4 the default filesystem.

Labels: linux, opensuse, storage

<permalink> posted by riedesg : 6/24/2009 03:06:00 PM 0 comments

Friday, March 20, 2009

Storage that makes you think

Anandtech has a nice article up right now that compares SAS, SATA, and SSD drives in a database environment. Go read it. I'll wait.

While the bulk of the article is about how much the SSD drives blow the pants off of rotational magnetic media, the charts show how SAS performs versus SATA. As they said at the end of the article:

Our testing also shows that choosing the "cheaper but more SATA spindles" strategy only makes sense for applications that perform mostly sequential accesses. Once random access comes into play, you need two to three times more SATA drives - and there are limits to how far you can improve performance by adding spindles.

Which matches my experience. SATA is great for sequential loads, but is bottom of the pack when it comes to random I/O. In a real world example, take this MSA1500CS we have. It has SATA drives in it

If you have a single disk group with 14 1TB drives in it, this gives a theoretical maximum capacity of 12.7TB (that storage industry TB vs OS TB problem again). Since you can only have LUNs as large as 2TB due to the 32-bit signed integer problem, this would mean this disk group would have to be carved into 7 LUNs. So how do you go about getting maximum performance from this set up?

You'll have to configure a logical volume on your server such that each LUN appends to the logical volume in order, and then make sure your I/O writes (or reads) sequentially across the logical volume. Since all 7 LUNs are on the same physical disks, any out-of-order arrangement of LUN on that spanned logical volume would result in semi-random I/O and throughput would drop. Striping the logical volume just ensures that every other access requires a significant drive-arm move, and would seriously drop throughput. It is for this reason that HP doesn't recommend using SATA drives in 'online' applications.

Another thing in the article that piqued my interest is there on page 11. This is where they did a test of various data/log volume combinations between SAS and SSD. The conclusion they draw is interesting, but I want to talk about it:

Transactional logs are written in a sequential and synchronous manner. Since SAS disks are capable of delivering very respectable sequential data rates, it is not surprising that replacing the SAS "log disks" with SSDs does not boost performance at all.

This is true, to a point. If you have only one transaction log, this is very true. If you put multiple transaction logs on the same disk, though, SSD becomes the much better choice. They did not try this configuration. I would have liked to have seen a test like this one:

Three Data volumes running on SAS drives
One Log volume running on an SSD with all three database logs on it

I'm willing to bet that the performance of the above would match, if not exceed, running three separate log volumes running on SAS.

The most transactional database in my area is probably Exchange. If we were able to move the Logs to SSD's, we very possibly could improve performance of those databases significantly. I can't prove it, but I suspect we may have some performance issues in that database.

And finally, it does raise the question of file-system journals. If I were to go out and buy a high quality 16GB SSD for my work-rig, I could use that as an external journal for my SATA-based filesystems. As it is an SSD, running multiple journals on it should be no biggie. Plus, offloading the journal-writes should make the I/O on the SATA drives just a bit more sequential and should improve speeds. But would it even be perceptible? I just don't know.

Labels: exchange, msa, storage

<permalink> posted by riedesg : 3/20/2009 10:58:00 AM 0 comments

Tuesday, February 10, 2009

Bad graphs

One of my pet peeves about the Brocade performance graphs is the bad legend. Take this example:

Performance graph showing GB/Gb inconsistency

The line across the bottom is clearly labeled "Gb/sec". Following the rule of, "big B is byte, little b is bit," this should mean gigabit, not gigabyte. Port 13 is shoving data at 45.9Mb/sec speeds, and is on a 2Gb port.

This is wrong. Port 13 is a NetWare server where I'm running a TSATEST virtual backup against a locally mounted volume. The reported data-rate was 2203 MB/Minute, or 36.7 MB/Second, or 293 Mb/sec. The difference between the reported data rate of 36.7 and the Brocade rate of 45.9 probably has to do with caching, but I'm not 100% certain of that.

More generally, I've seen these charts when I've been pushing data at 140MB/Second (a RAID0 volume, I was doing a speed test) which is over 1Gb/s. As it happens, that was the first time I'd ever seen a Fibre Channel port exceed 1Gb, so it was nifty to see. But the performance reported by the Brocade performance monitor was 140 Mb/s. Clearly, that legend on the bottom is mis-labeled.

I can't figure out why that would be, though. The fastest port we have in the system are those 4Gb ports, and their max data-rate would be 512 MB/s. A speed that wouldn't even reach "1.2" on their chart. I don't get that.

Labels: storage

<permalink> posted by riedesg : 2/10/2009 12:48:00 PM 0 comments

Wednesday, February 04, 2009

The mystery of lsimpe.cdm

Last night I turned on multi-path support for the main NetWare file cluster. This has been a long time coming. When we upgraded the EVA3000 to an EVA6100 it gained the ability to do active/active IO on the controllers, something that the new EVA4400 can also do.

What's more, the two Windows backup-to-disk servers we've attached to the EVA4400 (and the MSA1500 for that matter) have the HP MPIO drivers installed, which are extensions of the Microsoft MPIO stack. Looking at the bandwidth chart on the fiber-channel fabric I see that these Windows servers are also doing load balancing over both of the paths. This is nifty! Also, when I last updated the XCS code on the EVA4400 both of those servers didn't even notice the controller reboots. EVEN NICER!

I want to do the same thing with NetWare. On the surface, turning on MPIO support is dead easy:

Startup.ncf file:

SET MULTI-PATH SUPPORT = ON
LOAD QL2X00.HAM SLOT=10001 /LUNS /ALLPATHS /PORTNAMES

Tada. Reboot, present both paths in your zoning, and issue the "list failover devices" command on the console, and you'll get a list. In theory should one go away, it'll seamlessly move over to the other.

But what it won't do is load-balance. Unfortunately, the documentation on NetWare's multi-path support is rather scanty, focusing more on configuring path failover priority. The fact that the QL2X00.HAM driver itself can do it all on its own without letting NetWare know (the "allpaths" and "portnames" options tell it to not do that and let NetWare do the work) is a strong hint that MPIO is a fairly light weight protocol.

On the support forums you'll get several references to the LSIMPE.CDM file. With interesting phrases like, "that's the multipath driver", and, "Yeah, it isn't well documented." The data on the file itself is scanty, but suggestive:

LSIMPE.CDM
Loaded from [C:\NWSERVER\DRIVERS\] on Feb 4, 2009   3:32:13 am
(Address Space = OS)
LSI Multipath Enhancer
Version 1.02.02 September 5, 2006
Copyright 1989-2006 Novell, Inc.

But the exact details of what it does remain unclear. One thing I do know, it doesn't do the load-balancing trick.

Labels: clustering, netware, novell, storage

<permalink> posted by riedesg : 2/04/2009 01:48:00 PM 0 comments

Tuesday, February 03, 2009

More on DataProtector 6.10

We've had DP6.10 installed for several weeks now and have some experience with it. Yesterday I configured a Linux Installation Server so I can push agents out to Linux hosts without having to go through the truly horrendous install process that DP6.00 forced you to do when not using an Installation Server. This process taught me that DataProtector grew up in the land of UNIX, not Linux.

One of the new features of DP6.10 is that they now have a method for pushing backup agents to Linux/HP-UX/Solaris hosts over SSH. This is very civilized of them. It uses public key and the keychain tool to make it workable.

The DP6.00 method involved tools that make me cringe. Like rlogin/rsh. These are just like telnet in that the username and password is transmitted over the wire in the clear. For several years now we've had a policy in place that states that protocols that require cleartext transmission of security principles like this are not to be used. We are not alone in this. I am very happy HP managed to get DP updated to a 21st century security posture.

Last Friday we also pointed DP at one of our larger volumes on the 6-node cluster. Backup rates from that volume blew our socks off! It pulled data at about 1600GB/Minute (a hair under 27MB/Second). For comparison, SDL320's native transfer rate (the drive we have in our tape library, which DP isn't attached to yet) is 16MB/Second. Considering the 1:1.2 to 1:1.4 compression ratios typical of this sort of file data, the max speed it can back up is still faster than tape.

The old backup software didn't even come close to these speeds, typically running in the 400MB/Min range (7MB/Sec). The difference is that the old software is using straight up TSA, where DP is using an agent. This is the difference an agent makes!

Labels: backup, hp, storage

<permalink> posted by riedesg : 2/03/2009 09:48:00 AM 0 comments

Tuesday, January 06, 2009

DataProtector 6.00 vs 6.10

A new version of HP DataProtector is out. One of the nicest new features is that they've greatly optimized the object/session copy speeds.

No matter what you do for a copy, DataProtector will have to read all of one Disk Media (50GB by default) to do the copy. So if you multiplex 6 backups into one Disk Writer device, it'll have to look through the entire media for the slices it needs. If you're doing a session copy, it'll copy the whole session. But object copies have to be demuxed.

DP6.00 did not handle this well. Consistently, each Data Reader device consumed 100% of one CPU for a speed of about 300 MB/Minute. This blows serious chunks, and is completely unworkable for any data-migration policy framework that takes the initial backup to disk, then spools the backup to tape during daytime hours.

DP6.10 does this a lot better. CPU usage is a lot lower, it no longer pegs one CPU at 100%. Also, network speeds vary between 10-40% of GigE speeds (750 to 3000 MB/Minute), which is vastly more reasonable. DP6.10, unlike DP6.00, can actually be used for data migration policies.

Labels: backup, benchmarking, hp, storage, sysadmin

<permalink> posted by riedesg : 1/06/2009 09:21:00 AM 0 comments

Tuesday, December 09, 2008

The price of storage

I've had cause to do the math lately, which I'll spare you :). But as of the best numbers I have, the cost of 1GB of space on the EVA6100 is about $16.22. Probably more, since this 6100 was created out of the carcass of an EVA3000, and I don't know what percentage of parts from the old 3000 are still in the 6100 and thus can't apportion the costs right.

For the EVA4400, which we have filled with FATA drives, the cost is $3.03.

Suddenly, the case for Dynamic Storage Technology (formerly known as Shadow Volumes) in OES can be made in economic terms. Yowza.

The above numbers do not include backup rotation costs. Those costs can vary from $3/GB to $15/GB depending on what you're doing with the data in the backup rotation.

Why is the cost of the EVA6100 so much greater than the EVA4400?

The EVA6100 uses 10K RPM 300GB FibreChannel disks, where the the EVA4400 uses 7.2K RPM 1TB (or is it 450GB?) FATA drives. The cost-per-gig on FC is vastly higher than it is on fibre-ATA.
Most of the drives in the EVA6100 were purchased back when 300GB FC drives cost over $2000 each.
The EVA6100 controller and cabinets just plain cost more than the EVA4400, because it can expand farther.

To put it into a bit of perspective, lets take the example of a 1TB volume of, "unorganized file data", the seemingly official term for "file-server". If you place that 1TB of data on the EVA6100, that data consumes $16609.28 worth of storage. So what if 70% of that data hasn't been modified in a year (not unreasonable), and is then put on the EVA4400 instead? So you'd have 307GB on the 6100 and 717GB on the 4400. Your storage cost now drops to $5909.75. That's real money.

Labels: hp, novell, OES, shadowvolumes, storage

<permalink> posted by riedesg : 12/09/2008 01:30:00 PM 0 comments

Monday, November 10, 2008

NetStorage, WebDav, and Vista

I figured out how to get it working! You need KB907306. This updates the Web Folders in Vista to support how Novell does WebDav through NetStorage.

In our case you'll also need to add the CA that serves the SSL certificate that's on top of NetStorage (a.k.a. MyFiles). But, it works.

Labels: microsoft, netware, novell, OES, storage

<permalink> posted by riedesg : 11/10/2008 04:22:00 PM 0 comments

Friday, November 07, 2008

Alarming error notices

CommandView EVA can throw some alarming notices to the unwary.

Take this example.

Without knowing what's going on, the very first line of that is panic-inducing. However, this notice was generated after I clicked the "continue deletion" button in CV-EVA after we had a disk-failure scrozz a trio of RAID0 LUNs I had been using for pure testing purposes. So, while expected, it did cause a rush to my office to see what just went wrong.

Labels: hp, storage

<permalink> posted by riedesg : 11/07/2008 02:45:00 PM 0 comments

Wednesday, October 22, 2008

An old theme made new

Yesterday on Slashdot was a link to an article that sounds a lot like one I published two years ago tomorrow. The main point in the article is that due to the unrecoverable-read-error rate in your standard SATA drive (10^14 bits, or 12.5TB), and the ever increasing sizes of SATA drives means that Raid 5 arrays can get to 12.5TB pretty quickly. Heck, high-end home media servers chock full of HD content can get there very fast.

While it doesn't say this in the specs page for that new Seagate drive, if you look on page 18 of the accompanying manual you can see the "Nonrecoverable read error" rate of the same 10^14 as I talked about two years ago. So, no improvement in reliability. However.... For their enterprise-class "Savvio" drives, they list a "Nonrecoverable Read Error" rate of 10^16 (1 in 1.25PB), which is better than the 10^15 (125TB) they were doing two years ago on their FC disks. So clearly, enterprise users are juuuust fine for large RAID5 arrays.

As I said before, the people who are going to be bitten by this will be home media servers. Also, whiteboxed homebrew servers for small/medium businesses will be at risk. So those of you who have to justify buying the really expensive disks, when there are el-cheepo 1.5TB drives out there? You can use this!

Labels: sata, storage, sysadmin

<permalink> posted by riedesg : 10/22/2008 12:42:00 PM 0 comments

Tuesday, October 07, 2008

Erm, about the budget

From an email sent to all points from the U President this afternoon:

In the OFM spreadsheets received today, we were stunned to find that targets had been set for higher education. Western, today, is now expected, from the sorts of measures outlined in the August 4 memorandum, to "save" $1,827,000 in the current fiscal year. (This major reduction applies across all budgets, including instructional budgets.)

Add that to the earlier number, and our total budget reduction is NOT the $176,000 representing 1% of non-instructional budgets. It is $2,003,682.

Aaaaaand...

Further, we have been advised to expect these reductions to be permanent; that is, to also be a part of our 2009-11 budget.

Pardon me whilst I mutter things.

This means that it is nearly certain that we will NOT be getting any new hardware for the Novell cluster next summer. We'll have to do it on hardware we already own right now. This means I won't be able to partake of that lovely 64-bit goodness. Drat drat drat.

We're already under-funded for where we need to be, this won't help. Even with the storage arrays we just bought, in terms of total disk-space we've managed to fully commit all of it. There is no excess capacity. What's more, there is no easy way to ADD new capacity since any significant amounts will require purchasing new storage shelves.

In the intermediate term, this means that WWU will now descend into bureaucratic charge-back warfare. As service-providing departments like ours try to find ways to finance the needed growth, we'll start being hard-ass about charging for exceptional services. And they'll do it to us too. So if the College of Arts and Sciences comes to us and asks us for space to host 2TB of, say, NASA data, we'll have to bill them for it. And that cost will be a 'total cost' which will by necessity include the backup costs. In return, if we need 16 ethernet jacks added to the AC datacenter, Telecom may start billing us.

And I get a new boss Thursday. Happily, since there is overlap between outgoing and incoming they've been briefing a lot. This is to prepare the new guy for the challenges he'll face in his first few weeks flying solo. There may even be the odd phone-call for advice, we'll see.

Gonna get real interesting around here.

Labels: budget, novell, OES, storage

<permalink> posted by riedesg : 10/07/2008 09:24:00 PM 0 comments

Thursday, October 02, 2008

MSA performance in the new config

Today I reconfigured the MSA1500 to run in Active/Active mode. While there, I also rearranged our disk arrays. We have 41, 500GB, 7.2K RPM drives in there. I created two, 20 disk Arrays, and filled each array with Raid 0+1 LUNs. This yielded 9TB of useful space. That extra drive will stay extra until we get an odd number of new drives.

Yes, a profligate waste of space but at least it'll be fast. It also had the added advantage of not needing to stripe in like Raid5 or Raid6 would have. This alone saved us close to two weeks flow time to get it back into service.

Another benefit to not using a parity RAID is that the MSA is no longer controller-CPU bound for I/O speeds. Right now I have a pair of writes, each effectively going to a separate controller, and the combined I/O is on the order of 100Mbs while controller CPU loads are under 80%. Also, more importantly, Average Command Latency is still in the 20-30ms range.

The limiting factor here appears to be how fast the controllers can commit I/O to the physical drives, rather than how fast the controllers can do parity-calcs. CPU not being saturated suggests this, but a "show perf physical" on the CLI shows the queue depth on individual drives:

The drives with a zero are associated with LUNs being served by the other controller, and thus not listed here. But a high queue depth is a good sign of I/O saturation on the actual drives themselves. This is encouraging to me, since it means we're finally, finally, after two years, getting the performance we need out of this device. We had to go to an active/active config with a non-parity RAID to do it, but we got it.

Labels: benchmarking, msa, storage, sysadmin

<permalink> posted by riedesg : 10/02/2008 03:54:00 PM 0 comments

Wednesday, September 24, 2008

Fickle fortune

I lost a RAID card in one of my Beta servers. Crap. These beasties are all old beasties since that's the only hardware that could be released for the beta. And with crap servers, comes a crap failure rate. This is the second RAID card I've lost, and I've lost one hard-drive too. It isn't common to lose more RAID cards than hard-drives. Arrg.

This puts a kink into things. This was going to be an edirectory host, so I could host my replicas on one set of servers and abuse the crap out of the non-replica application servers. I may have to dual host. Icky icky.

Labels: novell, OES, storage

<permalink> posted by riedesg : 9/24/2008 02:38:00 PM 0 comments

Tuesday, September 23, 2008

That darned 32-bit limit

Today I learned that the disk-space counters NetWare provides in SNMP use signed integers for its disk-space monitoring. These are stats published at a table at OID .1.3.6.1.4.1.23.2.28.2.14.1.3. Having just expanded our FacShare volume past 2TB, it went negative-space according to the monitors. A simple integer overflow since apparently Novell is using a signed integer for a number that can never be legitimately negative.

I've pointed this out on an enhancement request. This being NetWare, they may not chose to fix it if it is more than a two-line fix. We'll see.

This also means that volumes over 4TB can not be effectively monitored with SNMP. Since NSS can have up to 8TB volumes on NetWare, this could potentially be a problem. We're not there yet.

Labels: netware, sata, storage

<permalink> posted by riedesg : 9/23/2008 03:40:00 PM 0 comments

Friday, September 19, 2008

Monitoring ESX datacenter volume stats

A long while back I mentioned I had a perl script that we use to track certain disk space details on my NetWare and Windows servers. That goes into a database, and it can make for some pretty charts. A short while back I got asked if I could do something like that for the ESX datacenter volumes.

A lot of googling later I found how to turn on the SNMP daemon for an ESX host, and a script or two to publish the data I need by SNMP. It took some doing, but it ended up pretty easy to do. One new perl script, the right config for snmpd on the ESX host, setting the ESX host's security policy to permit SNMP traffic, and pointing my gathering script at the host.

The perl script that gathers the local information is very basic:

#!/usr/bin/perl -w

use strict;
my $partition = ".";
my $partmaps = ".";
my $vmfsvolume = "\Q/vmfs/volumes/$ARGV[0]\Q";
my $vmfsfriendly = $ARGV[1];
my $capRaw = 0;
my $capBlock = 0;
my $blocksize = 0;
my $freeRaw = 0;
my $freeBlock = 0;
my $freespace= "";
my $totalspace= "";
open("Y", "/usr/sbin/vmkfstools -P $vmfsvolume|");
while () {
        if (/Capacity ([0-9]*).*\(([0-9]*).* ([0-9]*)\), ([0-9]*).*\(([0-9]*).*a
vail/) {
                $capRaw = $1;
                $capBlock = $2;
                $blocksize = $3;
                $freeRaw = $4;
                $freeBlock = $5;
                $freespace = $freeBlock;
                $totalspace = $capBlock;
                $blocksize = $blocksize/1024;
                #print ("1 = $1\n2 = $2\n3 = $3\n4 = $4\n5 = $5\n");
                print ("$vmfsfriendly\n$totalspace\n$freespace\n$blocksize\n");
        }
}

Then append the /etc/snmp/snmp.conf file with the following lines (in my case):

exec .1.3.6.1.4.1.6876.99999.2.0 vmfsspace /root/bin/vmfsspace.specific 48cb2cbc
-61468d50-ed1f-001cc447a19d Disk1

exec .1.3.6.1.4.1.6876.99999.2.1 vmfsspace /root/bin/vmfsspace.specific 48cb2cbc
-7aa208e8-be6b-001cc447a19d Disk2

The first parameter after exec is the OID to publish. The script returns an array of values, one element per line, that are assigned to .0, .1, .2 and on up. I'm publishing the details I'm interested in, which may be different than yours. That's the 'print' line in the script.

The script itself lives in /root/bin/ since I didn't know where better to put it. It has to have execute rights for Other, though.

The big unique-ID looking number is just that, a UUID. It is the UUID assigned to the VMFS volume. The VMFS volumes are multi-mounted between each ESX host in that particular cluster, so you don't have to worry about chasing the node that has it mounted. You can find the number you want by logging in to the ESX host on the SSH console, and doing a long directory on the /vmfs/volumes folder. The friendly name of your VMFS volume is symlinked to the UUID. The UUID is what goes in to the snmp.conf file.

The last parameter ("Disk1" and "Disk2" above) is the friendly name of the volume to publish over SNMP. As you can see, I'm very creative.

These values are queried by my script and dropped into the database. Since the ESX datacenter volumes only get space consumed when we provision a new VM or take a snapshot, the graph is pretty chunky rather than curvy like the graph I linked to earlier. If VMware ever changes how the vmfstools command returns data, this script will break. But until then, it should serve me well.

Labels: stats, storage, sysadmin, virtualization

<permalink> posted by riedesg : 9/19/2008 10:23:00 AM 0 comments

Moving storage around

The EVA6100 went in just fine with that one hitch I mentioned, and now comes all the work we need to do now that we have actual space again. We're still arguing over how much space to add to which volumes, but once we decide all but Blackboard will be very easy to add.

Blackboard needs more space on both the SQL server and the Content server, and as the Content server is clustered it'll require an outage to manage the increase. And it'll be a long outage, as 300GB of weensy files takes a LONG time to copy. The SQL server uses plain old Basic partitions, so I don't think we can expand that partition, so we may have to do another full LUN copy which will require an outage. That has yet to be scheduled, but needs to happen before we get through much of the quarter.

Over on the EVA4400 side, I'm evacuating data off of the MSA1500cs onto the 4400. Once I'm done with that, I'm going to be:

Rebuilding all of the Disk Arrays.
Creating LUNs expressly for Backup-to-Disk functionality.
Flashing the Active/Active firmware on to it, the 7.00 firmware rev.
Get the two Backup servers installed with the right MPIO widgetry to take advantage of active/active on the MSA>

But first we need the DataProtector licensing updates to beat its way through the forest of paperwork and get ordered. Otherwise, we can't use more than 5TB of disk, and that's WAY wimpy. I need at LEAST 20, and preferably 40TB. Once that licensing is in place, we can finally decommission the out-of-license BackupExec server and use the 6 slot tape library with DataProtector instead. This should significantly increase how much data we can throw at backup devices during our backup window.

What has yet to be fully determined is exactly how we're going to use the 4400 in this scheme. I expect to get between 15-20TB of space out of the MSA once I'm done with it, and we have around 20TB on the 4400 for backup. Which is why I'd really like that 40TB license please.

Going Active/Active should do really good things for how fast the MSA can throw data at disk. As I've proven before the MSA is significantly CPU bound for I/O to parity LUNs (Raid5 and Raid6), so having another CPU in the loop should increase write throughput significantly. We couldn't do Active/Active before since you can only do Active/Active in a homogeneous OS environment, and we had Windows and NetWare pointed at the MSA (plus one non-production Linux box).

In the mean time, I watch progress bars. TB of data takes a long time to copy if you're not doing it at the block level. Which I can't.

Labels: backup, clustering, msa, storage, sysadmin

<permalink> posted by riedesg : 9/19/2008 10:23:00 AM 0 comments

Sunday, September 14, 2008

EVA6100 upgrade a success

Friday night four HP tech arrived to put together the EVA6100 from a pile of parts and the existing EVA3000. It took them 5 hours to get it to the point where we could power-on and see if all of our data was still there (it was, yay), and a few hours after that on our behalf to put everything back together.

There was only one major hitch for the night, which meant I got to bed around 6am Saturday morning instead of 4am.

For EVA, and probably all storage systems, you present hosts to them and selectively present LUNs to those hosts. These host-settings need to have an OS configured for them, since each operating system has its own quirks for how it likes to see its storage. While the EVA6100 has a setting for 'vmware', the EVA3000 did not. Therefore, we had to use a 'custom' OS setting and a 16 digit hex string we copied off of some HP knowledge-base article. When we migrated to the EVA6100 it kept these custom settings.

Which, it would seem, don't work for the EVA6100. It caused ESX to whine in such a way that no VMs would load. It got very worrying for a while there, but thanks to an article on vmware's support site and some intuition we got it all back without data loss. I'll probably post what happened and what we did to fix it in another blog post.

The only service that didn't come up right was secure IMAP for Exchange. I don't know why it decided to not load. My only theory is that our startup sequence wasn't right. Rebooting the HubCA servers got it back.

Labels: clustering, exchange, hp, storage, sysadmin

<permalink> posted by riedesg : 9/14/2008 09:53:00 PM 0 comments

Wednesday, September 10, 2008

That darned budget

This is where I whine about not having enough money.

It has been a common complaint amongst my co-workers that WWU wants enterprise level service for a SOHO budget. Especially for the Win/Novell environments. Our Solaris stuff is tied in closely to our ERP product, SCT Banner, and that gets big budget every 5 years to replace. We really need the same kind of thing for the Win/Novell side of the house, such as this disk-array replacement project we're doing right now.

The new EVAs are being paid for by Student Tech Fee, and not out of a general budget request. This is not how these devices should be funded, since the scope of this array is much wider than just student-related features. Unfortunately, STF is the only way we could get them funded, and we desperately need the new arrays. Without the new arrays, student service would be significantly impacted over the next fiscal year.

The problem is that the EVA3000 contains between 40-45% directly student-related storage. The other 55-60% is Fac/Staff storage. And yet, the EVA3000 was paid for by STF funds in 2003. Huh.

The summer of 2007 saw a Banner Upgrade Project, when the servers that support SCT Banner were upgraded. This was a quarter million dollar project and it happens every 5 years. They also got a disk-array upgrade to a pair of StorageTek (SUN, remember) arrays, DR replicated between our building and the DR site in Bond Hall. I believe they're using Solaris-level replication rather than Array-level replication.

The disk-array upgrade we're doing now got through the President's office just before the boom went down on big expensive purchases. It languished in the Purchasing department due to summer-vacation related under-staffing. I hate to think how late it would have gone had it been subjected to the added paperwork we now have to go through for any purchase over $1000. Under no circumstances could we have done it before Fall quarter. Which would have been bad, since we were too short to deal with the expected growth of storage for Fall quarter.

Now that we're going deep into the land of VMWare ESX, centralized storage-arrays are line of business. Without the STF funded arrays, we'd be stuck with "Departmental" and "Entry-level" arrays such as the much maligned MSA1500, or building our own iSCSI SAN from component parts (a DL385, with 2x 4-channel SmartArray controller cards, 8x MSA70 drive enclosures, running NetWare or Linux as an iSCSI target, with bonded GigE ports for throughput). Which would blow chunks. As it is, we're still stuck using SATA drives for certain 'online' uses, such as a pair of volumes on our NetWare cluster that are low usage but big consumers of space. Such systems are not designed for the workloads we'd have to subject them to, and are very poor performers when doing things like LUN expansions.

The EVA is exactly what we need to do what we're already doing for high-availability computing, yet is always treated as an exceptional budget request when it comes time to do anything big with it. Since these things are hella expensive, the budgetary powers-that-be balk at approving them and like to defer them for a year or two. We asked for a replacement EVA in time for last year's academic year, but the general-budget request got denied. For this year we went, IIRC, both with general-fund and STF proposals. The general fund got denied, but STF approved it. This needs to change.

By October, every person between and Governor Gregoir will be new. My boss is retiring in October. My grandboss was replaced last year, my great grand boss also has been replaced in the last year, and the University President stepped down on September 1st. Perhaps the new people will have a broader perspective on things and might permit the budget priorities to be realigned to the point that our disk-arrays are classified as the critical line-of-business investments they are.

Labels: budget, clustering, exchange, hp, linux, msa, netware, novell, OES, opinion, storage, sysadmin, virtualization

<permalink> posted by riedesg : 9/10/2008 10:50:00 AM 0 comments

Disk-array migrations done right

We have two new HP EVA systems. An EVA4400 with FATA drives that we'll be putting into our DR datacenter in Bond Hall, and upgrading our EVA3000 into an EVA6100 + 2 new enclosures. The 4400 is a brand new device, so is sitting idle right now (officially). It will be replacing the MSA1500 we purchased two years ago, and will fulfill the duties the MSA should have been doing but is too stupid to do.

We've set up the 4400 already, and as part of that we had to upgrade our CommandView version from the 4.something it was with the EVA3000 to CommandView 8. As a side effect of this, we lost licensing for the 3000 but that's OK since we're replacing that this weekend. I'm assuming the license codes for the 6100 are in the boxes the 6100 parts are in. We'll find that out Friday night, eh?

One of the OMG NICE things that comes with the new CommandView is a 60 day license for both ContinuousAccess EVA and BusinessCopy EVA. ContinuousAccess is the EVA to EVA replication software, and is the only way to go for EVA to EVA migrations. We started replicating LUNs on the 6100 to the 4400 on Monday, and they just got done replicating this morning. This way, if the upgrade process craters and we lose everything, we have a full block-level replica on the 4400. So long as we get it all done by 10/26/2008, which we should do.

On a lark we priced out what purchasing both products would cost. About $90,000, and that's with our .edu discount. That's a bit over half the price of the HARDWARE, which we had to fight tooth and nail to get approved in the first place. So. Not getting it for production.

But the 60 day license is the only way to do EVA to EVA migrations. In 5 years when the 6100 falls off of maintenance and we have to forklift replace a new EVA in, it will be ContinuousAccess EVA (eval) that we'll use to replicate the LUNs over to the new hardware. Then on migration date we'll shut everything down ("quiesce I/O"), make sure all the LUN presentations on the new array look good, break the replication groups, and rezone the old array out. Done! Should be a 30 minute outage.

Without the eval license it'd be a backup-restore migration, and that'd take a week.

Labels: hp, msa, storage, sysadmin

<permalink> posted by riedesg : 9/10/2008 09:03:00 AM 0 comments

Wednesday, September 03, 2008

EVA4400 + FATA

Some edited excerpts of internal reports I've generated over the last (looks at watch) week. The referenced testing operations involve either a single stream of writes, or two streams of writes in various configurations:

Key points I've learned:
The I/O controllers in the 4400 are able to efficiently handle more data than a single host can throw at it.
The FATA drives introduce enough I/O bottlenecks that multiple disk-groups yield greater gains than a single big disk-group.
Restripe operations do not cause anywhere near the problems they did on the MSA1500.
The 4400 should not block-on-write the way the MSA did, so the NetWare cluster can have clustered volumes on it.
The "Same LUN" test showed that Write speeds are about half that of the single threaded test, which gives about equal total throughput to disk. The Read speeds are roughly comperable, giving a small net increase in total throughput from disk. Again, not sure why. The Random Read tests continue to perform very poorly, though total throughput in parallel is better than the single threaded test.

The "Different LUN, same disk-group," test showed similar results to the "Same LUN" test in that Write speeds were about half of single threaded yielding a total Write throughput that closely matches single-threaded. Read speeds saw a difference, with significant increases in Read throughput (about 25%). The Random Read test also saw significant increases in throughput, about 37%, but still is uncomfortably small at a net throughput of 11 MB/s.

The "Different LUN, different disk-group," test did show some I/O contention. For Write speeds, the two writers showed speeds that were 67% and 75% of the single-threaded speeds, yet showed a total throughput to disk of 174 MB/s. Compare that with the fasted single-threaded Write speed of 130 MB/s. Read performance was similar, with the two readers showing speeds that were 90% and 115% of the single-threaded performance. This gave an aggregate throughput of 133 MB/s, which is significantly faster than the 113 MB/s turned in by the fastest Reader test.

Adding disks to a disk-group appears to not significantly impact Write speeds, but significantly impact Read speeds. The Read speed dropped from 28 MB/s to 15 MB/s. Again, a backup-to-disk operation wouldn't notice this sort of activity. The Random Read test showed a similar reduction in performance. As Write speeds were not affected by restripe, the sort of cluster hard-locks we saw with the MSA1500 on the NetWare cluster will not occur with the EVA4400.

And finally, a word about controller CPU usage. In all of my testing I've yet to saturate a controller, even during restripe operations. It was the restripe ops that killed the MSA, and the EVA doesn't seem to block nearly as hard. Yes, read performance is dinged, but not nearly to the levels that the MSA does. This is because the EVA keeps its cache enabled during restripe-ops, unlike the MSA.

One thing I alluded to in the above is that Random Read performance is rather bad. And yes, it is. Unfortunately, I don't yet know if this is a feature of testing methodology or what, but it is worrysome enough that I'm figuring it into planning. The fastest random-read speed turned in for a 10GB file, 64KB nibbles, came to around 11 MB/s. This was on a 32-disk disk-group on a Raid5 vdisk. Random Read is the test that closest approximates file-server or database loads, so it is important.

HP has done an excellent job tuning the caches for the EVA4400, which makes Write performance exceed Read performance in most cases. Unfortunately, you can't do the same reordering optimization tricks for Read access that you can for Writes, so Random Read is something of a worst-case scenario for these sorts of disks. HP's own documentation says that FATA drives should not be used for 'online' access such as file-servers or transactional databases. And it turns out they really meant that!

That said, these drives sequential write performance is excellent, making them very good candidates for Backup-to-Disk loads so long as fragmentation is constrained. The EVA4400 is what we really wanted two years ago, instead of the MSA1500.

Still no word on whether we're upgrading the EVA3000 to a EVA6100 this weekend, or next weekend. We should know by end-of-business today.

Labels: benchmarking, hp, msa, storage, sysadmin

<permalink> posted by riedesg : 9/03/2008 03:11:00 PM 0 comments

Tuesday, September 02, 2008

EVA4400 testing

Right before I left Friday I started a test on the EVA4400 with a 100GB file. This is the same file-size I configured DataProtector to use for the backup-to-disk files, so it's a good test size.

Sequential Write speed: 79,065 KB/s
Sequential Read speed: 52,107 KB/s

That's a VERY good number. The Write speed above is about the same speed as I got on the MSA1500 when running against a Raid0 volume, and this is a Raid5 volume on the 4400. The 10GB file-size test I did before this one I also watched the EVA performance on the monitoring server, and controller CPU during that time was 15-20% max. Also, it really used both controllers (thanks to MPIO).

Random Write speed: 46,427 KB/s
Random Read speed: 3,721 KB/s

Now we see why HP strongly recommends against using FATA drives for random I/O. For a file server that's 80% read I/O, it would be a very poor choice. This particular random-read test is worst-case, since a 100GB file can't be cached in RAM so this represents pure array performance. File-level caching on the server itself would greatly improve performance. The same test with a 512MB file turns in a random read number of 1,633,538 KB/s which represents serving the whole test in cache-RAM on the testing station itself.

This does suggest a few other tests:

As above, but two 100MB files at the same time on the same LUN
As above, but two 100MB files at the same time on different LUNs in the same Disk Group
As above, but two 100MB files at the same time on different LUNs in different Disk Groups

Labels: benchmarking, hp, msa, storage

<permalink> posted by riedesg : 9/02/2008 09:32:00 AM 0 comments

Friday, August 29, 2008

Storage update

Both the EVA4400 and the EVA6100 parts arrived late Wednesday. Wednesday I got the EVA4400 partially unboxed, and finished it up Thursday. Got CommandView upgraded so we could manage the EVA4400, and thus lost licensing for the EVA3000. The 10/26 expiry date for that license is no problem, as the EVA3000 will become an EVA6100 well before then. Next weekend if the stars align right.

And today we schlepped the whole EVA4400 to the Bond Hall datacenter.

And now I'm pounding the crap out of it to make sure it won't melt under the load we intend to put on it. These are FATA disks, which we've never used so we need to figure it out. We're not as concerned with the 6100 since that's FC disks, and they've been serving us just fine for years.

Also on the testing list, making sure MPIO works the way we expect it to.

Labels: benchmarking, storage

<permalink> posted by riedesg : 8/29/2008 04:04:00 PM 0 comments

Wednesday, August 27, 2008

Woot!

The EVAs are scheduled to deliver today! This means that we are very probably going to be taking almost every IT system we have down starting late Friday 9/5 and going until we're done. We have a meeting in a few minutes to talk strategy.

There was some fear that the gear wouldn't get here in time for the 9/5 window. The 9/12 window has one of the key, key people needed to handle the migration in Las Vegas for VMWorld, and he won't be back until 9/21 which also screws with the 9/19 window. The 9/19 window is our last choice, since that weekend is move-in weekend and the outage will be vastly more noticeable with students around. Being able to make the 9/5 window is great! We need these so badly that if we didn't get the gear in time, we'd have probably done it 9/12 even without said key player.

The one hitch is if HP can't do 9/5-6 for some reason. Fret. Fret.

Labels: clustering, storage, sysadmin, virtualization

<permalink> posted by riedesg : 8/27/2008 08:50:00 AM 0 comments

Monday, August 25, 2008

Dynamic partitions in Server 2008 and Cluster

It would seem, and I've yet to trace down definitive proof of this, that Windows Server 2008 Clustering still has the Basic Partitioning dependency in it. This limits Windows LUNs to 2TB, among other annoyances. Such as the fact that resizing one of those puppies requires a full copy onto a larger LUN rather than extending the one you already have. How... 1999.

Labels: clustering, microsoft, opinion, storage, sysadmin

<permalink> posted by riedesg : 8/25/2008 03:52:00 PM 0 comments

Email sizes

The question has been raised internally that perhaps we need to reassess what we've set for email message-size limits. When we set our current limit, we did it to the apparent defacto standard for mail size limits, which is about 10 meg.

This, perhaps, is not what it should be for an institution of higher-ed where research is performed. We have certain researchers on campus that routinely play with datasets larger than 10MB, sometimes significantly larger. And these researchers would like to electronically distribute these datasets to other researchers, and the easiest means of doing that by far is email. The primary reason we have mail-servers serving the (for example) chem.wwu.edu domain is to have these folk with much larger message size limits. Otherwise, these folk would have their primary email in Exchange.

The routine answer I've heard for handling really large file sizes is to use, "alternate means," to send the file. We don't have a FTP server for staff use, since we have a policy that forbids the use of unauthenticated protocols for transmitting passwords and things. We could do something like Novell does with ftp.novell.com/incoming and create a drop-box that anyone with a WWU account can read, but that's sort of a blunt-force solution and by definition half of a half-duplex method. Our researchers would like a full duplex method, and email represents that.

So what are you all using for email size limits? Do you have any 'out of band' methods (other than snail mail) for handling larger data sizes?

Labels: exchange, storage, sysadmin

<permalink> posted by riedesg : 8/25/2008 11:49:00 AM 0 comments

Tuesday, June 24, 2008

Backing up NSS, note for the future

According to this documentation, the storing of NSS/NetWare metadata in xattrs is turned off by default. You turn it on for OES2 servers through the "nss /ListXattrNWMetadata" command. This allows linux level utilities (i.e. cp, tar) to be able to access and copy the NSS metadata. This also allows backup software that isn't SMS enabled for OES2 to be able to backup the NSS information.

This is handy, as HP DataProtector doesn't support NSS backup on Linux. I need to remember this.

Labels: backup, netware, novell, NSS, OES, storage

<permalink> posted by riedesg : 6/24/2008 09:10:00 AM 0 comments

Monday, June 16, 2008

A good article on trustees

Over on the Novell Cool Solutions site, Marcel Cox just posted an article about how Trustees are handled on the Novell Filesystems (TFS and NFS). If you wanted to know the fundamentals of how ACLs are done on NSS volumes and how it relates to eDirectory, this is a good start.

Labels: coolsolutions, edir, netware, novell, NSS, storage, sysadmin

<permalink> posted by riedesg : 6/16/2008 12:55:00 PM 0 comments

Wednesday, May 14, 2008

NetWare and Xen

Here is something I didn't really know about in virtualized NetWare:

Guidelines for using NSS in a virtual environment

Towards the bottom of this document, you get this:

Configuring Write Barrier Behavior for NetWare in a Guest Environment

Write barriers are needed for controlling I/O behavior when writing to SATA and ATA/IDE devices and disk images via the Xen I/O drivers from a guest NetWare server. This is not an issue when NetWare is handling the I/O directly on a physical server.

The XenBlk Barriers parameter for the SET command controls the behavior of XenBlk Disk I/O when NetWare is running in a virtual environment. The setting appears in the Disk category when you issue the SET command in the NetWare server console.

Valid settings for the XenBlk Barriers parameter are integer values from 0 (turn off write barriers) to 255, with a default value of 16. A non-zero value specifies the depth of the driver queue, and also controls how often a write barrier is inserted into the I/O stream. A value of 0 turns off XenBlk Barriers.

A value of 0 (no barriers) is the best setting to use when the virtual disks assigned to the guest server’s virtual machine are based on physical SCSI, Fibre Channel, or iSCSI disks (or partitions on those physical disk types) on the host server. In this configuration, disk I/O is handled so that data is not exposed to corruption in the event of power failure or host crash, so the XenBlk Barriers are not needed. If the write barriers are set to zero, disk I/O performance is noticeably improved.

Other disk types such as SATA and ATA/IDE can leave disk I/O exposed to corruption in the event of power failure or a host crash, and should use a non-zero setting for the XenBlk Barriers parameter. Non-zero settings should also be used for XenBlk Barriers when writing to Xen LVM-backed disk images and Xen file-backed disk images, regardless of the physical disk type used to store the disk images.

Nice stuff there! The "xenblk barriers" can also have an impact on the performance of your virtualized NetWare server. If your I/O stream runs the server out of cache, performance can really suffer if barriers are non-zero. If it fits in cache, the server can reorder the I/O stream to the disks to the point that you don't notice the performance hit.

So, keep in mind where your disk files are! If you're using one huge XFS partition and hosting all the disks for your VM-NW systems on that, then you'll need barriers. If you're presenting a SAN LUN directly to the VM, then you'll need to "SET XENBLK BARRIERS = 0", as they're set to 16 by default. This'll give you better performance.

Labels: benchmarking, netware, novell, NSS, OES, storage, virtualization

<permalink> posted by riedesg : 5/14/2008 10:31:00 AM 0 comments

Monday, May 12, 2008

DataProtector 6 has a problem, continued

I posted last week about DataProtector and its Enhanced Incremental Backup. Remember that "enhincrdb" directory I spoke of? Take a look at this:

See? This is an in-progress count of one of these directories. 1.1 million files, 152MB of space consumed. That comes to an average file-size of 133 bytes. This is significantly under the 4kb block-size for this particular NTFS volume. On another server with a longer serving enhincrdb hive, the average file-size is 831 bytes. So it probably increases as the server gets older.

On the up side, these millions of weensy files won't actually consume more space for quite some time as they expand into the blocks the files are already assigned to. This means that fragmentation on this volume isn't going to be a problem for a while.

On the down side, it's going to park (in this case) 152MB of data on 4.56GB of disk space. It'll get better over time, but in the next 12 months or so it's still going to be horrendous.

This tells me two things:

When deciding where to host the enhincrdb hive on a Windows server, format that particular volume with a 1k block size.
If HP supported NetWare as an Enhanced Incremental Backup client, the 4kb block size of NSS would cause this hive to grow beyond all reasonable proportions.

Some file-systems have real problems dealing with huge numbers of files in a single directory. Ext3 is one of these, which is why the b-tree hashed indexes were introduced. Reiser does better in this case out of the box. NSS is pretty good about this, as all GroupWise installs before GW became available for non-NetWare platforms created this situation by the sheer design of GW. Unlike NSS, ext3 and reiser have the ability of being formatted with different block-sizes, which makes creating a formatted file-system to host the enhincrdb data easier to correctly engineer.

Since it is highly likely that I'll be using DataProtector for OES2 systems, this is something I need to keep in mind.

Labels: backup, hp, netware, NSS, storage, sysadmin

<permalink> posted by riedesg : 5/12/2008 11:00:00 AM 0 comments

Wednesday, May 07, 2008

DataProtecter 6 has a problem

We're moving our BackupExec environment to HP DataProtector. Don't ask why, it made sense at the time.

Once of the niiiice things about DP is what's called, "Enhanced Incremental Backup". This is a de-duplication strategy, that only backs up files that have changed, and only stores the changed blocks. From these incremental backups you can construct synthetic full backups, which are just pointer databases to the blocks for that specified point-in-time. In theory, you only need to do one full backup, keep that backup forever, do enhanced incrementals, then periodically construct synthetic full backups.

We've been using it for our BlackBoard content store. That's around... 250GB of file store. Rather than keep 5 full 275GB backup files for the duration of the backup rotation, I keep 2 and construct synthetic fulls for the other 3. In theory I could just go with 1, but I'm paranoid :). This greatly reduces the amount of disk-space the backups consume.

Unfortunately, there is a problem with how DP does this. The problem rests on the client side of it. In the "$InstallDir$\OmniBack\enhincrdb" directory it constructs a file hive. An extensive file hive. In this hive it keeps track of file state data for all the files backed up on that server. This hive is constructed as follows:

The first level is the mount point. Example: enhincrdb\F\
The 2nd level are directories named 00-FF which contain the file state data itself

On our BlackBoard content store, it had 2.7 million files in that hive, and consumed around 10.5GB of space. We noticed this behavior when C: ran out of space. Until this happened, we've never had a problem installing backup agents to C: before. Nor did we find any warnings in the documentation that this directory could get so big.

The last real full backup I took of the content store backed up just under 1.7 million objects (objects = directory entries in NetWare, or inodes in unix-land). Yet the enhincrdb hive had 2.7 million objects. Why the difference? I'm not sure, but I suspect it was keeping state data for 1 million objects that no longer were present in the backup. I have trouble believing that we managed to churn over 60% of the objects in the store in the time I have backups, so I further suspect that it isn't cleaning out state data from files that no longer have a presence in the backup system.

DataProtector doesn't support Enhanced Incrementals for NetWare servers, only Windows and possibly Linux. Due to how this is designed, were it to support NetWare it would create absolutely massive directory structures on my SYS: volumes. The FACSHARE volume has about 1.3TB of data in it, in about 3.3 million directory entries. The average FacStaff User volume (we have 3) has about 1.3 million, and the average Student User volume has about 2.4 million. Due to how our data works, our Student user volumes have a high churn rate due to students coming and going. If FACSHARE were to share a cluster node with one Student user volume and one FacStaff user volume, they have a combined directory-entry count of 7.0 million directory entries. This would generate, at first, a \enhincrdb directory with 7.0 million files. Given our regular churn rate, within a year it could easily be over 9.0 million.

When you move a volume to another cluster node, it will create a hive for that volume in the \enhincrdb directory tree. We're seeing this on the BlackBoard Content cluster. So given some volumes moving around, and it is quite conceivable that each cluster node will have each cluster volume represented in its own \enhincrdb directory. Which will mean over 15 million directory-entries parked there on each SYS volume, steadily increasing as time goes on taking who knows how much space.

And as anyone who has EVER had to do a consistency check of a volume that size knows (be it vrepair, chkdsk, fsck,or nss /poolrebuild), it takes a whopper of a long time when you get a lot of objects on a file-system. The old Traditional File System on NetWare could only support 16 million directory entries, and DP would push me right up to that limit. Thank heavens NSS can support w-a-y more then that. You better hope that the file-system that the \enhincrdb hive is on never has any problems.

But, Enhanced Incrementals only apply to Windows so I don't have to worry about that. However.... if they really do support Linux (and I think they do), then when I migrate the cluster to OES2 next year this could become a very real problem for me.

DataProtector's "Enhanced Incremental Backup" feature is not designed for the size of file-store we deal with. For backing up the C: drive of application servers or the inetpub directory of IIS servers, it would be just fine. But for file-servers? Good gravy, no! Unfortunately, those are the servers in most need of de-dup technology.

Labels: backup, hp, netware, NSS, storage, sysadmin

<permalink> posted by riedesg : 5/07/2008 12:44:00 PM 0 comments

Wednesday, April 30, 2008

Legal processes

Yesterday we received a Litigation Hold request. For those of you who don't know, this is the order given as part of a lawsuit ordering us to take steps to preserve data that could be used as part of the Discovery process of the suit. This is something that is becoming more and more common these days.

Our department has been pretty lucky so far. Since I started here in late 2003 this is the first Litigation Hold request we've had to deal with. We've had a few "public records requests" come through which are handled similarly, but this is the first one involving data that may be introduced under sworn testimony.

This morning we had an article pointed out to us by the Office of Finance Management at the state. WWU is a State agency, so OFM is in our chain of bureaucracy.

Case Law/Rule Changes Thrust Electronic Document Discovery into the Spotlight.

It's an older PDF, but it does give a high level view of the sorts of things we should be doing when these requests come in. One of the things that we don't have any processes for are the sequestration of held data and chain of custody preservation. We are now building those.

Guideline #4 has the phrase, "Consultants are particularly useful in this role," referring to overseeing the holding process and standing up before a court to testify that the data was handled correctly. This is very true! Trained professionals are the kind of people to know the little nuances that hostile lawyers can use to invalidate gathered evidence. Someone who has done a lot of reading and been to a few SANS classes is not that person.

Just because it is possible to self represent yourself in court as your own lawyer, doesn't make it a good idea. In fact, it generally is a very bad idea. Same thing applies to the above phrase. You want someone who knows what the heck they're doing when they climb up there onto the witness stand.

This is going to be an interesting learning experience.

Labels: storage, sysadmin

<permalink> posted by riedesg : 4/30/2008 08:58:00 AM 0 comments

Thursday, March 20, 2008

BrainShare Thursday

Not a good day. My first course, "Advanced BASH," could more accurately be described as, "BASH scripting tips & tricks". I then proceeded to skip the other three sessions I had signed up for.

Novell Open Enterprise Server 2 Interoperability with Windows and AD. All about Domain Services for Windows and Samba. Neither of which we'll ever use. No idea why I wanted to be in this session.
Rapid Deployment of ZENworks Configuration Management. Other people around here have suggested that if we haven't moved yet, wait until at least SP3 before moving. If then. So, demotivated. Plus I was rather tired.
Configuring Samba on OES2. CIFS will do what we need, I don't need Samba. Don't need this one. Skipped.

DL236: Advanced BASH Course
BASH tips and tricks. I got a lot out of it, but the developers around me were quietly derisive.

ZEN Overview and Features
Not so much with the futures, but it did explain Novell's overall ZEN strategy. It isn't a coincidence that most of Novell's recent purchases have been for ZEN products.

TUT303: OES2 Clusters, from beginning to extremes
This was great. They had a full demo rig, and they showed quite a bit in it. Including using Novell Cluster Services to migrate Xen VM's around. They STRONGLY recommended using AutoYast to set up your cluster nodes to ensure they are simply identical except for the bits you explicitly want different (hostname, IP). And also something else I've heard before, you want one LUN for each NSS Pool. Really. Plus, the presenters were rather funny. A nice cap for the day.

And tonight, Meet the Experts!

Labels: brainshare, clustering, linux, novell, NSS, OES, storage, zenworks

<permalink> posted by riedesg : 3/20/2008 04:05:00 PM 0 comments

Friday, January 11, 2008

Disk-space over time

I've mentioned before that I do SNMP-based queries against NetWare and drop the resulting disk-usage data into a database. The current incarnation of this database went in August of 2004, so I have just over 4 years of data in it now. You can see some real trends in how we manage data in the charts.

To show you what I'm talking about, I'm going to post a chart based on the student-home-directory data. We have three home-directory volumes for students, which run between 7000-8000 home directories on them. We load-balance by number of directories rather than least-size. The chart:

Chart showing student home directory disk space usage, carved up by quarter.

As you can see, I've marked up our quarters. Winter/Spring is one segment on this chart since Spring Break is hard to isolate on these scales. We JUST started Winter 2008, so the last dot on the chart is data from this week. If you squint in (or zoom in like I can) you can see that last dot is elevated from the dot before it, reflecting this week's classes.

There are several sudden jumps on the chart. Fall 2005. Spring 2005. Spring 2007 was a big one. Fall 2007 just as large. These reflect student delete processes. Once a student hasn't been registered for classes for a specified period of time (I don't know what it is off hand, but I think 2 terms) their account goes on the 'ineligible' list and gets purged. We do the purge once a quarter except for Summer. The Fall purge is generally the biggest in terms of numbers, but not always. Sometimes the number of students purged is so small it doesn't show on this chart.

We do get some growth over the summer, which is to be expected. The only time when classes are not in session is generally from the last half of August to the first half of September. Our printing volumes are also w-a-y down during that time.

Because the Winter purge is so tiny, Winter quarter tends to see the biggest net-gain in used disk-space. Fall quarter's net-gain sometimes comes out a wash due to the size of that purge. Yet if you look at the slopes of the lines for Fall, correcting for the purge of course, you see it matches Winter/Spring.

Somewhere in here, and I can't remember where, we increased the default student directory-quota from 200MB to 500MB. We've found Directory Quotas to be a much better method of managing student directory sizes than User Quotas. If I remember my architectures right, directory quotas are only possible because of how NSS is designed.

If you take a look at the "Last Modified Times" chart in the Volume Inventory for one of the student home-directory volumes you get another interesting picture:

Chart showing the Last Modified Times for one student volume.

We have a big whack of data aged 12 months or newer. That said, we have non-trivial amounts of data aged 12 months or older. This represents where we'd get big savings when we move to OES2 and can use Dynamic Storage Technology (formerly known as 'shadowvolumes'). Because these are students and students only stick around for so long, we don't have a lot of stuff in the "older than 2 years" column that is very present on the Faculty/Staff volumes.

Being the 'slow, cheap,' storage device is a role well suited to the MSA1500 that has been plaguing me. If for some reason we fail to scare up funding to replace our EVA3000 with another EVA less filled-to-capacity, this could buy a couple of years of life on the EVA3000. Unfortunately, we can't go to OES2 until Novell ships an edirectory enabled AFP server for Linux, currently scheduled for late 2008 at the earliest.

Anyway, here is some insight into some of our storage challenges! Hope it has been interesting.

Labels: msa, netware, NSS, shadowvolumes, stats, storage

<permalink> posted by riedesg : 1/11/2008 09:47:00 AM 0 comments

Monday, January 07, 2008

I/O starvation on NetWare, another update

I've spoken before about my latency problems on the MSA1500cs. Since my last update I've spoken with Novell at length. Their own back-line HP people were thinking firmware issues to, and recommended I open another case with HP support. And if HP again tries to lay the blame on NetWare, to point their techs at the NetWare backline tech. Who will then have a talk about why exactly it is that NetWare isn't the problem in this case.

This time when I opened the case I mentioned that we see performance problems on the backup-to-disk server, which is Windows. Which is true, when the problem occurs B2D speeds drop through the floor; last Friday a 525GB backup that normally completes in 6 hours took about 50 hours. Since I'm seeing problems on more than one operating system, clearly this is a problem with the storage device.

The first line tech agreed, and escalated. The 2nd line tech said (paraphrased):

I'm seeing a lot of parity RAID LUNs out there. This sort of RAID uses CPU on the MSA1000 controllers, so the results you're seeing are normal for this storage system.

Which, if true, puts the onus of putting up with a badly behaved I/O system onto NetWare again. The tech went on to recommend RAID1 for the LUNs that need high performance when doing array operations that disable the internal cache. Which, as far as I can figure, would work. We're not bottlenecking on I/O to the physical disks, the bottleneck is CPU on the MSA1000 controller that's active. Going RAID1 on the LUNs would keep speeds very fast even when doing array operations.

That may be where we have to go with this. Unfortunately, I don't think we have 16TB of disk-drives available to fully mirror the cluster. That'll be a significant expense. So, I think we have some rethinking to do regarding what we use this device for.

Labels: hp, msa, netware, novell, storage

<permalink> posted by riedesg : 1/07/2008 09:54:00 AM 0 comments

Friday, December 28, 2007

NetWare and Hyperthreading, again

I talked about this one last year during some benchmarking. I just had a classic case crop up today.

It has long been consensus in the support forums that, (paraphrased) "If you have hyperthreading turned on and get an I/O thread stuck on a logical process, woe be unto you."

I have a server that I've been backing up for a fellow admin in another department. This particular server has 525GB of storage to back up, so it's going to take some time. It has been vexing figuring this one out. Until today, when I finally twigged to the fact that this server has HT turned on. I turn HT off as almost the first thing I do when setting up a server, so I don't think about it when troubleshooting.

Between 1000 and 1215 today, the backup got 882MB of data. Yeah, very crappy.

At 1215 I turned off the logical processors. This is a handy feature NetWare has, and I used it in the article I linked above.

At 1222 when I checked back the backup was up to 4.0GB.

At 1417 it is now up to 71GB backed up.

The only thing that changed was me turning off the logical processors. That's it. At that rate, this server should be backed up in around 15 hours, which is a far cry from the 30+ hours it was doing before.

Turn Hyperthreading off on your NetWare servers. Just do it.

Labels: hyperthreading, netware, novell, storage

<permalink> posted by riedesg : 12/28/2007 02:17:00 PM 0 comments

Wednesday, November 28, 2007

I/O starvation on NetWare, HP update

Last week I talked about a problem we're having with the HP MSA1500cs and our NetWare cluster. The problem is still there, of course. I've opened cases with both HP and Novell to handle this one. HP because I really thing that such command latencies are a defect, and Novell since they're having starvation issues with clusters.

This morning I got a voice-mail from HP, an update for our case. Greatly summarized:

The MSA team has determined that your device is working perfectly, and can find no defects. They've referred the case to the NetWare software team.

Or...

Working as designed. Fix your software. Talk to Novell.

Which I'm doing. Now to see if I can light a fire on the back-channels, or if we've just made HP admit that these sorts of command latencies are part of the design and need to be engineered around in software. Highly frustrating.

Especially since I don't think I've made back-line on the Novell case yet. They're involved, but I haven't been referred to a new support engineer yet.

Labels: clustering, hp, msa, netware, novell, OES, storage, sysadmin

<permalink> posted by riedesg : 11/28/2007 02:45:00 PM 0 comments

Wednesday, November 21, 2007

I/O starvation on NetWare

The MSA1500cs we've had for a while has shown a bad habit. It is visible when you connect a serial cable to the management port on the MSA1000 controller, and doing a "show perf" after starting performance tracking. The line in question is "Avg Command Latency:", which is a measure of how long it takes to execute an I/O operation. Under normal circumstances this metric stays between 5-30ms. When things go bad, I've seen it as far as 270ms.

This is a problem with our cluster nodes. Our cluster nodes can seen LUNs on both the MSA1500cs and the EVA3000. The EVA is where the cluster has been housed since it started, and the MSA has taken up two low-I/O-volume volumes to make space on the EVA.

IF the MSA is in the high Avg Command Latency state, and
IF a cluster node is doing a large Write to the MSA (such as a DVD ISO image, or B2D operation),
THEN "Concurrent Disk Requests" in Monitor go north of 1000

This is a dangerous state. If this particular cluster node is housing some higher trafficked volumes, such as FacShare:, the laggy I/O is competing with regular (fast) I/O to the EVA. If this sort of mostly-Read I/O is concurrent with the above heavy Write situation it can cause the cluster node to not write to the Cluster Partition on time and trigger a poison-pill from the Split Brain Detector. In short, the storage heart-beat to the EVA (where the Cluster Partition lives) gets starved out in the face of all the writes to the laggy MSA.

Users definitely noticed when the cluster node was in such a heavy usage state. Writes and Reads took a loooong time on the LUNs hosted on the fast EVA. Our help-desk recorded several "unable to map drive" calls when the nodes were in that state, simply because a drive-mapping involves I/O and the server was too busy to do it in the scant seconds it normally does.

This is sub-optimal. This also doesn't seem to happen on Windows, but I'm not sure of that.

This is something that a very new feature in the Linux kernel could help out, that that's to introduce the concept of 'priority I/O' to the storage stack. I/O with a high priority, such as cluster heart-beats, gets serviced faster than I/O of regular priority. That could prevent SBD abends. Unfortunately, as the NetWare kernel is no longer under development and just under Maintenance, this is not likely to be ported to NetWare.

I/O starvation. This shouldn't happen, but neither should 270ms I/O command latencies.

Labels: clustering, hp, msa, netware, novell, OES, storage, sysadmin

<permalink> posted by riedesg : 11/21/2007 12:40:00 PM 0 comments

Tuesday, September 04, 2007

Expanding the EVA

Our EVA3000 is full. All shelves have disks in them. In order to add space we need to replace our existing 143GB drives with 300GB drives. This is a rather expensive way to gain more space, as that extra 157GB of space costs the same as 300GB of space. But, that's what we have to do.

And wow does it take a while.

First I have to ungroup the disk. This can take up to two days. Then I pull the drive, and put the new one in. And regroup on top of it, which takes another up to two days. All the group/ungroup operations are competing for I/O with regular production.

Total time to add 157GB to the SAN? Looks to be 3 days and change.

We need a newer EVA.

Labels: storage, sysadmin

<permalink> posted by riedesg : 9/04/2007 11:02:00 AM 0 comments

Thursday, July 19, 2007

That darned MSA again

I'm not sure where this problem sits, but I'm having trouble with this MSA1500cs and my NetWare servers. I've found a failure case that is a bit unusual, but things shouldn't fail this way.

The setup:

NetWare 6.5, SP5 plus patches
EVA3000 visible
MSA1500cs visible
Pool in question hosted on the MSA
Pool in question has snapshots
Do a nss /poolrebuild on the pool

Do that, and at some point you'll get an error like this one:

 7-19-2007   9:48:22 am:    COMN-3.24-1092  [nmID=A0025]
 NSS-3.00-5001: Pool FACSRV2/USER2DR is being deactivated.
 An I/O error (20204(zio.c[2260])) at block 36640253(file block
 -36640253)(ZID 1) has compromised pool integrity.

The block number changes every time, and when it decides to crap out of the rebuild also changes every time. No consistency. The I/O error (20204) decodes to:

zERR_WRITE_FAILURE 20204 /* the low level async block WRITE failed*/

Which, you know, shouldn't happen. And this error is consistent across the following changes:

Updating the HAM driver (QL2300.HAM) from version 6.90.08 (a.k.a 6.90h) to 6.90.13 (6.90m).
Updating the firmware on the card from 1.43 to 1.45 (I needed to do this anyway for the EVA3000 VCS upgrade next month)
Applying the N65NSS5B patch, I had N65NSS5A on there before

PoolVerifies, a pure Read operation, do not throw this error.

I haven't thrown SP6 on there yet, as this is a WUF cluster node and this isn't intersession ;). This is one of those areas where I'm not sure who to call. Novell or HP? This is a critical error to get fixed as it impacts how we'll be replicating the EVA. It was errors similar to this, and activities similar to this, that caused all that EXCITEMENT about noon last Wednesday. That was not fun to live through, and we really really don't want to have that happen again.

Call Novell

Good:	Bad:
Their storage geeks know NetWare a lot better. Much more likely to know about Fibre Channel problems on NetWare.	Not likely to know HP-specific problems. More likely to recommend, "Well, then don't move your arm like that," as a solution.

The next step here is to delete these pools and volumes, recreate them, and see if things go Poink in quite the same way. I'm not convinced that'll fix the problem, as the errors being reported are Write errors, not Read errors, and the faulting blocks are different every time. I'm suspecting instability in the Write channel somewhere that is unique to a nss /poolrebuild, as I didn't get these errors when FILLING these volumes. Write channel in this case has a lot of Fibre Channel in it.

Labels: msa, netware, novell, storage

<permalink> posted by riedesg : 7/19/2007 02:20:00 PM 0 comments

Friday, July 06, 2007

Getting creative with Blackboard

I had me an idea yesterday. One of those ideas that I'm not sure is a good one, but wow does it make a certain kind of sense.

We, like all too many schools run Blackboard as the groupware product supporting our classrooms. There is an opensource product out there that also can do this, but we're not running it. That's not what this post is about.

First a wee bit of architecture. Roughly speaking, Blackboard is separated into three bits. The web server, the content server, and the database. The web-server is the classic Application Server that is what students and teachers interface with. The web server then talks with both the content server and database server. The content server is the ultimate home of all things like passed in homework. The database server glues this all together.

Due to policies, we have to keep courses in Blackboard for a certain number of quarters just in case a student challenges a grade. They may not be available to everyone, but those courses are still in the system. And so is all of the homework and assorted files associated with that class. Because of this, it is not unusual for us to have 2 years (6-7 quarters) of classes living on the content server, of which all but one quarter is essentially dead storage.

One of the problems we've had is that when it comes time to actually delete a course, it doesn't always clean up the Content associated with that course. Quite annoying.

This is a case where Dynamic Storage Technology would be great. Right now our Blackboard Content servers are a pair of Windows servers in a Windows Cluster. It struck me yesterday that this function could be fulfilled by a pair of OES2 servers in a Novell Clustering Services setup (or Heartbeat, but I don't know how to set THAT up), using Samba and DST to manage the storage. That way stuff that is accessed in the past, oh, 3 months would be on the fast EVA storage, and stuff older than 3 months would be exiled to the slow MSA storage. As the file-serving is done by way of web-servers rather than direct access, the performance hit by using Samba won't be noticable as the concurrency is well below the limit where that becomes a problem. Additionally, since all the files are owned by the same user I could use a non-NSS filesystem for even faster performance.

Hmmmm......

The problem here is that OES2 isn't out yet. Such a fantastical idea may be doable in the 2008 intersession window, but we may have other upgrades to handle there. But still, it IS an interesting idea.

Labels: blackboard, novell, OES, shadowvolumes, storage

<permalink> posted by riedesg : 7/06/2007 09:37:00 AM 0 comments

This site is a member of WebRing.
To browse visit Here.