September 2008 Archives

That darned iPhone

The DHCP scope that is associated with our Wireless network filled this morning. We knew we were getting tight, and were in the process of getting a new system in place that could handle non-contiguous network segments as address pools. But that's not in yet.

The reason we ran out? Apple iPhones. When they come into contact with a Wifi network, they grab an IP address even if they're not actively using it. So. Full scope.

That isn't my department, so I'm not sure what exactly we're doing about it. But it came up this morning.

Fickle fortune

I lost a RAID card in one of my Beta servers. Crap. These beasties are all old beasties since that's the only hardware that could be released for the beta. And with crap servers, comes a crap failure rate. This is the second RAID card I've lost, and I've lost one hard-drive too. It isn't common to lose more RAID cards than hard-drives. Arrg.

This puts a kink into things. This was going to be an edirectory host, so I could host my replicas on one set of servers and abuse the crap out of the non-replica application servers. I may have to dual host. Icky icky.

That darned 32-bit limit

Today I learned that the disk-space counters NetWare publishes over SNMP are signed integers. These stats are published in a table at OID . Having just expanded our FacShare volume past 2TB, it went negative-space according to the monitors. A simple integer overflow, since apparently Novell is using a signed integer for a number that can never be legitimately negative.

I've pointed this out on an enhancement request. This being NetWare, they may not choose to fix it if it is more than a two-line fix. We'll see.

This also means that volumes over 4TB can not be effectively monitored with SNMP. Since NSS can have up to 8TB volumes on NetWare, this could potentially be a problem. We're not there yet.
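To make the overflow concrete, here is a minimal Python sketch of what happens when a size is squeezed into a signed 32-bit field. It assumes the counter is in KB, which is my guess from the 2TB and 4TB thresholds rather than anything Novell documents here, and the function names are made up for illustration.

```python
# Sketch: how a size turns negative when stored in a signed 32-bit field.
# Assumption: the counter is in KB; the unit is inferred, not documented here.

KB_PER_TB = 1024 ** 3  # 1 TB = 1024^3 KB


def as_signed32(value):
    """Reinterpret the low 32 bits of value as a two's complement signed integer."""
    value &= 0xFFFFFFFF
    return value - 2 ** 32 if value >= 2 ** 31 else value


def as_unsigned32(value):
    """Undo the sign misreading; only correct while the true value is below 2^32."""
    return value & 0xFFFFFFFF


if __name__ == "__main__":
    for tb in (1.5, 2.5, 3.9, 5.0):
        true_kb = int(tb * KB_PER_TB)
        reported = as_signed32(true_kb)
        print(tb, "TB: reported", reported, "KB, recovered", as_unsigned32(reported), "KB")
```

Between 2TB and 4TB a monitoring script can recover the real number by masking the value back to unsigned, as `as_unsigned32` does. Past 4TB the counter wraps entirely and the information is gone, which is the case described above.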

Moving storage around

The EVA6100 went in just fine with that one hitch I mentioned, and now comes all the work we need to do now that we have actual space again. We're still arguing over how much space to add to which volumes, but once we decide all but Blackboard will be very easy to add.

Blackboard needs more space on both the SQL server and the Content server, and as the Content server is clustered it'll require an outage to manage the increase. And it'll be a long outage, as 300GB of weensy files takes a LONG time to copy. The SQL server uses plain old Basic partitions, so I don't think we can expand that partition, so we may have to do another full LUN copy which will require an outage. That has yet to be scheduled, but needs to happen before we get through much of the quarter.

Over on the EVA4400 side, I'm evacuating data off of the MSA1500cs onto the 4400. Once I'm done with that, I'm going to be:
  1. Rebuilding all of the Disk Arrays.
  2. Creating LUNs expressly for Backup-to-Disk functionality.
  3. Flashing the Active/Active firmware on to it, the 7.00 firmware rev.
  4. Getting the two Backup servers installed with the right MPIO widgetry to take advantage of active/active on the MSA.
But first we need the DataProtector licensing updates to beat their way through the forest of paperwork and get ordered. Otherwise, we can't use more than 5TB of disk, and that's WAY wimpy. I need at LEAST 20, and preferably 40TB. Once that licensing is in place, we can finally decommission the out-of-license BackupExec server and use the 6 slot tape library with DataProtector instead. This should significantly increase how much data we can throw at backup devices during our backup window.

What has yet to be fully determined is exactly how we're going to use the 4400 in this scheme. I expect to get between 15-20TB of space out of the MSA once I'm done with it, and we have around 20TB on the 4400 for backup. Which is why I'd really like that 40TB license please.

Going Active/Active should do really good things for how fast the MSA can throw data at disk. As I've proven before the MSA is significantly CPU bound for I/O to parity LUNs (Raid5 and Raid6), so having another CPU in the loop should increase write throughput significantly. We couldn't do Active/Active before since you can only do Active/Active in a homogeneous OS environment, and we had Windows and NetWare pointed at the MSA (plus one non-production Linux box).

In the mean time, I watch progress bars. TB of data takes a long time to copy if you're not doing it at the block level. Which I can't.

Monitoring ESX datacenter volume stats

A long while back I mentioned I had a perl script that we use to track certain disk space details on my NetWare and Windows servers. That goes into a database, and it can make for some pretty charts. A short while back I got asked if I could do something like that for the ESX datacenter volumes.

A lot of googling later I found how to turn on the SNMP daemon for an ESX host, and a script or two to publish the data I need by SNMP. It took some doing, but it ended up pretty easy to do. One new perl script, the right config for snmpd on the ESX host, setting the ESX host's security policy to permit SNMP traffic, and pointing my gathering script at the host.

The perl script that gathers the local information is very basic:
#!/usr/bin/perl -w
# Print name, total blocks, free blocks and block size (KB) for one VMFS
# volume, one value per line, for snmpd's exec directive to publish.
# Usage: vmfsspace.specific <volume-uuid> <friendly-name>

use strict;

my $vmfsvolume   = "/vmfs/volumes/$ARGV[0]";
my $vmfsfriendly = $ARGV[1];

# List-form pipe open: no shell is involved, so the path needs no escaping.
open(my $fh, "-|", "/usr/sbin/vmkfstools", "-P", $vmfsvolume)
    or die "Cannot run vmkfstools: $!";
while (<$fh>) {
    if (/Capacity ([0-9]*).*\(([0-9]*).* ([0-9]*)\), ([0-9]*).*\(([0-9]*).*avail/) {
        my ($capRaw, $capBlock, $blocksize, $freeRaw, $freeBlock) = ($1, $2, $3, $4, $5);
        my $totalspace = $capBlock;
        my $freespace  = $freeBlock;
        $blocksize = $blocksize / 1024;    # bytes to KB
        print("$vmfsfriendly\n$totalspace\n$freespace\n$blocksize\n");
    }
}
close($fh);

Then append the following lines to the /etc/snmp/snmpd.conf file (in my case):

exec . vmfsspace /root/bin/vmfsspace.specific 48cb2cbc-61468d50-ed1f-001cc447a19d Disk1

exec . vmfsspace /root/bin/vmfsspace.specific 48cb2cbc-7aa208e8-be6b-001cc447a19d Disk2

The first parameter after exec is the OID to publish. The script returns an array of values, one element per line, that are assigned to .0, .1, .2 and on up. I'm publishing the details I'm interested in, which may be different than yours. That's the 'print' line in the script.

The script itself lives in /root/bin/ since I didn't know where better to put it. It has to have execute rights for Other, though.

The big unique-ID looking number is just that, a UUID. It is the UUID assigned to the VMFS volume. The VMFS volumes are multi-mounted between each ESX host in that particular cluster, so you don't have to worry about chasing the node that has it mounted. You can find the number you want by logging in to the ESX host over SSH and doing a long directory listing of the /vmfs/volumes folder. The friendly name of your VMFS volume is symlinked to the UUID. The UUID is what goes in to the snmpd.conf file.
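If you'd rather not eyeball the directory listing, the symlink can be resolved in a couple of lines. This is only a sketch in Python, not part of what I actually run on the host; `vmfs_uuid` and its `volumes_dir` parameter are names I made up, and it assumes the friendly name really is a symlink as described above.

```python
import os


def vmfs_uuid(friendly_name, volumes_dir="/vmfs/volumes"):
    """Return the UUID directory name that a friendly-name symlink points to."""
    link = os.path.join(volumes_dir, friendly_name)
    # readlink returns the target as stored in the link (relative or absolute);
    # basename keeps only the last path component, which is the UUID.
    target = os.readlink(link).rstrip("/")
    return os.path.basename(target)


if __name__ == "__main__":
    # "Disk1" is the friendly name used in this post; substitute your own.
    print(vmfs_uuid("Disk1"))
```

It raises `OSError` if the name does not exist or is not a symlink, which is probably what you want when filling in a config file by hand.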

The last parameter ("Disk1" and "Disk2" above) is the friendly name of the volume to publish over SNMP. As you can see, I'm very creative.

These values are queried by my script and dropped into the database. Since the ESX datacenter volumes only get space consumed when we provision a new VM or take a snapshot, the graph is pretty chunky rather than curvy like the graph I linked to earlier. If VMware ever changes how the vmkfstools command returns data, this script will break. But until then, it should serve me well.
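For anyone building the gathering side, the four published values (friendly name, total blocks, free blocks, block size in KB) need a little arithmetic before they are useful. My real gathering script is Perl and isn't shown here, so treat this Python sketch as an illustration only; `parse_vmfs_values` is a hypothetical helper, and fetching the values over SNMP is left out.

```python
def parse_vmfs_values(values):
    """Turn [name, total_blocks, free_blocks, block_kb] into sizes in KB.

    The order matches the print line in the publishing script above.
    """
    name, total_blocks, free_blocks, block_kb = values
    block_kb = float(block_kb)
    total_kb = int(total_blocks) * block_kb
    free_kb = int(free_blocks) * block_kb
    pct_free = 100.0 * free_kb / total_kb if total_kb else 0.0
    return {
        "name": name,
        "total_kb": total_kb,
        "free_kb": free_kb,
        "pct_free": pct_free,
    }


if __name__ == "__main__":
    # Made-up sample values, not real output from my hosts.
    print(parse_vmfs_values(["Disk1", "1000", "250", "1024"]))
```

SNMP hands the values back as strings, which is why everything gets converted explicitly before the multiplication.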

EVA6100 upgrade a success

Friday night four HP techs arrived to put together the EVA6100 from a pile of parts and the existing EVA3000. It took them 5 hours to get it to the point where we could power-on and see if all of our data was still there (it was, yay), and a few hours after that on our part to put everything back together.

There was only one major hitch for the night, which meant I got to bed around 6am Saturday morning instead of 4am.

For EVA, and probably all storage systems, you present hosts to them and selectively present LUNs to those hosts. These host-settings need to have an OS configured for them, since each operating system has its own quirks for how it likes to see its storage. While the EVA6100 has a setting for 'vmware', the EVA3000 did not. Therefore, we had to use a 'custom' OS setting and a 16 digit hex string we copied off of some HP knowledge-base article. When we migrated to the EVA6100 it kept these custom settings.

Which, it would seem, don't work for the EVA6100. It caused ESX to whine in such a way that no VMs would load. It got very worrying for a while there, but thanks to an article on vmware's support site and some intuition we got it all back without data loss. I'll probably post what happened and what we did to fix it in another blog post.

The only service that didn't come up right was secure IMAP for Exchange. I don't know why it decided to not load. My only theory is that our startup sequence wasn't right. Rebooting the HubCA servers got it back.

Fixing DNS issues

I've noticed some slow DNS on my station for the last few weeks and finally got down to checking it out. In the wake of the cache-poisoning scare of late July, we had to upgrade our DNS servers to something a bit less scarily old. I believe this required an operating system rev. The last time this happened to me, we figured out that the DNS server in question had auto-negotiated itself to 10-HalfDuplex, and the switch thought it was 100-FullDuplex. You can imagine what that did to throughput.

I fired up wireshark and started tracking my DNS requests. A pattern soon emerged. The first entry in my resolv.conf list was taking anywhere from .5 to 5.2 seconds to resolve most queries. This is hella slow for a DNS server. Since I don't manage these machines, I let the admin who did manage 'em know about it. He couldn't find anything wrong with the DNS servers on a first glance.
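Wireshark is what I actually used, but a rough version of the same check can be scripted. The Python sketch below times lookups through the system resolver and flags the slow ones; it is a different technique from a packet capture, since it can't tell you which server in the list answered. The function names and the half-second threshold (borrowed from the numbers above) are my own choices.

```python
import socket
import time


def time_lookup(name, resolve=socket.gethostbyname):
    """Return (seconds, result) for one lookup through the given resolver function."""
    start = time.perf_counter()
    try:
        result = resolve(name)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        result = exc
    return time.perf_counter() - start, result


def slow_lookups(names, threshold=0.5, resolve=socket.gethostbyname):
    """Return (name, seconds) pairs for lookups slower than threshold seconds."""
    timings = [(name, time_lookup(name, resolve)[0]) for name in names]
    return [(name, secs) for name, secs in timings if secs > threshold]


if __name__ == "__main__":
    # Placeholder hostnames; use names your own clients actually query.
    for name, secs in slow_lookups(["www.example.com", "example.org"]):
        print(name, "took", round(secs, 2), "seconds")
```

Local caching will hide the problem on repeat queries, so use names that haven't been looked up recently.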

Another thing I noticed when looking at the resolver requests I was passing was a lot of IPv6 requests. Almost all of them were for Active Directory related queries, as I've turned off IPv6 support in my web-browser. I still haven't quite figured out how to disable IPv6 on my openSUSE 10.3 machine here.

As it happens, said DNS admin came back in and said to look at things again. So I dropped into nslookup and started throwing queries and watching the response times in wireshark, and sure enough they were zippy again. He turned off IPv6 support on the DNS servers.

Looks like we'll probably need to have a conversation on campus about IPv6 sooner rather than later. Vista comes with it turned on by default, and happily we don't have much of that yet. But these newer linux distros all have it turned on by default.

That darned budget

This is where I whine about not having enough money.

It has been a common complaint amongst my co-workers that WWU wants enterprise level service for a SOHO budget. Especially for the Win/Novell environments. Our Solaris stuff is tied in closely to our ERP product, SCT Banner, and that gets big budget every 5 years to replace. We really need the same kind of thing for the Win/Novell side of the house, such as this disk-array replacement project we're doing right now.

The new EVAs are being paid for by Student Tech Fee, and not out of a general budget request. This is not how these devices should be funded, since the scope of this array is much wider than just student-related features. Unfortunately, STF is the only way we could get them funded, and we desperately need the new arrays. Without the new arrays, student service would be significantly impacted over the next fiscal year.

The problem is that the EVA3000 contains between 40-45% directly student-related storage. The other 55-60% is Fac/Staff storage. And yet, the EVA3000 was paid for by STF funds in 2003. Huh.

The summer of 2007 saw a Banner Upgrade Project, when the servers that support SCT Banner were upgraded. This was a quarter million dollar project and it happens every 5 years. They also got a disk-array upgrade to a pair of StorageTek (SUN, remember) arrays, DR replicated between our building and the DR site in Bond Hall. I believe they're using Solaris-level replication rather than Array-level replication.

The disk-array upgrade we're doing now got through the President's office just before the boom went down on big expensive purchases. It languished in the Purchasing department due to summer-vacation related under-staffing. I hate to think how late it would have gone had it been subjected to the added paperwork we now have to go through for any purchase over $1000. Under no circumstances could we have done it before Fall quarter. Which would have been bad, since we were too short to deal with the expected growth of storage for Fall quarter.

Now that we're going deep into the land of VMWare ESX, centralized storage-arrays are line of business. Without the STF funded arrays, we'd be stuck with "Departmental" and "Entry-level" arrays such as the much maligned MSA1500, or building our own iSCSI SAN from component parts (a DL385, with 2x 4-channel SmartArray controller cards, 8x MSA70 drive enclosures, running NetWare or Linux as an iSCSI target, with bonded GigE ports for throughput). Which would blow chunks. As it is, we're still stuck using SATA drives for certain 'online' uses, such as a pair of volumes on our NetWare cluster that are low usage but big consumers of space. Such systems are not designed for the workloads we'd have to subject them to, and are very poor performers when doing things like LUN expansions.

The EVA is exactly what we need to do what we're already doing for high-availability computing, yet is always treated as an exceptional budget request when it comes time to do anything big with it. Since these things are hella expensive, the budgetary powers-that-be balk at approving them and like to defer them for a year or two. We asked for a replacement EVA in time for last year's academic year, but the general-budget request got denied. For this year we went, IIRC, both with general-fund and STF proposals. The general fund got denied, but STF approved it. This needs to change.

By October, every person between me and Governor Gregoire will be new. My boss is retiring in October. My grandboss was replaced last year, my great grand boss also has been replaced in the last year, and the University President stepped down on September 1st. Perhaps the new people will have a broader perspective on things and might permit the budget priorities to be realigned to the point that our disk-arrays are classified as the critical line-of-business investments they are.

Disk-array migrations done right

We have two new HP EVA systems: an EVA4400 with FATA drives that we'll be putting into our DR datacenter in Bond Hall, and an EVA6100 (plus 2 new enclosures) that we're building by upgrading our EVA3000. The 4400 is a brand new device, so it is sitting idle right now (officially). It will be replacing the MSA1500 we purchased two years ago, and will fulfill the duties the MSA should have been doing but is too stupid to do.

We've set up the 4400 already, and as part of that we had to upgrade our CommandView version from the 4.something it was with the EVA3000 to CommandView 8. As a side effect of this, we lost licensing for the 3000 but that's OK since we're replacing that this weekend. I'm assuming the license codes for the 6100 are in the boxes the 6100 parts are in. We'll find that out Friday night, eh?

One of the OMG NICE things that comes with the new CommandView is a 60 day license for both ContinuousAccess EVA and BusinessCopy EVA. ContinuousAccess is the EVA to EVA replication software, and is the only way to go for EVA to EVA migrations. We started replicating LUNs on the 6100 to the 4400 on Monday, and they just got done replicating this morning. This way, if the upgrade process craters and we lose everything, we have a full block-level replica on the 4400. So long as we get it all done by 10/26/2008, which we should do.

On a lark we priced out what purchasing both products would cost. About $90,000, and that's with our .edu discount. That's a bit over half the price of the HARDWARE, which we had to fight tooth and nail to get approved in the first place. So. Not getting it for production.

But the 60 day license is the only way to do EVA to EVA migrations. In 5 years when the 6100 falls off of maintenance and we have to forklift replace a new EVA in, it will be ContinuousAccess EVA (eval) that we'll use to replicate the LUNs over to the new hardware. Then on migration date we'll shut everything down ("quiesce I/O"), make sure all the LUN presentations on the new array look good, break the replication groups, and rezone the old array out. Done! Should be a 30 minute outage.

Without the eval license it'd be a backup-restore migration, and that'd take a week.


Looking at usage stats, the amount of data transferred by Myweb for Students has gone down somewhat from its heyday in 2006. I blame Web 2.0. Myweb is a static HTML service. We don't allow any server-side processing of any kind other than server-side includes. This is not how web-development is done anymore. This very blog is database backed, but Blogger publishes static HTML pages to represent that database, which is why I'm able to host this blog on Myweb for FacStaff.

If we were to provide a full-out hosting service for our students (and staff), I'm sure there would be a heck of a lot more uptake. A few years ago there was a push in certain Higher Ed circles to provide a "portfolio service," which would host a student's work for a certain time after graduation so they could point employers at it as a reference. We never did that for a variety of reasons (cost being a big one), but the sentiment is still there.

If we were to provide not only full-out hosting, but actual domain-hosting for students, it could fill this need quite well. Online brand is important, and if a student can build a body of work on "$studentname.[com|org|net|biz]" it can be quite useful in hunting down employment. Several of the ResTek technicians I know have their own domains hosting their own blogs, so the demand is there.

I've never worked for a company that did web-hosting as a business item, so I've only heard horror stories of how bad it can get. First of all, we'll need a full LAMP stack server-farm to run the thing. That's money. Second, we'll need the organizational experience with the technology to keep badly configured WordPress or phpBB installs from DoSing other cohosted sites through resource exhaustion once hackers find them. This is a worker-hours thing.

Then we'd have to figure out the graduated problem. Once a student graduates, do we keep hosting for them? Do we charge them? Do we force them off the system after a specific time? Questions that need answers, and these are the kinds of questions that contributed to the killing of the portfolio-server idea.

Personally, I think this is something we could provide. However, someone needs to kick the money tree hard enough to shake loose the funds to make it happen. Perhaps Student Tech Fee could do it. Perhaps it could be a 'discounted' added-cost service we provide. Who knows. But we could probably do it.

EVA4400 + FATA

Some edited excerpts of internal reports I've generated over the last (looks at watch) week. The referenced testing operations involve either a single stream of writes, or two streams of writes in various configurations:
Key points I've learned:
  • The I/O controllers in the 4400 are able to efficiently handle more data than a single host can throw at it.
  • The FATA drives introduce enough I/O bottlenecks that multiple disk-groups yield greater gains than a single big disk-group.
  • Restripe operations do not cause anywhere near the problems they did on the MSA1500.
  • The 4400 should not block-on-write the way the MSA did, so the NetWare cluster can have clustered volumes on it.
The "Same LUN" test showed that Write speeds are about half that of the single threaded test, which gives about equal total throughput to disk. The Read speeds are roughly comparable, giving a small net increase in total throughput from disk. Again, not sure why. The Random Read tests continue to perform very poorly, though total throughput in parallel is better than the single threaded test.

The "Different LUN, same disk-group," test showed similar results to the "Same LUN" test in that Write speeds were about half of single threaded yielding a total Write throughput that closely matches single-threaded. Read speeds saw a difference, with significant increases in Read throughput (about 25%). The Random Read test also saw significant increases in throughput, about 37%, but still is uncomfortably small at a net throughput of 11 MB/s.

The "Different LUN, different disk-group," test did show some I/O contention. For Write speeds, the two writers showed speeds that were 67% and 75% of the single-threaded speeds, yet showed a total throughput to disk of 174 MB/s. Compare that with the fastest single-threaded Write speed of 130 MB/s. Read performance was similar, with the two readers showing speeds that were 90% and 115% of the single-threaded performance. This gave an aggregate throughput of 133 MB/s, which is significantly faster than the 113 MB/s turned in by the fastest Reader test.
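To put a number on "significantly faster", here is the arithmetic as a small Python sketch. It uses only the four aggregate figures quoted above; the function name is mine.

```python
def aggregate_gain(parallel_total, best_single):
    """Percent increase of combined two-stream throughput over the best single stream."""
    return (parallel_total / best_single - 1.0) * 100.0


# Figures in MB/s, taken from the different-disk-group test above.
write_gain = aggregate_gain(174, 130)
read_gain = aggregate_gain(133, 113)

print(f"write: +{write_gain:.1f}%  read: +{read_gain:.1f}%")
```

That works out to roughly a third more write throughput and a bit under a fifth more read throughput when the two streams land on separate disk-groups.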

Adding disks to a disk-group appears to not significantly impact Write speeds, but significantly impact Read speeds. The Read speed dropped from 28 MB/s to 15 MB/s. Again, a backup-to-disk operation wouldn't notice this sort of activity. The Random Read test showed a similar reduction in performance. As Write speeds were not affected by restripe, the sort of cluster hard-locks we saw with the MSA1500 on the NetWare cluster will not occur with the EVA4400.

And finally, a word about controller CPU usage. In all of my testing I've yet to saturate a controller, even during restripe operations. It was the restripe ops that killed the MSA, and the EVA doesn't seem to block nearly as hard. Yes, read performance is dinged, but not nearly to the levels that the MSA does. This is because the EVA keeps its cache enabled during restripe-ops, unlike the MSA.

One thing I alluded to in the above is that Random Read performance is rather bad. And yes, it is. Unfortunately, I don't yet know if this is a feature of testing methodology or what, but it is worrisome enough that I'm figuring it into planning. The fastest random-read speed turned in for a 10GB file, 64KB nibbles, came to around 11 MB/s. This was on a 32-disk disk-group on a Raid5 vdisk. Random Read is the test that most closely approximates file-server or database loads, so it is important.

HP has done an excellent job tuning the caches for the EVA4400, which makes Write performance exceed Read performance in most cases. Unfortunately, you can't do the same reordering optimization tricks for Read access that you can for Writes, so Random Read is something of a worst-case scenario for these sorts of disks. HP's own documentation says that FATA drives should not be used for 'online' access such as file-servers or transactional databases. And it turns out they really meant that!

That said, these drives' sequential write performance is excellent, making them very good candidates for Backup-to-Disk loads so long as fragmentation is constrained. The EVA4400 is what we really wanted two years ago, instead of the MSA1500.

Still no word on whether we're upgrading the EVA3000 to an EVA6100 this weekend, or next weekend. We should know by end-of-business today.

A history of browsing

With Google Chrome now out and about, it got me thinking about my own browsing habits.

I first heard about this new 'www' thing back in college. I had been on the Internet for a couple years by that point, but http-free. Telnet and FTP were my friends. And so was Gopher. The very first browser I ever used was NCSA Mosaic, which was running on a DEC graphical station in one of the computer labs. It was a whonking big piece of software, so I didn't use it much. But use it I did.

Then someone installed a Netscape Navigator version to one of the CompSci machines and that somehow changed things. If I'm remembering right, it was version 0.97, and had the "pulsing throbbing N" in the upper right corner (from a readme.txt: "And remember, it may be spelled 'Netscape', but it's pronounced 'Mozilla.'"). After the 1.0 version it changed to the 'comet over the earth' logo that would stay with Netscape for the next many years.

The first browser I installed on a machine I owned was a package that I've forgotten the name of. It was a rather cunning use of telnet and the lynx text browser to emulate a graphical browser. This was before my university allowed SLIP or PPP dialups, of course. Once they did that, I could get my own version of Netscape. And did so.

And there I sat for a long number of years. I flirted with Opera a few times. Tried out IE. But, in the end, I stuck with Netscape. Opera just didn't feel right, or render things the way I expected. IE was the evil Microsoft, and I avoided it where possible.

And then... Netscape Communicator got stale. It hung in version 4.7something for aaaaaages. IE started eating Netscape's lunch. And things just got too annoying. At work I started using Opera, version 4.0 IIRC. It worked for the most part, but was still unsatisfactory.

I stuck with Opera for maybe 3 months before moving over to... sigh.... IE. IE5.5 was at the time significantly better than the alternatives. I moved to it exclusively at home, finally ditching Communicator. I believe the reason I moved had to do with IE's 'security zone' architecture, which made a lot of sense for me. Certain sites I wanted to grant 'trusted' status to, and it would just work with no popups. For the rest of the internet I could set different settings. It worked great.

And then I heard about an open-source version of Netscape called Mozilla. I kept an eye on it for a while, waiting for it to become more stable. I installed it on the home Linux machine since obviously IE wouldn't work there. In time, Mozilla matured to the point where it was stable enough for me, and I figured out how to make its security features do what I wanted them to do. I think I formally moved everything to Mozilla shortly after the 1.0 release.

And there I stayed, right up to the point where Mozilla killed Mozilla in favor of Firefox (formerly Firebird). I dutifully switched. And over the course of the next year I got steadily more annoyed with Firefox. Some of which I complained about here. I don't remember how I found out about it, but I learned of the SeaMonkey project that was recreating the Mozilla experience in a fully community-supported way. And that's where I am now.

Unfortunately, SeaMonkey is beginning to look as dated as Communicator was. Firefox 3 may be less annoying than earlier Firefox versions, so I probably have to try it out. Opera is pretty good, and I've spent time using 9.5 already, but the plugin community is weak and it doesn't do exactly what I want when it comes to privacy settings. IE8 is out of the picture since it's Windows-only. And so is Chrome, though I hear that'll be changing.

EVA4400 testing

Right before I left Friday I started a test on the EVA4400 with a 100GB file. This is the same file-size I configured DataProtector to use for the backup-to-disk files, so it's a good test size.

Sequential Write speed: 79,065 KB/s
Sequential Read speed: 52,107 KB/s

That's a VERY good number. The Write speed above is about the same speed as I got on the MSA1500 when running against a Raid0 volume, and this is a Raid5 volume on the 4400. During the 10GB file-size test I did before this one, I also watched the EVA performance on the monitoring server, and controller CPU during that time was 15-20% max. Also, it really used both controllers (thanks to MPIO).

Random Write speed: 46,427 KB/s
Random Read speed: 3,721 KB/s

Now we see why HP strongly recommends against using FATA drives for random I/O. For a file server that's 80% read I/O, it would be a very poor choice. This particular random-read test is worst-case, since a 100GB file can't be cached in RAM so this represents pure array performance. File-level caching on the server itself would greatly improve performance. The same test with a 512MB file turns in a random read number of 1,633,538 KB/s which represents serving the whole test in cache-RAM on the testing station itself.
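Raw KB/s numbers are hard to feel, so here is a quick Python sketch that turns the four results above into how long it would take to move the whole 100GB file at each rate. It assumes binary units (1 GB = 1024 * 1024 KB), which is my assumption about how the benchmark reports sizes.

```python
KB_PER_GB = 1024 * 1024  # assumption: binary units


def seconds_to_move(size_gb, rate_kb_per_s):
    """Seconds needed to move size_gb gigabytes at a sustained rate in KB/s."""
    return size_gb * KB_PER_GB / rate_kb_per_s


# Rates in KB/s, copied from the test results in this post.
rates = {
    "sequential write": 79065,
    "sequential read": 52107,
    "random write": 46427,
    "random read": 3721,
}

for label, rate in rates.items():
    secs = seconds_to_move(100, rate)
    print(f"{label}: {secs / 60:.0f} minutes")
```

Sequential write gets through the file in a little over 20 minutes, while random read at the measured rate would need close to 8 hours. That gap is the whole argument against putting file-server loads on FATA.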

This does suggest a few other tests:
  • As above, but two 100GB files at the same time on the same LUN
  • As above, but two 100GB files at the same time on different LUNs in the same Disk Group
  • As above, but two 100GB files at the same time on different LUNs in different Disk Groups