August 2006 Archives

Math is hard

So far in the process of debugging this little proggie I've discovered three major math errors.
  1. The timer I'm using has some math in it that made me miss decimal places. Oops.
  2. Small bug where I was retrieving the directory list the same number of times as the stepping factor. Given a stepping factor of 500, it would grab the dir list 500 times. Oops.
  3. After retrieving the time taken to grab the directory list, I divided it by the stepping factor and then by the total number of entries retrieved. It should have been divided by just the number of entries retrieved. Oops.
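For my own notes, the corrected per-op math looks something like this. A minimal Python sketch, not the actual VisualStudio code; the function name and structure are mine:

```python
import os
import time

def enumeration_op_time(path):
    """Return (total_seconds, seconds_per_entry) for ONE enumeration.

    Fix #2: the directory list is fetched exactly once, not once per
    stepping-factor increment.
    Fix #3: the total is divided by the number of entries actually
    returned -- not by the stepping factor.
    """
    start = time.perf_counter()
    entries = os.listdir(path)        # one directory enumeration only
    total = time.perf_counter() - start
    return total, total / max(len(entries), 1)
```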


Benchmark observations

Switching this little prog to make files instead of directories was the work of about three lines of code. No biggie. Comparing it to the directory-make runs, there are a couple of observations I can make when running against NetWare/NSS on a SAN:
  • MKDIR seems to be more server-CPU intensive than TOUCH by quite a bit. During the long MKDIR test CPU was noticeably higher than ambient, but the TOUCH test barely twitched the needle. Hmmm.
  • MKDIR is a faster operation than TOUCH, from a client's perspective.
  • Directories are faster to enumerate than files.
  • Enumeration operations are sensitive to network latency. When the client is busy, enumeration gets noisier.
  • Both create and enumerate are sensitive to client CPU loads.
  • Enumeration is much faster than create, by about four orders of magnitude.
  • Directory create time, at least, does trend upwards depending on how many objects are in the parent directory. Though this is only really visible when going well above 100,000 directories, and it is very slight: 2.049ms at 2,000 dirs versus 2.4159ms at 500K dirs. Haven't tested files yet.
Both tests were run on a directory with Purge Immediate set. For a REASON. In the actual benchmark I'll probably also set PI, so I'm not filling the slack space with deleted files and then having to explain away a mid-benchmark performance drop when the server has to expire the oldest files/directories.


Perhaps not?

I ran a few tests with the benchmark I wrote yesterday, and it doesn't seem to be a useful one. Perhaps if I tweak it to use files instead of directories it'll be more useful. But the charts I get out of it show very slow slides up, with no clear break points. MKDIR functions work nearly linearly, and enumerating the directories also shows very good performance. It still takes a long time to parse through 100,000 directories, but atomically it works well.

Though, I wonder if files yield different results?

And, doh, I bet I'll get different results when running against a Windows server than a NetWare one. Heh. We'll see.


That was easier than I thought...

After not nearly as much time spent in front of VisualStudio as I thought I'd need, I now have a tool to test out the big-directory case. I'm running a few tests to see if the output makes sense, and early returns show not-too-shabby information. I'm still not 100% certain that I have my units right, but at least it'll give me something to compare against, and show whether large directory listings are subject to linear or curved response times.

Some sample output:

Iterations, MkDirOpTime(ms), EnumerationTimeDOS(ms), EnumerationOpTimeDOS(ms)
500, 0.213460893024992, 4.33778808035625, 0.0086755761607125
1000, 0.205388901117948, 8.56917296772817, 0.00856917296772817
1500, 0.206062279938047, 12.6200889185171, 0.00841339261234476
2000, 0.203543292746338, 16.5182862916397, 0.00825914314581986
2500, 0.202069268714127, 20.5898861176478, 0.00823595444705914
3000, 0.201296393468305, 24.786919106575, 0.00826230636885834

That's 3000 directories being enumerated in that bottom line. I'm also not 100% sure the time unit really is milliseconds, though the same conversion from the arbitrary (?) system units into real units applies to all of those columns.

The utility takes two arguments:
  • Number of directories to create.
  • [optional] Stepping factor.
The above output was generated with "bigdir 3000 500": create 3000 subdirectories, and do a directory enumeration for every 500 directories created. It defaults to a step of 1.
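In case anyone wants to replicate this without my code, the core loop is roughly the following. A hedged Python sketch of what the tool does; the function name and column math are my reconstruction, not the actual VisualStudio source:

```python
import os
import time

def bigdir(total, step=1, root="bigdir_test"):
    """Create `total` subdirectories; after every `step` creations,
    time a full enumeration. Returns rows matching the CSV columns:
    (Iterations, MkDirOpTime(ms), EnumerationTime(ms), EnumerationOpTime(ms))."""
    os.makedirs(root, exist_ok=True)
    rows = []
    mkdir_seconds = 0.0
    for i in range(1, total + 1):
        t0 = time.perf_counter()
        os.mkdir(os.path.join(root, "d%06d" % i))
        mkdir_seconds += time.perf_counter() - t0
        if i % step == 0:
            t0 = time.perf_counter()
            count = len(os.listdir(root))    # the enumeration under test
            enum_ms = (time.perf_counter() - t0) * 1000.0
            rows.append((i, mkdir_seconds / i * 1000.0,
                         enum_ms, enum_ms / count))
    return rows
```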


Getting ready for a benchmark

Last January I did a benchmark of OES-Linux versus OES-NetWare performance for NCP and CIFS sharing. That was done on OES SP1, since SP2 had only recently been released. SP2 has now been out for quite some time, and both platforms have seen significant improvements with regards to NSS and NCP.

Right now I'm looking to test two things:
  • NCP performance to an NSS volume from a Windows workstation (iozone)
  • Big directory (10,000+ entries) performance over NCP (tool unknown)
I'm open to testing other things, but my testing environment is limited. There are a few things I'd like to test, but don't have the material to do:
  • Large scale concurrent connection performance test. Essentially, the NCP performance test done massively parallel, over 1000 simultaneous connections. Our cluster servers regularly serve around 3000 simultaneous connections during term, and I really want to know how well OES-Linux handles that.
  • Scaled AFP test. This requires having multiple Mac machines, which I don't have access to. We have a small but vocal Mac community (all educational institutions do, I believe), and they'll notice if performance drops as a result of a theoretical NetWare to Linux kernel change.
  • Any AFP test at all. No Mac means no testy testy.
  • NCP performance to an NSS volume from a SLED10 station. I don't have a reformatable test workstation worth beans that can drive a test like this one, and I don't trust a VM to give consistent results.
The large directory test is one that my co-workers pointed to after my last test. The trick there will be finding a tool that'll do what I need to do. IOZONE comes with one that comes kinda close, but isn't right. I need to generate X sub-directories, and time how long it takes to enumerate those X sub-directories. Does it scale linearly, or is there a threshold where the delay goes up markedly?
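If I do end up writing something, the analysis side is straightforward. A hedged sketch of the threshold check I have in mind; the name and the tolerance value are my own arbitrary choices:

```python
def detect_knee(samples, tolerance=2.0):
    """samples: list of (entry_count, total_enum_seconds), sorted by count.

    If enumeration is linear, the per-entry cost should stay roughly flat
    as the directory grows. Return the first entry count where it exceeds
    `tolerance` times the baseline cost, or None if it never does.
    """
    base_n, base_t = samples[0]
    baseline = base_t / base_n
    for n, t in samples[1:]:
        if t / n > tolerance * baseline:
            return n        # delay went up markedly here
    return None             # scales linearly, within tolerance
```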

This may require me to write custom code, which I'm loath to do but will if I have to. Especially since different API calls can yield different results on the same platform, and I'm not programmer enough to be certain which API call I'd be hooking is the one we want to test. This is why I'd like to find a pre-built tool.

If you have something that you'd like tested, post in the comments. It may actually happen if you include a pointer to a tool that'll measure it. Who knows?


Keeping up

Take a look at Hera's loading:

The break in the chart at the beginning of 'week 34' is the point where I took Hera down to reformat it. The big spikes afterwards are all the Things I had to do to it last week.

The other thing to keep in mind is that the line before the break is formatted differently than after the break. Before, the green line was the point-in-time utilization of CPU0, and the blue line was the point-in-time utilization of CPU1. After the break, the green line is the 1-minute averaged load, and the blue line the 5-minute averaged load. Not exactly apples to apples. But generally speaking, if you add the blue and green lines together before the break you'll get an equivalent 'after the break' line.

The bit at the beginning of the week where the blue line falls to zero and the green line gets really small is the time after I removed the replicas from Hera and before I turned it off. It was still getting non-trivial LDAP traffic at the time, but it was forwarding requests off to the other two eDir servers instead of serving results itself. Interesting.

I've already noticed that using iManager on that server will spike CPU quite noticeably. When I leave things be, load is about where I'd expect. Regular processing appears to produce about the same total load as before the reformat, possibly a bit less. What will be interesting is what the chart will look like once school starts up again.


Plodding on

Things are more stable this morning, but we did have some issues. First and most worrying, two of the replicas on Hera were not in a good state. One, happily a small one, never left the "new" state. The other just plain wasn't synching completely.

First, the replica that never left 'new'. I haven't seen that one before, so it took a LOT of digging until I found the fix for it. All dstracing showed that attempts to sync that particular replica were throwing a -673 error (FFFFFD5F replica not on). What ultimately fixed it was doing a "network address repair" on the other two main eDir servers. That seemed to kick clear whatever blockage had built up.

The second one was easier. I just removed the replica from Hera while I worked the other problem. I put it back when the other replica was working fine. In the process I noticed that some of the servers in that replica (but not in the ring) were showing 'unlocatable' errors in the network address rebuild process. Not critical. But once the replica was back on, it showed no signs of going the way it did at first.

As a side effect, I also identified a handful of servers that weren't correctly advertising their presence in SLP. In every case the SLP discovery option was set to 2, or DHCP-only. In that state it ignores slp.cfg completely. Changing it to 4 suddenly caused these servers to find the DAs and report their services, and thus permitted their network addresses to be repaired.

SLP on this server in general is a bit confusing. I'm not sure what services an OES-Linux server is supposed to advertise, so I'm not sure if SLP is completely healthy.

I also managed to get LUM set up right. With that in place, My Fellow Admins can log in to the server without me having to create accounts! I'm so proud.

In terms of server health, things are in very good shape. CPU usage is still a bit worrying, but now that I have a day's worth of data to look at, it appears to be about the same as it was before. On the other hand, the "outstanding requests" figure in the iMonitor agent health-check consistently shows lower numbers. Like 3-5 instead of the 7-9 it was before. Peanuts, but progress.

And this morning we heard that the first parts of the router replacement have started. A couple of buildings were moved to the new cloud around 6am today. No screaming so far.


More thumping

Today's tasks were to get the monitoring we do to that server set up correctly, and fix niggling things. The monitoring was actually pretty fast. I had done my homework on that one, and getting the new configs in was pretty trouble free.

One of the bigger niggling things was LDAP. This is more of a left-over of the meltdown we had this May. The servers that drive the single-signon for most of campus objected to the certificate that Hera was presenting. That got routed around, but it got a couple of us into the thick of things. Getting nldap on Linux to present a new certificate isn't quite as simple as it is on NetWare. It isn't easy in either place, but it took more... whacking to get the change to truly take on Linux.

Also, all of the replicas have been put back onto Hera. And weirdly, our total CPU usage is higher than it was before the change. WHA? The inverse of that was the goal of the whole change in the first place. We'll see how things go once we get some normal usage under our belts. All the poking I'm doing on the server is taking cycles, and importing whole replicas is CPU intensive as it is.

Happily, I created an 'installation server' local to that machine so I don't have to feed CD's if I need to install something or other. I haven't decided if I'll make this network-accessible or not, as we have exactly zero further OES-Linux servers in the pipeline. But still. It'll save time.


State of the migration

I ran into a few hitches yesterday that I hinted at. The first thing I ran across is that I don't understand how OpenSSL and NovellPKI work together. I got asked during the install to create a Certificate Authority. I got side-tracked in the mind-set of 'there is only one CA per tree, and this isn't it', and didn't create one. This got me later when it didn't export some key SSL file and apache2 wasn't able to load.

So I removed edir from the server and tried the install again.

Which is where I came upon my second problem: ndsconfig does not remove eDirectory nearly as well as NWCONFIG does on NetWare. There were objects scattered hither and thither that prevented a successful reinstall of eDir on the same server name. Objects like LDAP Server objects, and SAS objects. To get eDir reinstalled successfully I had to manually delete all the extra objects.

This is a problem I ran into during testing; I just forgot I'd run into it before I headed over to the other data center. Oops.

The third problem was the post-SP updates. Since SP2 was released in January, there have been a LOT of patches: 1.7GB worth. Good thing I work for an educational institution with fat pipes and was performing the update during an intersession when traffic is very light. Aye. THAT didn't give me any grief at all, happily.

Since SP2 came out, it looks like we've been averaging something like 2.3 patches per day inclusive of weekends. That's a lot. That's more than Microsoft in the bad old days before they came up with the Patch Tuesday concept. So once again I blow the dust off of procedures I used back then:
  1. Identify the patch.
  2. Assess if we have the package that is being patched.
  3. Determine if the behavior addressed by this patch is one we'll ever run in to.
  4. Based on 3, decide if this is a Patch Now, or Patch Normally patch.
Happily for me, normal users will never ever have file system access on this server, so something like 85% of the security patches fix things that I'll only worry about once the server has already been broken into. Therefore, most of the patches coming down the pike can wait for normal patch management days.
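The triage steps above boil down to a tiny decision function. A sketch under my own assumptions; the names and the return strings are mine, not any real patch-management tool:

```python
def triage(have_package, behavior_applies):
    """Steps 2-4 of the patch procedure as a decision.

    have_package:     do we even have the package being patched? (step 2)
    behavior_applies: will we ever run into the fixed behavior? (step 3)
    """
    if not have_package:
        return "skip"            # nothing to patch
    if behavior_applies:
        return "patch now"       # step 4: the urgent path
    return "patch normally"      # wait for normal patch management days
```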

The other thing I forgot was the cardinal rule of doing ANYTHING with Linux:
Thou shalt have internet access and a browser. Yea, verily, yea.
I didn't. I assumed, wrongly, that the Windows servers next to my patient would be usable. Those four are running headless. Oops. Ah well.


Quick hits

  • 1.7GB of post-SP patches is a leeeetle excessive.
  • I don't understand how Linux works with the Novell PKI
  • ndsconfig does not clean up after itself very well. Reinstalling eDir requires manual deletion of eDir objects for it to work.
  • OES-Linux is less user-friendly during install than OES-NW

Aaaaand that's it. More later. Have to run.

Removing replicas

I'm removing the replicas from Hera. As expected, the STUDENT replica is taking quite some time. It has to flag all umpteen thousand objects as backlinks, and this isn't exactly a fast server. At the rate it is going, it'll be noon before it is completely expiring the replica.

Doesn't change anything, it's just fun to see all the blnk traffic in dstrace.


Fun stuff next Tuesday

Tuesday I will be nuking the Hera server and rebuilding it as an OES-Linux server. The other two eDirectory servers will not be following it any time soon for two reasons:
  1. No direction yet from On High (which just changed people) regarding Novell-supplied operating systems.
  2. Two NetWare dependencies on those two servers
This is our first in-production OES-Linux box. We've had them in the tree before, but in a test capacity rather than actually doing work. One of them is currently acting as a build server, though the 18GB drives in it are proving to be a tad wimpy when it comes to hosting build trees. Anyway, this is the very first one.

And then we learn how to support these things. With patches coming out for them every other day (thank you, Open Source) the patch-cycle management will need tweaking from what we do with our other systems (specifically, NetWare and Windows). Also needing input is how to get the other admins in, and how hosting eDir on Linux changes how we kick problems.

Exciting stuff. There are a few things that needed doing to get it into our environment. Firstly, Hera is a Primary Timesync source, which will obviously have to change. Second to that, we have a pair of service monitoring services that need to be informed how to re-query their bits. This server doesn't do SLP, so we don't have to worry about that problem quite yet.

I hope to post experiences after I have things in place. But the server is physically hosted not in my building, so I'll be away-from-desk the whole day I'm doing the install.


What makes a server a server?

Anandtech is running an article series about servers and what makes a computer a server. The first installment was posted today. In their words, the difference is:
Basically, a server is different on the following points:
  • Hardware optimized for concurrent access
  • Professional upgrade slots such as PCI-X
  • RAS features
  • Chassis format
  • Remote management
One of the things that makes my life a LOT simpler is the remote-access-card in our servers. We use HP for ours, so it's called an Integrated Lights-Out card, or iLO. That card is the difference between having to drive in (a 20-25 minute drive) and fixing something from my living room at home. The iLO gives me access to a power button from home, which is the most useful thing it does. Second to that, I have the ability to see what the server is presenting on the screen like a KVM; quite useful for Abend screens. I don't know if they even make things like that for 'home brew' servers.

RAS isn't Remote Access Service; it's an acronym for Reliability, Availability, Serviceability. Servers almost always have redundant power supplies, where desktop hardware doesn't. Servers almost always have hot-swappable hard drives, where desktop hardware doesn't. Some servers (especially with NetWare installed) even have hot-swappable CPUs. These sorts of features are slowly creeping into the desktop realm, but some just don't make cost-effective sense. With RAID becoming more and more prevalent in the home, hot-swappable drives in the home are going to become more common. But hot-swap power supplies? Probably not.

Chassis format is another key area. When you scale past a certain spot, racking your servers makes more and more sense. Once you get there, your options for whiteboxing your way to IT Glory become much less available. Engineering a 1U server takes work, since that form-factor is very prone to over-heating in the modern high energy environment. If you want to go blades... you'll have to go OEM instead of build your own.

The article concludes with a discussion on Blades and their place in the server market. And they make some really good observations. Blades are useful if you really do need 24 mostly identical servers. We have a blade rack with 24 servers, and that rack was filled within 12 months of its arrival. What filled it were the servers that were moving onto, "they want HOW much?" maintenance plans, and buying new was deemed the way to the future rather than keeping maintenance on the old stuff. We won't face that situation for another two years when another cluster of servers will hit that exalted 40%-of-purchase-price-a-year maintenance level. So our next series of server replacements will probably be a bunch of 1U and 2U servers rather than another blade rack.

But when the blade-rack hits the high-maint-cost level, we'll probably replace it with another blade system. By then, perhaps, the OEM's will have figured out how to future-proof their blade infrastructure so we can, maybe, have one set of blade chassis survive two generations of servers. Right now, that's not the case.
My latest patch-cycle worked pretty well following these procedures, so I wrote them up. Enjoy!



I'm back! Yay!

And what do I see on the Novell CoolBlogs?

Your Server Room. Looking for pictures of your pile of servers. We're actually in pretty good shape here at WWU, since we had a new datacenter built 5-6 years ago. We've fully Understood the joys of cable management and labeling. The rat's nests of bygone years are a thing of the past. Sterile racks, cables managed to within an inch of their lives (have to leave a bit of wiggle room), and labels on everything. It just doesn't make for very interesting pictures, unless that's your thing.

No guarantees that the labels are correct, but we do go through and fix that a couple times a year. Not too shabby.

Our biggest problem is electrical, as our UPS is at 84% capacity as of this morning. Due to building codes in our area, and some fancy footwork on the part of the parties that built our building and datacenter, it is vastly cheaper to just add a second UPS to the room and not expand the capacity of the one we have. That's happening in the next few weeks, right before we get our new router core (Cisco 6500-series chassis, though the one going in our datacenter won't have a router module *pout*).

Thermally we're in pretty good shape. Our biggest problem right now is getting fully vented front doors; that'll do wonders for rack temperatures. Happily, our few really dense racks already have fully perfed doors. Unhappily, we do have some racks with internal temperatures at the server air-intakes of 85 degrees; that is within operating spec, so we're not panicking too hard. Just not best-practice, and we know it. We know from experience that if one of our two AC units quits, the room rises in temperature by 7 degrees, so except for the racks with the 85-degree internal temps everything would still be within operating spec. Plenty of room for expansion there as well.

Things progress

So far, except for a few hitches, SP5 (and company) is behaving. There is an issue with the wsock6i patch, which required us to put in the wsock6j patch released yesterday. The Apache abend I posted a few back hasn't recurred, but I'm still keeping an eye on things. Still too early to really tell if the Memory Allocator problems are a thing of the past.

BlackBoard continues to be a thorn in our side, with SP1 coming out this quickly after a major rev. That tells you something right there. I'm not sure what exactly is broken, but it isn't enough to prevent normal classes being held.

During the 4 weeks we don't have students around here, I hope to get some changes to our AFP setups. Specifically, rename the volumes to be more cluster-friendly. This'll break a script in use in our few mac-labs that auto-mounts the user's home directory, so this will have to be a coordinated thing. But not having students around will make that easier.

Then we get to put together our home-brew BCC. We're not using any real BCC software because we won't fit over that particular barrel. So we're going to use a combination of split-clustering, rsync, and related technologies. I really really wish we had a fibre to connect the SAN up here to the one going in up on campus. That would make me happy. Alas. Maybe in another year or two.

Then in September we get to cut over our datacenter onto a Cisco 6509. That will be, er, fun. Yeah, fun. But, in the end, worth it.

And I go on vacation again! It has been a summer for that. I'll be back on the 16th. Until then.


Revised abend.log format?

That abend I posted yesterday had me looking at abend logs again. There were a few other abends last night (all is not quite well in Denmark), so I had some comparisons to do. Looks like Novell has finally dumped the 'loaded modules' list from the abend logs, and instead is giving detailed information about register contents. The stack list is still there and that's what I generally look at most, that and EIP. But this change results in MUCH smaller abend.logs.

Take a look at the example.

See? A lot shorter. That and a CONFIG.TXT will tell you what's running on the server, so I guess NTS doesn't need the loaded modules list in the abend.log anymore. Huh.


Oo, not good

Just had an abend on the server handling student MyFiles. And I don't like the look of this Abend.log. Icky.
Novell Netware, V6.5 Support Pack 5 - CPR Release
PVER: 6.50.05

Server STUSRV2 halted Wednesday, August 2, 2006 5:45:31.592 pm
Abend 1 on P00: Server-5.70.05-1937: CPU Hog Detected by Timer

CS = 0060 DS = 007B ES = 007B FS = 007B GS = 007B SS = 0068
EAX = FBF17BC3 EBX = 045F34E0 ECX = 045F3554 EDX = 00000046
ESI = 05ED3DE4 EDI = 05ED3DE4 EBP = 1AFCCF56 ESP = 96C5E970
EIP = 00000000 FLAGS = 00200002

Running process: Apache_Worker 145 Process
Thread Owned by NLM: APACHE2.NLM
Stack pointer: 96C5E988
OS Stack limit: 96C50840
Scheduling priority: 67371008
Wait state: 3030070 Yielded CPU
Stack: --FBF17BC3 ?
00114541 (LOADER.NLM|WaitForSpinLock+71)
--00000000 (LOADER.NLM|KernelAddressSpace+0)
0011435D (LOADER.NLM|kspinlock_patch+76)
8EF637FB (NWUTIL.NLM|_SCacheFreeMP+3B)
--045F3554 ?
--05ED3DE4 ?
--05ED3DE4 ?
8EF62BF5 (NWUTIL.NLM|NWUtilFree+25)
--045F34E0 ?
--05ED3DE4 ?
8EF571E9 (NWUTIL.NLM|dt$ConfigFile+C9)
--05ED3DE4 ?
--8EFB6280 ?
--05ED3DE4 ?
--8D098D60 ?
--1E8011B4 ?
8EF57211 (NWUTIL.NLM|CF_Delete+11)
--05ED3DE4 ?
--00000002 (LOADER.NLM|KernelAddressSpace+2)
003630FD (SERVER.NLM|FunnelingWrapperReturnsHere+0)
--05ED3DE4 ?

See the end of that stack trace? All those "CCCCC" entries? Me thinks something busted a buffer somewhere. We'll see if this reoccurs. Oh I hope I don't have another cycle of NetStorage abends. Those are hard on servers.


Blackboard patent


This will be bad. BlackBoard already has a near lock on the 'learningware' market, and this patent will help enforce that. Not good.

Let me put it this way. Of all the BlackBoard administrators I've spoken with, none of them admit pleasure involved with that part of their jobs.


Near heart failure

The thing that caused me to go into near heart failure was one server's misbehaving. I had just finished applying SP5 to one of the WUF servers and it was rebooting. It didn't come right up so I go back and look to see where it might be hung up. Instead what I see is the "Multiple abends occuring, processor halted" screen. Aie!

And it seemed to happen every time on reboot. Some quick looking showed that it happened pretty quickly after NILE.NLM loaded. But due to the nature of the problem, no abend.log was generated for me to know what exactly faulted.

So I drop 'er down to DOS, do a "SERVER -ns -na", and step through the launch process. During this time I noticed that the HP BIOS is expecting a "Windows" server. I thought that was odd, and put it on the list of things to change. I step through boot, and sure enough, during AUTOEXEC.NCF processing it bombs on the multiple-abends-occurring problem. Arrrg. Still don't know what module load caused the problem, though.

So I reboot and change the OS the BIOS is expecting to be NetWare, and reboot. It comes up just peachy.

oooookaaaay.... Didn't know it was that twitchy. Anyway, it is up and running just fine now.


SP5 (and extra bits) is in

You ResTek folk will want to double-check iPrint. I can't do it justice from here. Except for one bit of near heart failure (tell you about it tomorrow), things went well. And I learned that there is a new MOD_EDIR release (version 1.0.12) that lists as a fix not breaking when cluster failovers happen! I'm all over that.

Og need sleep. Ook.

I'm of mixed mind about this one (purely selfish reasons, of course), but Novell has released what they're calling "OES SP2, NW6.5 SP5 Update 1", which you can download here. This is not exactly new for Novell, since they frequently release 'update collections' for the Novell Client (I think "C" was released a week ago or so). But the important thing here is the patches they included in the update. From the TID:
This file contains the following files which have been previously released separately and are provided together in this patch kit as a convenience for installation:
nw65os5a.exe 2973639
n65nss5a.exe 2973412
nwlib6h.exe 2974119
wsock6i.exe 2973892
I was planning on applying three of the four, but the WinSock one I hadn't. Perhaps I should integrate that into my patch process, eh?

In the bigger picture, I think we just saw how Novell is going to handle the stretch between SP5 (a.k.a. OES SP2) and OES 2.0 (a.k.a. SP6).
