January 2005 Archives

Print rates

Since I'm playing with printers today, I checked the print-rates:

11am-noon: 4,169 pages
noon-1pm: 4,282 pages

Last week at this time:

11am-noon: 4,681 pages
noon-1pm: 3,311 pages

NDPS again

This may not impact us much, but it looks like HP printers prefer a certain LPR queue.

http://support.novell.com/cgi-bin/search/searchtid.cgi?/10080373.htm

I'm doing some tests to see if this info changes our problems any.

NDPS problems again

We're getting this bug:
NDPSGW-3.01g-NDPS OPERATION ERROR(prhh154-2): The 208 operation to the NDPSM failed.
If needed, the Error Codes explanations in the online System Messages documentation may provide additional information about the following return codes.
Error code: -704 (0xFFFFFD40)
We've had it before. This is one of three ways that the printer agents in our pools can go unresponsive, and this is the first time I've caught one of them in the act. The -704 is probably an NDPS/NDPSM/NDPSGW error and not a DS error, even though they share the same error-code space. I'm not 100% sure on that one; I'll see what I can dig out of the developer docs.
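Since these errors just scroll past on the console, catching a printer agent in the act mostly means being lucky enough to be watching. If you run CONLOG, though, the console output lands in SYS:ETC\CONSOLE.LOG, and that file can be swept from a workstation. A rough sketch in Python; the mapped drive letter is an assumption for illustration:

    # Sweep a NetWare console log (captured by CONLOG.NLM) for NDPS
    # gateway errors, so unresponsive printer agents get noticed
    # before the users do. Assumes SYS: is mapped to drive S:.
    LOG_PATH = r"S:\ETC\CONSOLE.LOG"

    def find_ndps_errors(path):
        """Return console-log lines that look like NDPS gateway errors."""
        hits = []
        with open(path, "r", errors="replace") as log:
            for line in log:
                if "NDPS OPERATION ERROR" in line:
                    hits.append(line.rstrip())
        return hits

    if __name__ == "__main__":
        for line in find_ndps_errors(LOG_PATH):
            print(line)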

Disk space

It would seem that User3 is running a bit short of disk-space. I say 'a bit' since we haven't hit any walls yet; we have 15% free right now, which is good for a while. But how far back Salvage can reach is starting to take a dive as a result, since NSS enforces the 10% free-blocks rule.

One prime side-effect of declining Salvage availability will be an increase in restore-from-tape requests. Frankly, that's a pain in the tape-library, and I don't like having to go there. Neither do my office-mates. So we're looking into more space.

We do need a policy for defining when we add space to a volume. We've had this SAN for 1.5 years now, and this is the first time we've come close to knocking our heads on a space limit. Pretty good by my measure. One of us is looking into prime space-users and will be putting pressure on a couple of them. That might free up some gigs.

So this is as good a time as any to describe some of the built-in ways to gather disk-usage information out of Netware! My prime source these days is the Netware portal.

https://ourserver.admcs.wwu.edu:8009/

That thing. It came in with NW5.1 IIRC, possibly NW5.0, so it's been around a good long while. Very useful, and a lot of Netware admins are woefully unaware of the wonders it has in store. I aim to fix that.

Go there.

The very first page (after logging in, of course) is the page with the Volumes on it. Click on a volume, preferably one with user-data on it.

Now take a look at that list. In all versions you should see a list of the root-level directories. Depending on your volume type and Netware version, what other data you see varies a bit.

NW5.1: Traditional volumes have a "Size" column that gives the size of that directory tree. Very useful for things like tracking user-directory sizes. NSS volumes do not have the same advantage, so the Size column for directories is zero.

NW6.0: Works like NW5.1, but newer NSS code-bases may populate the Size field.

NW6.5: Fixes the NSS-has-no-size bug, so the field populates! Nice feature; I'm sure Novell caught a lot of flak for not getting it in there before.

Why the difference, you ask? Traditional volumes cache the entire FAT table in memory, which makes that sort of on-the-fly calculation possible and fast. It was also the reason traditional volumes took a l-o-n-g time to mount (one volume at my old job took 15 minutes). I'm not sure what changed to let Novell do this sort of thing in NSS; perhaps they built this very space-tracking feature into the filesystem itself.

For NW6.0 and NW6.5, there is another feature that you may not be aware of. Go back to that Directory listing, and look at the top row of icons. There is an icon there called "Inventory". If you click on that button, a java-thingy kicks off and inventories the volume. What you get in the end is a report detailing a wide variety of things:
  • A list of file-extensions, sorted by how much space they take up on the volume. Good for seeing how bad the MP3 collections are getting.
  • A list of file-extensions, sorted by how many files they have on the volume.
  • A list of users, sorted by how much space each user is using on the volume, as tracked by Owner attributes. Note: some backup and AV packages can 'steal' the owner-attribute from users, so this may not be reliable for tracking the true owners of files.
  • A break-down of last-accessed times for files: last day, last week, two weeks ago, a month ago, two months ago, four months ago, six months ago, a year ago, and more than two years ago.
  • A break-down of the creation-time for files, same as above
  • A break-down of the last-modified time for files, same as above
As you can see, quite the list of information. The data is stored at the root of the volume in a set of files named "volume_inventory"; there is an .html report and an .xml file as well. Salivate as you will. Sadly, as far as I've been able to discover, this inventory can't be kicked off automatically.
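Since the report lands on the volume as plain files, you can at least post-process it from a workstation once it has run. A sketch of the idea in Python, with a big caveat: I'm guessing at the XML layout here, so check a real volume_inventory.xml and adjust the tag and attribute names to match:

    # Summarize space-by-extension from the portal's volume-inventory
    # report. CAVEAT: the tag and attribute names below are guesses for
    # illustration; inspect a real volume_inventory.xml and adjust.
    import xml.etree.ElementTree as ET

    def top_extensions(xml_path, count=10):
        """Return the `count` biggest file extensions as (bytes, name) pairs."""
        totals = []
        for ext in ET.parse(xml_path).iter("extension"):  # assumed tag name
            name = ext.get("name", "(none)")              # assumed attribute
            size = int(ext.get("bytes", "0"))             # assumed attribute
            totals.append((size, name))
        totals.sort(reverse=True)
        return totals[:count]

    if __name__ == "__main__":
        # Assumes the volume root is mapped to drive U:.
        for size, name in top_extensions(r"U:\volume_inventory.xml"):
            print("%15d  %s" % (size, name))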

Two things, quite useful. Enjoy!

Brainshare band rumors

To quote:
okay...so this band is not a 1960s, 70s or even 80s band. They began in 1994 and have released 3 or 4 CDs since then, and from what I understand, they are currently in studio recording their next CD. I don't believe they were actually one that was suggested earlier.
"Whoa" go I. A band I might have actually listened to as a kid. Possibly an Alt-Band.

So I go check the brainshare site.
The BrainShare 2005 conference party—to be held Wednesday, March 23, at 6:30 p.m. in the Delta Center—will feature Train, whose first album, "Drops of Jupiter," went platinum. Its title single, "Drops of Jupiter (Tell Me)," spent a total of 53 weeks on the Hot 100 before winning a Best Rock Song Grammy and Best Arrangement Grammy. Check out some of the pieces from Train's new album, "My Private Nation," on the group's Web site.
Train!

Excitement aftermath

The person responsible for sheep-dogging the switch reboots this morning arrived to find that the cluster was already down. It wasn't a split-brain down; no servers were sitting at that special screen. The cluster was just... down. By the time we got the chance to look at the cluster event-log, the events detailing why this special circumstance occurred had scrolled off the back of the log.

That log needs to be longer. This isn't the first time I've had key data fall off the back of the log this way. So far I haven't found a way to extend it.

In other news, everything else went fine.

Excitement tomorrow morning

Our telecom section has informed us that they need to reboot the switches in our machine room for IOS upgrades. This has taken some effort to arrange. We have clustered services in there, and when the switches reboot, the heartbeat signals will go away. Then the clusters will do what clusters are supposed to do when heartbeat goes away: freak out.

One of us will be here tomorrow morning to issue that fateful command, "cluster down", and babysit the whole process. Once everything is turned off, he'll give the OK to the Telecom dude to perform the upgrade and reboot. Once the switch is back up, up will come the servers, and the babysitting of the cluster resurrection will commence.

If that behaves like the few times I've had to bring large numbers of cluster services up at the same time (such as when the NDPS problem took out the entire student half of the cluster), then something WILL go wrong. Some volume mount or something will cause one node to lock up hard, and that resource won't come up for 5 minutes or so. It'll be rocky, but we'll get it up.

Funny TID

Okay, someone over there at Novell got away with something PR wouldn't approve of.

Like this.

ADD SECONDARY IPADDRESS 192.168.2.25
CLUSTER START DHCP .CN=DHCP_NovellRocks-FS1.OU=OU1.O=O1.T=DownWithMS
LOAD DHCPSRVR -D2

New patches

Novell has released a new LibC patch! This is somewhat exciting for those of us who have been fighting issues related to NXCreatePathContext. To quote:
- Fixed a problem when authenticating to a server with no replica. Seen by Apache and it would return error code 83.
The 'no replica' issue has been with us since before the NWLIB6A patch. In fact, 6a fixed it for the most part. As you can see here, it is still something of a problem.

As with anything, hope springs eternal. Maybe this one will fix my problems.

Edit: Unfortunately, mod_edir starts throwing "error: 83" problems. This is the "can't create server identity, error: 83" issue I've had in the past, before NWLIB6A came around. Back-revving from NWLIB6B to NWLIB6A fixed the issue with no changes to the config file.

Huygens probe data

[Titan image] This is an image from the Huygens probe, as grabbed from space.com. The only processing I've done to it is enhancing the contrast. The bottom image shows it best: that looks a LOT like fluid flowing at a specific grade. You can make out what look to be streams, or at least fluid-filled canyons. The exact landing spot of Huygens isn't known yet, but this is some interesting stuff.

MyWeb & fun

It looks to me that the combination of:

Apache2
LibC update NWLIB6a
MOD_EDIR 1.0.8
Home directories on cluster nodes

Just plain isn't ready for prime time. The hard part is that the buggy system call seems to be NXCreatePathContext, which is very touchy about what it accepts as valid results. For reasons I've not been able to determine, I've been getting error values of 111 (generic filesystem error, see filesyserrno) and 116 (generic NCP error, see h_errno), both of which are not much with the useful. The really sad part is that I can't tell when these errors get thrown, so MyWeb for students can go down for days before someone thinks to call in about it.
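In the meantime, the only mitigation I can think of is external: poll a known MyWeb URL every so often and holler when it stops serving pages. A minimal sketch; the URL is a placeholder, and the 'holler' here is just a print:

    # Poll a MyWeb URL to catch the Apache2/mod_edir failure mode where
    # the server stays up but quits serving home directories.
    # CHECK_URL is a placeholder, not our real address.
    import urllib.error
    import urllib.request

    CHECK_URL = "http://myweb.example.edu/~someuser/"

    def myweb_is_healthy(url):
        """True if the URL answers with a 200 within 15 seconds."""
        try:
            with urllib.request.urlopen(url, timeout=15) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            return False

    if __name__ == "__main__":
        if not myweb_is_healthy(CHECK_URL):
            print("MyWeb check FAILED -- go poke mod_edir")

Wire that into a scheduled task and the days-long silent outages at least get shorter.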

One of the more frustrating things is the mod_edir module itself. It is technically open-source, but since it compiles to an NLM I can't just 'pop in' and make edits to dump the values myself. Doing so requires a copy of Metrowerks CodeWarrior with the Novell libraries, and that's a grand. Sorry. I've heard there is a way to bugger GCC on Linux so it'll cross-compile NLMs, but I'm not THAT much of a geek to be able to pull that off and expect it to work. It doesn't help that the two developers behind the product are not terribly responsive to problems like these.

What I need right now is better error trapping and reporting in mod_edir, better error handling in NXCreatePathContext, or the ability to tweak the code myself.

Fun with SNMP

...only not me.

Someone else around here. Checking toner levels with homebrew SNMP code? Nifty. Perhaps someday Novell will put hooks for SNMP status checking into the NDPS Gateway. Who knows.
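For anyone tempted to follow suit: the standard Printer MIB (RFC 1759) makes this pretty approachable, since prtMarkerSuppliesLevel (.1.3.6.1.2.1.43.11.1.1.9) and prtMarkerSuppliesMaxCapacity (.1.3.6.1.2.1.43.11.1.1.8) hold the interesting numbers. A rough sketch that shells out to net-snmp's snmpget; the printer hostname and community string are placeholders:

    # Report toner level as a percentage using the standard Printer MIB
    # (RFC 1759), by shelling out to net-snmp's snmpget. The hostname
    # and community string are placeholders for illustration.
    import subprocess

    LEVEL_OID = ".1.3.6.1.2.1.43.11.1.1.9.1.1"  # prtMarkerSuppliesLevel
    MAX_OID = ".1.3.6.1.2.1.43.11.1.1.8.1.1"    # prtMarkerSuppliesMaxCapacity

    def snmp_int(host, community, oid):
        """Fetch one integer-valued OID via snmpget (-Ovq = value only)."""
        result = subprocess.run(
            ["snmpget", "-v1", "-c", community, "-Ovq", host, oid],
            capture_output=True, text=True, check=True)
        return int(result.stdout.strip())

    def toner_percent(host, community="public"):
        level = snmp_int(host, community, LEVEL_OID)
        capacity = snmp_int(host, community, MAX_OID)
        return 100.0 * level / capacity

    if __name__ == "__main__":
        print("Toner: %.0f%%" % toner_percent("printer1.example.edu"))

Some printers report -2 ("unknown") or -3 ("some remaining") for the level instead of a real count, so sanity-check the raw values before trusting the percentage.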

Printing!

Printing in most of the labs was down for a chunk of today. I can't explain why it happened, just that it happened and we're aware of it. It's one of those things that doesn't go 'bing' when it fails, so we only hear about it when ATUS gets calls. We managed to kick 'em back into gear in due time. Sadly, this is one of those things that can fail (and has failed) at 1:22am right before finals week. I got that call once.

Netware 6.5 and memory

We've had NW6.5 across the whole cluster for a week now, and already we're getting a feel for our particular problems. The one that rises above the rest is memory handling. It's just plain crankier than it was in NW6.0. Novell knows this, and has published a frequently-updated TID on the topic:

TID10091980 Memory Fragmentation Issue with NetWare 6.5

The old Netware hands among you may remember w-a-y back in the day, when what's now called the Traditional filesystem required a bunch of tweaking to make it work well. Well, we're back there again, only for memory-handling this time.

To give you an idea of what we're facing, here are some numbers from our servers.

Stat                      StuSrv1             StuSrv2
Total System Memory       2,147,062,784       2,147,062,784
NLM Memory                424,497,152         435,085,312
File System Cache (NSS)   1,715,937,280       369,316,840
Mounted Volumes           Stu1, Stu3, Class1  Stu2, Stupublic

(memory figures in bytes)

Note the b-i-g difference in File System Cache. StuSrv1 has more mounted volumes, and they're bigger, but that does not account for a 4.6x difference in file-system cache usage. The actual difference in mounted, used disk-space is closer to 2.5:1, not 4.6:1. Why is this happening? I don't know.

Our Cache Balance setting IS set to 85%, but why only some servers actually grow into that allowance is not clear at this time. We never came close to that number with NW6.0. We're also running the latest, bleeding-edge NSS modules, thanks to the troubleshooting required to get NW6.5 into our cluster. Right now I'm setting things back to 60% to see what we get.
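(For the record, that's a console-side nss /CacheBalance=60; putting the same line in autoexec.ncf should make it stick across reboots, if memory serves.)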

All of our cluster nodes list "Fragmented Kernel Space" in the 10-25% of memory range. So far that hasn't been a true problem. The TID above lists ways to handle it, but our servers haven't been up on NW6.5 long enough for us to get a true feel for 'normal load' yet. Plus, the reboots required for the settings to take effect still incur service outages on the cluster (an outage that lasts 15 seconds is still an outage), so it takes scheduling to get changes in.

Our NDS servers have also had memory-frag issues. I've heard rumors that this is associated more with eDir 8.7.3 than with Netware 6.5 itself, but either way it remains. There are some DS settings you can use to try to reduce the frag numbers.

Printing

WWU Libraries killed their elderly Novell server that had been handling their printing and migrated things over to the cluster. I can tell. Here it is, the first week of class, and between 12:00 and 1:00 we printed a hair over 1000 jobs. To give a bit of perspective, finals-week last quarter we were running around 500-600 jobs/hour.

Libraries goes through a LOT of paper.

Edit: I ran the report, and found that in that period we printed 3145 pages.

Welcome to 2005. Remember "05" on your checks.

The cluster has survived the weekend on NW6.5. However, I'm not liking some of what I'm seeing. From some of my other servers, it looks like NW6.5 is more vulnerable to memory fragmentation than NW6.0 was. We had one cluster node lock up hard two days ago, which caused failovers.

One of the services that didn't completely survive was MyWeb for FacStaff, the webserver that serves this very blog. After looking into what went screwy, it looks like the server served pages for several hours before it started returning a "111" error. When I look up that error, I find it is "Generic file system error; see filesyserrno". From a previous round of troubleshooting mod_edir and LibC, I know that filesyserrno is where the underlying error gets trapped; in this case it is not being extracted to the logfiles, so I'm not sure what it returned. The only way I know to grab it again is to set a breakpoint, and that just isn't a nice thing to do to a cluster node. What it does tell me is that Something went wrong and the code couldn't handle it.

This is an example of a problem in LibC, not mod_edir. LibC is the Netware library that is multi-processor aware, long-filename aware, and getting more POSIXy as time moves on. The old CLib library, which began life in the NW2.x products, is none of these. Apache 1.3 and its accompanying modules, mod_hdirs and mod_rdirs, were ported to Netware using CLib. This is why the Apache 1.3 version of MyWeb is far more stable than the Apache 2.0 version (linked to LibC instead of CLib).

MyFiles (a.k.a. NetStorage) has survived the move to NW6.5 very well. It is getting used, and I now have logfiles to gauge usage. Also important: since we're using Apache 2.0 instead of Apache 1.3, I can now use rotatelogs to keep my logfiles from getting insanely large. We had some MyFiles outages in the NW6.0 days due to the error_log file hitting 4GB.
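If you haven't set that up, it's a one-line change: point the log at a pipe, something like ErrorLog "|sys:/apache2/bin/rotatelogs sys:/apache2/logs/error_log 86400" in httpd.conf to start a fresh file every 24 hours. (The paths there are guesses for illustration; point them at wherever your Apache2 actually lives.)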