December 2004 Archives

Upgrade complete

The cluster upgrade was completed last night. At 1:45am, I got to watch as all the cluster-volumes went offline suddenly, moved to the Upgrade state, and then migrated back to their starting places. It was interesting. And definitely a reason to do this during break rather than when school is in session. No issues of note; this WAS the sixth one I've done so far ;)

Cluster upgrades

Wednesday night/Thursday morning I'll be upgrading the sixth and final node in the Novell cluster to NW6.5. Since this node hosts the Software volume, during this upgrade the W: drive will not be available. As this will be in the 1am-3am range, I don't expect many people to notice this.

However, since this is the last node in the cluster to get upgraded, this upgrade has an extra step. After NetWare reboots following the file-copy and starts the upgrade in earnest, it takes a step guaranteed to cause heart-stoppage during term.

It'll take every single cluster resource off-line. All of 'em.

Then it'll mount the volumes on the node I'm upgrading, but in an unusable 'UPGRADE' state. It'll then upgrade the data structures in the partition tables and NSS trees to the format for NW6.5. Reportedly, this doesn't take long. When this is complete, it'll move all the volumes back to where they came from and put them online. Clearly, this isn't something we could do during term.
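
If you're the sort who wants to watch this from the console while it happens, the usual Cluster Services commands should be enough (assuming NW6.5 keeps the same console syntax, which it appears to; the resource and node names are whatever yours are called):

CLUSTER VIEW
CLUSTER STATUS
CLUSTER MIGRATE <resource> <node>

CLUSTER STATUS is the one to keep an eye on; every resource will flip through its states as the upgrade does its thing. MIGRATE is there in case something lands on the wrong node afterward and needs a nudge home.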

Sftp on Netware

I found some undocumented documentation:

Modified to allow sftp browsing of remote servers. This requires that the remote servers both 5.1 and 6.0 be updated to the latest version of LibC. Sept. 27, 2004 for 6.0 and Sept 20, 2004 for 5.1

I found this in a CVS note on the Forge OpenSSH site. This would explain why I wasn't getting good results; the account I was testing with had its home-directory on a not-yet-updated NW6.0 machine. The patch in question is the same NWLIB6A patch that I installed on all the Student servers back in October as part of the great NDPSGW problem. Fac-side hasn't been touched, since it hasn't had the same issues. The student side is ALL NW6.5 with the LibC patch, so student-side SFTP should work just peachy.

Or rather, it should. I haven't completely verified it yet.
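
When I do get around to verifying, the test itself is simple enough (the hostname and paths here are made up for illustration):

sftp someuser@student1.example.edu
sftp> ls
sftp> cd some-folder-on-a-remote-volume
sftp> get somefile.txt

The interesting part is the cd into a directory that lives on a volume hosted by a different cluster node; that's the case the LibC update is supposed to fix.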

VNC & Netware

It would seem that the combination of the NWLIB6A patch and NW6.5 makes VNCSRV develop a memory leak. Fortunately, there is a fix at the VNC Forge Site. VNCSRV is not supposed to take up 564 megs of memory.

Status

Christmas Eve and all's well.

It spreads

Harrr.

A miracle happened, and now the whole student half of the cluster is being served out of NW6.5. The last two updates were really nifty, since I could get a node upgraded in about an hour. The upgrade went something like this:
  1. Perform standard Upgrade
    1. Remove CPQMPK and CPQSHD.CDM while doing so
  2. Apply NWLIB6A patch
  3. Apply APR211 patch
  4. Apply AP2052 patch
  5. Apply N65NSS2B patch
  6. Apply iprint11b patch bits (sort of...)
  7. Apply ncpfsp patch
  8. Apply nslockpatch
  9. Apply sshpt2 patch
  10. Apply tcp657 patch
  11. Copy apache config-file for NetStorage from the working node
  12. Copy mod_edir from the working node to the new node, overwriting the old one
  13. Copy sshd_config from working node
  14. Copy the trio of NCF files from the working node that kick off MyWeb and MyFiles
  15. Copy the ftp NCF file from the working node to get the SSH bits in
  16. Rearrange AUTOEXEC.NCF so the apache startups come nearer to the end (rough sketch at the end of this post)
  17. Reset server
Tada! Looks like a lot of steps, but the bits in the middle actually go really fast. All those patches are needed since we're pretty close to SP3 these days, and well, we need 'em.
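
For the curious, the tail end of AUTOEXEC.NCF on the upgraded nodes ends up looking roughly like this. The NCF names below are stand-ins rather than a verbatim copy of ours; the point is just the ordering, with the apache startups pushed down so the volumes and eDir are up before the web stuff wants them.

# ... drivers, protocols, SLP, and cluster startup above this point ...

# apache startups moved down here so everything they depend on is already loaded
# (the three NCF names below are illustrative stand-ins, not our real ones)
admsrvup
myweb
myfiles

Order is the whole trick; if apache comes up before the things it serves from, you get to reset the server a second time.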

Upgrades continue

It would seem that Server#2 has been upgraded in under an hour. For this I have to thank all the work I put in over the last week and a half upgrading Server#1 in the cluster. It also helped that I avoided the 'feature' I ran into during the upgrade of the other server, which saved me a couple of days right there. Then it was just applying all the right patches, moving config files to where they needed to be, and rearranging the load-order of some things, and whammo!

Two thirds of the student cluster are now at NW6.5. All of it should be done by Monday. Or later today, if a Miracle strikes and no one has files open on the last node. It's the DAY BEFORE CHRISTMAS EVE, people, GO HOME! We are, and we're paid to be here.

Next week we do Faculty side, and that'll be interesting. Older hardware, and one of the servers has a direct-attach volume that all of campus maps to (don't ask), which means that one has to be done after hours.

A bit of humor

>1) be at the *actual* server you wish to install the licenses.
>2) (spin twice clockwise while chanting "Bill is my hero."
>3) call 888-571-2048
>4) navigate to the proper area by saying "Terminal Services"
>5) standing on your left foot
>6) at this point an phone-elf will appear
>7) you will need to navigate to All Programs/AdminTools/Terminal Services/Install License
>8) give the elf the License Server ID
>9) switch to your right foot at this point, or everything will break
>10) our enrollment number is: [redacted]
>11) stand back and cover your eyes
>12) wait for the magic to happen
>13) you can now return to both feet

OpenSSH in the cluster

T'aint working yet. SSH had a bug in it for a long time where you couldn't access volumes not hosted local to the SSH server. This was theoretically fixed in the sshpt2 patch; unfortunately, I can't say that it was. SSH, or more specifically sftp, is a big push from on high due to the insecure nature of FTP. FTP-on-NetWare is our LAST insecure connection method from off campus; IMAP on Titan was the other holdout.

Some of the folks in other parts of campus have had fair success, I hear. Perhaps it's the clustered nature of our environment that changes things. Cluster-volumes aren't 100% the same as normal volumes, and that can throw some software.

SSH-to-NetWare doesn't work, since NetWare isn't a shell-server. You can try, but all you'll do is spam the console with a failed login attempt and get denied. Sort of like forgetting what machine you're on and attempting to su to root... only you don't own that box. OpenSSH's logging on NetWare is more annoying, as it spams the console and not just the logger screen.

Looking over the ftp log-files, I see we have one student who is uploading a webcam image every 5 minutes or so. Should that get moved over to sftp, I do NOT want to see each and every sftp connection logged to the console. Do NOT need that.
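
If it ever comes to that, the standard OpenSSH knob is LogLevel in sshd_config. Whether the NetWare port honors it exactly like the Unix builds do is something I'd have to test, but the change itself is one line:

# sshd_config -- the default is INFO, which is what chatters on every connection
LogLevel ERROR

Something between QUIET and ERROR should keep the console readable while still recording genuine failures.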

MyFiles improves

It looks like the WebDAV HTTPS bug is improved in the newer version of NetStorage (a.k.a. MyFiles). In the old version, WebDAV to MyFiles would get you a listing... but the sub-folders all attempted to connect by way of http instead of https. And since we had http turned off for MyFiles (it IS unsecured, after all), that presented problems.

The new version apparently works normally! At least two people who were broken are working now. For the loyal readers among you, the student MyFiles is currently running on the NW6.5 box, in case you want to give it a try.

NetStorage figured out

Took a call to Novell to figure it out, but we got it. A few lines were missing from the config file.

<IfModule mod_jk.c>
JkWorkersFile "sys:/adminsrv/conf/mod_jk/workers.properties"
JkLogFile "logs/mod_jk.log"
JkLogLevel error
</IfModule>

A little thing. The key entry is JkWorkersFile, as that is what tells Apache how to get at Tomcat's actual worker-bits. Once that was in, the 500 server-error failures went away and things started working again. The interface is slightly different than the NetStorage that came with NW6, but not unworkably so. One menu bar is blue where the other was grey. It'll still probably generate calls when things go live, users are like that, but nothing fatal.
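
For reference, a minimal workers.properties looks something like the below. This is generic mod_jk boilerplate rather than a copy of Novell's file, and the port shown is the stock AJP default, which may not be what the NW6.5 install actually uses:

# workers.properties -- tells mod_jk where Tomcat's AJP connector lives
worker.list=ajp13
worker.ajp13.type=ajp13
worker.ajp13.host=localhost
worker.ajp13.port=8009

Without JkWorkersFile pointing at something like this, Apache has no idea where to hand requests off to Tomcat, which is where the 500 errors were coming from.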

MyFiles hint:

If downloads aren't working, switch to "text mode". It's the WordPad-looking icon towards the top of the menu in the right-hand window. It doesn't use the fancy script to download the file; it just accesses it direct-like.

Oddity fixed!

Found the problem for the thing I posted last night!

If I comment out CPQSHD.CDM in the STARTUP.NCF file and let SCSIHD.CDM perform those duties, the partitions are correctly read and usable. The version of CPQSHD.CDM I was using was 2.02, dated mid-April 2004. So far as I can tell, that's the latest. But now that node is working correctly.
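
For reference, the relevant bit of STARTUP.NCF now looks more or less like this (trimmed down; the other driver lines aren't interesting):

# CPQSHD.CDM 2.02 and NW6.5 don't get along on these boxes, so let SCSIHD.CDM do the work
# LOAD CPQSHD.CDM
LOAD SCSIHD.CDM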

Now to figure out the NetStorage thing.

Upgrade oddities

Okay, I've now stumped two front-line Novell techies, and at least one back-line Compaq geek.

The problem:
The cluster partitions are not visible to the new node

The Symptoms:
  • In NSSMU, the partitions show as a specific size, but 0kb for Partitioned and Unpartitioned space.
  • In Monitor, the devices are right there, but nothing is behind them
  • In NSSMU, if you do a Scan For New Devices, you get a 526 error
A lot was done, almost all of it unproductive. I'm hoping to get to Novell back-line support tomorrow. The list of what was done:
  • Back-rev the QL2300.HAM version to the one being used successfully by the NW6SP4 nodes that are working just peachy.
  • Back-rev the SCSIHD.CDM file to the SP1 version.
  • Upgrade the QL2300.HAM version to the newest certified version (dated 10/8/2004, included in the beta SP3)
  • Upgrade the NSS code to N65NSS2B
  • Run the Novell supplied PARTFIX.NLM utility on it
    • This had the benefit of giving us an additional error to work with: "Partition size exceeds device capacity"
  • On the EVA, create a new virtualdisk and present it to the upgraded node. Reboot, partition, create a pool, create a Volume. Reboot. Extend the volume. Reboot. Rescan
    • This worked exactly like it should. THIS volume is perfectly readable.
    • This bit is what caused Compaq to say that it isn't the driver misreading the partition table, but rather an error in the OS reading the partition-data supplied by the driver.
  • Performed a Pool Rebuild on a mostly-harmless cluster resource that was also very small. Did not cause things to re-present.
  • Discovered on my own that we have an odd thing. For one of the cluster drives, the NW6SP4 box reports 640GB capacity and 639.99GB partitioned. The NW65SP2 box reports 639.99GB capacity. Note which value that matches.
My theory is that there is something borked in the NSS data structures on the cluster drive. That stuff was created at least one full service-pack ago, possibly two (possibly NW6SP2). I'm not sure, since that predates me. I've read some things suggesting that 'legacy' environments like ours have Issues, especially if the NSS drives have been extended in the intervening time, like ours have. The only sure-fire way to make it all work is to back it all up and restore it from tape.

And at something like 1.4TB of data, that ain't gonna happen any time soon. We'll move to NW6SP5 first and limp along until Summer before that happens. There is a CHANCE that we'll need to get to SP5, upgrade to newer NSS code, and then verify/rebuild all of our pools to make it all work. Chancy, but it could work.
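
For my own notes, the console side of a verify/rebuild is at least simple, even if the runtime isn't (the pool name below is made up, and both operations want the pool deactivated first, so this is a maintenance-window job):

nss /PoolVerify=USERPOOL
nss /PoolRebuild=USERPOOL

Verify just reads and reports; rebuild is the one that actually rewrites the structures, and on pools this size it would take long enough that it has to wait for Summer anyway.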

And all this because the Auditors don't like our use of FTP on the Novell cluster. Since OpenSSH isn't certified to work with NW6, we have pressure to upgrade to NW65 where it IS working correctly.

NetStorage sort of working

It loads, and you can log in and stuff. But downloading files is not so much with the working. Same issue we had a few months ago, where attempting to download a file gives you a 404 error of "DownloadFile not found" or somesuch. As this is a complete service pack newer than the old software that had this problem, this shouldn't BE a problem.

Grrr.

NetStorage not working

Okay... mod_xsrv is refusing to load correctly. This means that NetStorage (a.k.a. MyFiles) can't run on the new node until I get it worked out.

Neck deep in upgrades

We're in interterm right now, so we're really busy. I'm neck deep in an upgrade of NW6 to NW6.5 on the cluster. And it isn't going well.

Neat trick I learned today:

apache2 shutdown -p OS

That'll shut down Apache JUST in the OS address-space, leaving ADMINSRV and any other address spaces you have running (such as, oh, MYWEB) still serving requests.
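
Presumably the same trick works for the other address spaces too; I haven't needed it yet, but it should just be a matter of naming a different space:

apache2 shutdown -p MYWEB

The -p flag is what names the address space, so only the instance living in that space goes down.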

Mmmm... finals week

Our printing rate is down. About a third to a half of what it was Monday at about this time.

The rate of space-loss on the student user volumes also picked up a hair this week as final projects are saved.

Backup speeds

The GigE switch is in, the jacks are wired. Now to plug servers into it and see if we get any increased speed out of the thing. I'm hoping we will, but getting a new network cable into production systems is a touch tricky. Tonight half our Exchange cluster will land on a server on GigE, which will give us a better idea of how I/O- versus CPU-bound the Exchange backup is.

Exchange front-end thing

As I talked about recently, we've been having some oddities lately. We found out it wasn't logfiles that were killing us (though that was part of it); it was that the priv1.stm file had grown to a bit over 13GB for no known reason. I grabbed a copy offline and ran BinText on it, which revealed that it is chock full of virus mails going back as far as September.

I don't know why they're parking there. GroupShield probably has something to do with it, but I couldn't tell you what. System Manager doesn't show any mailboxes with that kind of size on that server (only System boxes exist there anyway), and the mail queues don't have anything that large. September is long enough ago that it should have been purged by queue clean-up and mailstore cleanups. No go.

Odd.

NDPS: Drat

Had an abend about 12:50. Crap.

NDPS resolution?

Late yesterday, Novell shipped us actual patch code for our NDPS problem. Previously, all they'd sent us were debug builds designed to populate variable space with as much data as they could cram in, so that when we got crashes they'd have more to work with. This particular module is the usual NDPSGW.NLM size; the debug builds were about 23K larger than the 'real' ones.

Also last night, I turned SNMP polling back on for all of our printers (all 61 of 'em). That had been correlated somewhat with the abends we were getting, though not definitively; our abend frequency went down when we turned it off, but the abends didn't go away. I don't have a lot of time left in Finals Week to have students pounding on the NDPSM, so I needed to get the fixes in and set the environment up for a failure as soon as I could.

If we get through Finals Week without a crash, we'll be in a very good spot for our desired upgrade of the cluster from NW6 to NW6.5.

No crashes yet!

Finals week

It is finals week. Not much happens during this week, except frantic attempts to repair broken services. Like printing! We haven't had it go out yet, but this has been the quarter for printing problems.

LDAP, eDir, and Solaris

One of the continuing projects we have around here is to try and get Titan to accept an eDir username/password for login. This is quite doable these days; we just haven't done it yet. We're hoping to get the kinks out of the system in time for Spring quarter, though that may slip a bit.

How this works is to use PAM to point at eDir as its auth-source. This will have interesting side-effects, since we have to attach an aux-class to each user that needs a Titan account and then populate the required fields (among them, a UniqueID and a UID number). The hope is that we can rig it so that we don't have massive permissions problems on Titan. Thus the test.
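
A rough sketch of what the Linux side of this looks like, assuming pam_ldap/nss_ldap are the modules in play, and with the hostname and search base made up:

# /etc/ldap.conf -- where pam_ldap finds eDir (host and base are placeholders)
host edir1.example.edu
base o=OurOrg
ldap_version 3
ssl start_tls
pam_login_attribute uid

# /etc/pam.d/sshd -- try eDir first, fall back to local accounts
auth     sufficient   pam_ldap.so
auth     required     pam_unix.so try_first_pass
account  sufficient   pam_ldap.so
account  required     pam_unix.so

The pam_unix fallback is just so I don't lock myself out of the test box while fiddling. The aux-class and UID-number population on the eDir side is the part that takes the real work.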

So I decided to get a jump on things and see if I can make it work in a Linux environment. Only I chose my favorite distro to deploy on, Slackware. And Slackware is the only distro known to man that does NOT have PAM support built in. It seems the developer of the distro doesn't trust PAM's security model, and is smug in the fact (the way Linux geeks get smug at MS geeks) that Slackware hasn't required a lot of patches thanks to PAM faults.

Ergo, this is going slow.

Things are different here

Life is rather definitely slower here at NewJob than at OldJob. At OldJob I was hourly instead of salaried. That by itself is, in this case, a better thing when it comes to de-stressing. So far the instances where I've put in 'unpaid overtime' have been balanced out by 'unofficial comp-time'. This is not always the case for salaried people in my line of work. Also, in 1999 I was one of the top OT earners at OldJob thanks to Y2K (largely due to a GroupWise upgrade from GW4.1 to GW5.5).

There are whole days when I'm doing nothing but my daily duties. This doesn't sound significant, but thanks to lots of automation my daily duties are a small percentage of my total work. I have time for this blog, for one, and that's something I didn't have time for at OldJob.

Not that OldJob was terribly stressful. It had its ups and downs, but I kept up. It was sort of like a long-distance jog once you get into the rhythm of it all; hard, but I can keep doing this for a l-o-n-g time. But like that long-distance run, I almost never had 'downtime' where I was thumb-twiddling.

My presence here has also freed up enough time that the senior of the two admins here can focus on automation projects. He created a system for automating various things (such as account creation, intruder-lockout clears, password resets) a number of years ago, and in the time between when my predecessor left and when I arrived, almost no work was done on it due to overwork. Now that I'm here to take up workload, he's been downright buried in it and has managed to push forward a major improvement in functionality. Fact is, the next 6-12 months should see a complete rewrite of the system, something that couldn't have happened when it was just the two of 'em around here.

Yep, life is slower here. Part of me misses the bustle of being that involved in things. But the rest of me points to the relative loss in salary I took by moving out here, and that other part quiets down a bit. I'll get my share of crap soon enough.

Time Passes

A year ago today I left OldState for the drive out here. My first day of work here at NewJob was on 12/8. My how things have changed over time.