Upgrade oddities

Okay, I've now stumped two front-line Novell techies, and at least one back-line Compaq geek.

The problem:
The cluster partitions are not visible to the new node

The Symptoms:
  • In NSSMU, the partitions show as a specific size, but 0kb for Partitioned and Unpartitioned space.
  • In Monitor, the devices are right there, but nothing is behind them
  • In NSSMU, if you do a Scan For New Devices, you get a 526 error
A lot was done. Almost all of it unproductive. I'm hoping to get at Novell back-line support tomorrow. The list of what was done:
  • Back-rev the QL2300.HAM version to the one being used successfully by the NW6SP4 nodes that are working just peachy.
  • Back-rev the SCSIHD.CDM file to the SP1 version.
  • Upgrade the QL2300.HAM version to the newest certified version (dated 10/8/2004, included in betaSP3)
  • Upgrade the NSS code to N65NSS2B
  • Run the Novell supplied PARTFIX.NLM utility on it
    • This had the benefit of giving us an additional error to work with: "Partition size exceeds device capacity"
  • On the EVA, create a new virtual disk and present it to the upgraded node. Reboot, partition, create a pool, create a volume. Reboot. Extend the volume. Reboot. Rescan.
    • This worked exactly like it should. THIS volume is perfectly readable.
    • This bit is what caused Compaq to say that it isn't the driver misreading the partition table, but rather an error in the OS reading the partition-data supplied by the driver.
  • Performed a Pool Rebuild on a mostly-harmless cluster resource that was also very small. It did not cause things to re-present.
  • Discovered on my own that we have an odd thing. On the NW6SP4 box, one of the cluster drives reports 640GB capacity, 639.99GB partitioned. On the NW65SP2 box, the same drive reports 639.99GB capacity. Note which value this matches.
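
The capacity mismatch and the PARTFIX complaint ("Partition size exceeds device capacity") are at least consistent with a simple bounds check failing on the new node. Here is a rough Python sketch of that idea. Every number in it is made up for illustration (sector size, partition start, the size of the shortfall), and `partition_fits` is a hypothetical helper, not anything NSS or PARTFIX is known to do internally.

```python
# Hypothetical numbers throughout: the real sector counts and the actual
# check NSS/PARTFIX performs are not known from the notes above.
SECTOR_BYTES = 512
GIB = 1024 ** 3


def partition_fits(device_sectors, part_start, part_sectors):
    """Return True if the partition ends at or before the last device sector."""
    return part_start + part_sectors <= device_sectors


# Capacity as the older node might see it: 640 GB expressed in sectors.
old_capacity = 640 * GIB // SECTOR_BYTES

# Assume the newer node reports slightly less, roughly 0.01 GB fewer sectors.
shortfall = 20_480
new_capacity = old_capacity - shortfall

# Assume a partition that starts a little way into the device and was
# sized against the capacity the older node reports.
part_start = 32
part_sectors = old_capacity - shortfall

print(partition_fits(old_capacity, part_start, part_sectors))  # old node's view
print(partition_fits(new_capacity, part_start, part_sectors))  # new node's view
```

If the real layout looks anything like this, a partition that fits comfortably on the old node's view of the device overruns the end of the device as the new node sees it, even though the difference is only a few megabytes.
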
My theory is that there is something borked in the NSS data structures on the cluster drive. That stuff was created at least one full service pack ago, possibly two (possibly NW6SP2). I'm not sure, since that predates me. I've read some things saying that 'legacy' environments like ours have Issues, especially if the NSS drives have been extended in the intervening time, as ours have. The only sure-fire way to make it work is to back it all up and restore it from tape.

And at something like 1.4TB of data, that ain't gonna happen any time soon. We'll move to NW6SP5 first, and limp until Summer before that happens. There is a CHANCE that we'll need to get to SP5, upgrade to newer NSS code, and then verify/rebuild all of our pools to make it all work. Chancy, but it could work.

And all this because the Auditors don't like our use of FTP on the Novell cluster. Since OpenSSH isn't certified to work with NW6, we have pressure to upgrade to NW65 where it IS working correctly.


Was this resolved?

I did solve this. Take a look at this post.