Oh, today just got very exciting. I was pulling out a 1U rack to see exactly what memory we had in the thing. This is one of those newer Compaq DL360's, and they don't have a cable-management tray. So I was back there unplugging things so I could pull it out. I get things cleared, and I gently shove it forward. It hit the stop on the rail where I have to go in front and unhitch it to go forward even more (I needed it out far enough so I could open the top). As I'm stepping down from the stool in back, one of the rails gives way and the thing plummets to its doom.
It was installed in U41 so it fell a good 7 feet to bounce off of the computer room floor. It left a black streak where it hit, and it came to rest on its top three feet from the impact point with some plastic bits thrown 15 feet down range. I wasn't hurt, though I did lose a fingernail somewhere along the way. One of the rails just came loose from the front post, and since there wasn't a server under it, it just went head over tails.
The chassis is bent in a number of places. One of the two drives is unremovable because of bent metal. The other drive doesn't rattle when gently shaken, so that's all for the good. It had an OS on it; I was just waiting for telecom to get me a live ethernet cable. And there are mysterious plastic bits rattling around the inside.
This server wasn't in production yet, and was installed by HP. So this is clearly an installation error of some form. And we're treating it as such. This is a "Gold" server, and the server is quite clearly 'down'. I've never bounced a server quite this way before, and I don't like the experience.
I spent a day or two trying to get Novell Account Manager 3.02 to work in my VPC setup. Wow, does it have Issues. First off, the 3.0 build I had wouldn't work in the Win2003 setup, which had me spending way too long figuring out that I needed 3.02 to install on Win2003. Aie. Then the 3.02 install, which looked prettier, persistently crashed the register-service process. Even when I set compatibility levels for a Win2K level on the exe (deconstructing the build process in the, er, process) it still crashed.
Grumbling, I pulled out a Win2K build I had and tried it on it. That, thankfully, worked. I got the manager installed and to the management web-page.
Then the Agent installer crashed when trying to configure it. Dr. Watson stuff, not LDAP-fu.
At this point, I'm wondering if it was some Windows Update that screwed things up. I'm not sure if it is worth it to pursue. What I'm seeing on our License Fulfilment site implies that we don't actually have, say, licenses for the software. But it doesn't come out and say it specifically. If we have to purchase for each of our objects, then this is a dead issue toot sweet; I don't care if it is $.15/head.
Perhaps it'll work better if I try installing it to a NetWare server instead? I don't want to do that since NetWare doesn't play nice in VPC; NetWare, just like Win95, has an idle-loop that NOOPs instead of HALTs, which makes for a high CPU usage on the host OS. Things like Java do not take well to a high CPU usage, even if it is set to a lower priority.
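To make the NOOP-versus-HALT point concrete, here is a minimal sketch in Python, not anything NetWare or VPC actually runs. Both functions "idle" for the same wall-clock time, but one spins and the other blocks. The function names are made up for illustration, and time.sleep() is only a stand-in for a real HALT.

```python
import time


def busy_idle(seconds):
    """Idle by spinning: nothing useful happens, but the CPU is never given up.

    Returns the CPU time this process burned while "idle".
    """
    start_cpu = time.process_time()
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        pass  # the moral equivalent of a NOOP loop
    return time.process_time() - start_cpu


def halting_idle(seconds):
    """Idle by blocking: the process sleeps until there is something to do.

    Returns the CPU time this process burned while idle.
    """
    start_cpu = time.process_time()
    time.sleep(seconds)  # stand-in for HALT; the scheduler can run something else
    return time.process_time() - start_cpu


if __name__ == "__main__":
    print("spinning idle CPU seconds:", busy_idle(0.5))
    print("blocking idle CPU seconds:", halting_idle(0.5))
```

The spinning version should report CPU time close to the full wait, and the blocking version close to none. A guest that idles the first way looks permanently busy to the host.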
The Admin has returned, and all is still well with the world.
We still haven't figured out the pattern TAS is following for the enable/disable failures. We're gathering data. Now that he's here to fix things, we have hopes of getting to the bottom of this.
Other than that, very quiet today.
I just spent a quality hour fixing a little bitty big problem in one of our partitions. I've written about this before. The "Max Delta" time on one of the partitions was getting outrageously high, but only as reported on a couple of servers. What was also of note is that the Change Cache on some of these servers contained practically every object that had ever been through that particular partition.
Not a good state to be in, but not lethal. More of a resource suck than anything. Poking it with various sticks didn't work. What did it in the end was to declare a new epoch. But that hasn't cleared the Change Cache stuff yet, though that may just be a matter of time. We'll see. At least everything is talking now.
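For anyone wondering how a "Max Delta" can be sky-high on one server and fine on another, here is a rough sketch in Python. It assumes the number is simply the age of the least recently synchronized replica, as seen from one server's point of view. That is my simplification, not how eDirectory actually computes it, and the server names and timestamps are made up.

```python
from datetime import datetime, timedelta


def max_delta(last_sync_by_server, now):
    """Return (server, delta) for the replica that has gone longest without a sync.

    last_sync_by_server maps a server name to the last successful sync time
    as recorded by ONE server; a different server may hold different times.
    """
    stalest = min(last_sync_by_server, key=last_sync_by_server.get)
    return stalest, now - last_sync_by_server[stalest]


if __name__ == "__main__":
    now = datetime(2000, 1, 10, 12, 0)  # arbitrary example time
    # Hypothetical view from a healthy server: everyone synced minutes ago.
    healthy_view = {
        "SRV-A": now - timedelta(minutes=4),
        "SRV-B": now - timedelta(minutes=6),
    }
    # Hypothetical view from a confused server: it thinks SRV-B is days behind.
    confused_view = {
        "SRV-A": now - timedelta(minutes=4),
        "SRV-B": now - timedelta(days=3),
    }
    print(max_delta(healthy_view, now))
    print(max_delta(confused_view, now))
```

Same ring, two views, two very different numbers, which is roughly what it looked like from here.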
Not much going on. DS-Admins meeting, during which we discussed NW6.5 and some of its implications. Hardly earth-shattering.
Monday OtherAdmin gets back from vacation. Once he digs out from 3 weeks of e-mail, it should prove interesting as to what grabs his attention first. TAS needs work, real work, as it was partially broken from the day he left. Exchange needs handling while we have all three of us here. Plus he needs updating on how our blades are coming along.
Should be fun!
Today I'm putting in the server that will manage our blade systems. This is my first Windows 2003 server install here. I had one W2K3 install at OldJob about a year ago, so the process wasn't totally lost on me. It's going pretty well, as I get to deal with the HP way of installing servers.
Though I did find something interesting. Unless I'm missing it totally, HP doesn't have a nice feature Dell has had for years. Dell has a web-site where you can plop in the service-tag number and get back a shipping manifest of what shipped with the server. This is especially nice when I'm not the one that specified the server in the first place and need to see what kind of memory came with it, or what OS shipped with it in the first place. I haven't been able to find that with HP yet.
No significant issues. A LAN cable is too short and will need to be redone, but the switch is in and functional. The cable can be done during normal production times, so I'm not worried. I had the thing configured before I put it in the rack anyway.
Time to go to bed.
Tonight I'll be replacing one of the 8-port fibre-switches with a 16-port switch I've got parked on my desk right now. Because the server-to-SAN connections are NOT redundant, this means that I get to take down half the cluster to do this swap. Fortunately, the one service that can only reside on one node is not included in this one, so all cluster services SHOULD remain up.
In theory. We'll see. Anyway, it'll be happening during our normal Tuesday night (Wednesday morning) maintenance window, so no one should complain.
This positions us well for next week. Next week the admin who has been on vacation all July comes back and we have a week before the other admin goes off to camp in eastern WA. We hope to get the Exchange migration booted along, now that the blades are in. At this rate I'm going to be bored come October.
The blade rack is installed. The controller from the BB-Test failure is in and things are working again. So far, all is well with the world. Now we just need to crowd things around on the SAN switches so we can swap an 8-port for a 16-port.
Tuesday I upgraded the CA server and the last of the three replica servers. No significant issues to report. Then, contrary to popular belief of such things, I went right on vacation. Family in town, so my ability to respond to calls depended entirely on where I was in relation to cell towers. And for a good chunk of Wednesday I was hiding behind islands.
Happily, no issues cropped up. We have a weird one that looks to predate the upgrade to 8.7.1 much less the upgrade to NW6.5. In fact, it looks to date from last summer. The fact that we hadn't noticed until now means that it isn't a severe error. A ring-delta is not reporting correctly on some servers with R/W replicas. There IS a TID on this one, but it shows that the error should show up on Subordinate Reference servers. But both the server that is claimed to be in error and the Master server do NOT see the error so I'm willing to ignore it.
And working. There was some excitement during the process of bringing up the newly reinstalled server. It seems I didn't wait quite long enough for the obits of the old objects to fully process through. Little excitement, but nothing killer. Things are working now.
This bodes well for tomorrow when I do the CA server. Then all the servers I can do in production times will be done (I think) and the rest will have to be done after hours. Wheee.
Things survived the weekend alright. The new NW6.5 server was backed up with nary an issue. This morning I work on replica server #2.
In other news, the Housing request for workstation import is working! They got their DNS entries set up so zenwsimport exists, and my import services are showing imports. And the workstations are landing where they are supposed to go.
...will be populated with a lot more itsy bitsy agents.
If Novell wants us to be able to access their servers without any additional software on our part, they have a lot of work to do. You see, the client is a thick app. It does lots of things. It allows you to have NCP-access to the Netware server. It allows your NDPS to work correctly. It resolves your Novell hosted resources. And most important of all, it has a login script.
I'm afraid you're going to have to pry that login-script out of our cold dead hands.
iPrint is a nifty idea, but requires an agent. It installs pretty easy, and the user may not notice that it is an agent. Works good. Requires login for audited printing. Users get dinged for each page they print just like they're supposed to. Spiffy.
And NetDrive allows you to map a drive to a Netware server not running Native File Access Pack. And if it is running NFAP, theoretically you can login without anything special. Though, your clustering may not work as well as it would with NetDrive. Pretty cool when it works.
ZenAgent allows your machine to be more hackable, er, administerable by your administrators. Extra spiffy. With niftiness.
But where is my U: drive? My P: drive? My Q: drive? How the heck am I supposed to remember the UNCs for all the drives I need access to? Aren't they supposed to magically appear wherever I go? How am I, as an administrator, going to force every user that logs in to my servers to run a specific executable, such as a Zen update? Login script!
Yes, Zen may be able to reproduce a lot of that through cunningly crafted Group Policy Objects. Or Run Always application objects with drive-mappings. But that's not the same. Oh, no. A login script is universal to all Novell Clients from back in the 16-bit netx days to the latest Win2K/XP v4.90sp2 client. It runs more often, and more reliably, than Zen for Desktops runs. Oh, login-script. How I adore thee.
So far things are going well. iManager is working fine. My dstrace screen is back now, though it oddly takes a few seconds to pop up. NDS is healthy. Timesync peachy. Monday I'll do the other one, and I hope to do the CA server on Tuesday.
You see, I learned about a really nifty enhancement to PKI on eDir. It seems that with the advent of NICI 2.4.0, which shipped with NW6.0, it is possible to export the self-signed certificate, delete the old CA object, create a new CA object, link it to a new server, and import the old CA-certificate. Thus preserving the CA, and allowing the CA-server to be nuked and rebuilt. W-A-Y cool. The Novell Migration Wizard is pretty nifty stuff, but if you want to upgrade a server without, say, upgrading it, this is darned useful stuff. Yay! Changing your CA certificate is a pain in the bum.
It also seems that ResTek is interested in getting students access to their user directories and print to res-hall located printers from the ResTek network. Presumably on their own machines. This is an interesting concept, and I'm very curious to see how far it goes. Right now the best option is to install the Novell Client on the machines (*) and have 'em connect through that. Getting the ZEN agent on 'em would be extra nifty, and they're making sounds that such might be doable.
And I've had another contact from Novell regarding that Defect. No real progress, but it was a call-back. Glad to hear it. This particular problem has some political implications I was not previously aware of. It seems that the FTP process on the cluster is the last unsecured transfer protocol we have in use, and having to wait 12 months to get it secured would be... not good. We have options to work around the problem if it comes down to it, but all of them are kludges, and none are how we'd LIKE to operate if we were given the choice.
(*) Yes, best. You see, we haven't deployed CIFS (or NFS) in the cluster for a very good reason I haven't been told yet. We have AFP but it is far from clusterable, and only works through the magic of a script one of the guys in the office has built. But we don't do CIFS. Only NCP, and NetStorage. Oh, and we haven't used iPrint yet, so we're not about to turn it on on this short of notice. Perhaps later, but not now. Standards must be followed, after all.
The first of the three replica servers has been upgraded to NW6.5 with the Sp2 overlay CD. This is to a Compaq ML530-G1, and so far things have been working. This particular server was simpler than the others will be since it has no cryptographic role in the tree. The other two are either the CA, or an SDI domain server.
Right now I'm trying to figure out why I'm not getting a dstrace screen. Perhaps they took away my SET DSTRACE=+S and threw it into iMonitor. Or not. Still figuring.
I've registered against the defect. I've also given the team handling the defect the business impact that this defect is having on us. We have a very limited window to do major overhauls to the Netware Cluster, and this defect is a stop-project problem for this year's upgrades.
If we can't get this defect cleared, we're going to have to stay with NetWare 6.0 on our cluster until August of next year. At that time we could go with NetWare 7.0, or Open Enterprise Server, or whatever it'll be called. At that point the new OS would be lucky to have a single service-pack, which makes it a bit too immature for us to trust with a must-be-up system.
So far all the indications are that all of the cluster-based services will transfer without a hitch. Except for this one. Once it's cleared, we're all good.
Got the third replica server up to 126.96.36.199, and things are happy. All three are now there, so every replica ring in the tree now has an 188.8.131.52 server in it. Woo!
Also set up the very first NW6.5 server in the tree! I needed to get it in there so I could check out the apache/mod_edir stuff. And it seems, as expected, the bug with mod_edir talking to clusters is still there. But now that I've proven the bug exists with NW65SP2 I should be able to register against the defect.
Second server, no issues reported by anyone so far. This is good.