February 2004 Archives

It looks like we'll be deploying an update that'll get our Certificate Authorities imported as trusted to all workstations. We can't just use AD for this since a goodly percentage of our machines aren't imported in AD. This has a number of benefits:
  • Users won't get the warning message about 'untrusted certificates' when hitting some services
  • We can sign more stuff
  • And possibly most important, we can stop paying VeriSign wergild every time we need to set up a secure service and don't want worrying messages to come up

I just learned that said volume is a volume mapped by everyone on campus. I'm sensing a resource issue back when they bought the SAN. This should really be fault tolerant. This server will remain unpatched until the standard maintenance time.

It turns out that one of the cluster volumes isn't really a cluster volume, but direct attached. Never mind that it has a cluster resource. This is good, really, as it tells me that we have a resource. But it can't go anywhere. This will complicate getting that particular cluster node patched.

NW6SP4 has been applied to two more servers in the cluster, without incident. One more server remains before the whole thing is patched. On the plus side, possibly a really big plus, NDPSM hasn't bombed out on the student side since we patched. We had been getting abends on CPU HOGS relating to interrupts not returning correctly while execution is in NDPSM's SNMP polling process. Oh, I hope this has cleared up that particular abend problem. Oh, I hope.

The new Ciscoworks server has been passed off to telecom for them to do Cisco things to it. I wish them luck. We used Ciscoworks at my last job, and installing the thing was a pain in the rear. The guy had to reinstall it several times in order to get some install-time settings done correctly.

Such as which device is your seed-device when you run your first discovery. We set it to the wrong device, and the map that was generated was really screwy as a result. That, and which devices are set for SNMP discovery, can bite you.

As this application is written in Java, worlds of problems can come up when trying to access it from browsers with different JRE's installed. At OldJob, I had three different JRE's I 'needed' in order to access various web-based consoles. Ciscoworks, which required a specific JRE. Elron Software's WebInspector console, which needed a different JRE than Ciscoworks. McAfee's E500 device, which also had a specific JRE needed. Fortunately, the JRE for the E500 and WebInspector were close enough that I could use the same one for each with only minimally annoying problems. Ciscoworks' JRE was cranky enough I had to have my test machine have that JRE in order to use that application at all.

It survived the night.

Remember that ASUS P3V4X I was complaining about? Well, it seems that it really is the IDE drivers. Or at least, I have more information that such is the case. We flashed the BIOS, which was a trivial update. But when I did SP4 I did NOT update the drivers. So far, things seem to have worked. The server has survived a pair of reboots, and has survived a simulated backup (thank you SMSTEST). We'll see if it survives the night.

In personal news, a friend of mine just started working for the Pope (indirectly) as a Systems Administrator involving Solaris, at a private University. He left a job with his State to go work for a private University. Congratulations, Chilly!

This afternoon we headed down to the Seattle area to see a demonstration of HP/Compaq's blade-server line. It was an impressive demonstration. The best part is that the number of servers we're considering putting into this system is ABOVE the break-even point. This stuff requires some infrastructure investments, and if you have under 6 blades it comes to be more expensive than normal 1U servers would be. The power for these enclosures is 220v three phase, which is some hefty juice. A full 42U rack of these should pump out the heat something fierce.
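
The break-even reasoning can be sketched in a few lines of Python. The prices below are hypothetical placeholders, not vendor quotes; they were picked only so that the answer lands on the 6-blade figure from the demonstration. The idea is that the enclosure infrastructure is a fixed cost, and each blade has to be cheaper than the 1U server it replaces for that cost to be paid back.

```python
import math

def break_even_blades(enclosure_cost, blade_cost, rack_server_cost):
    """Smallest blade count at which enclosure + blades cost no more than
    the same number of 1U servers. Returns None if blades never catch up."""
    saving_per_server = rack_server_cost - blade_cost
    if saving_per_server <= 0:
        return None
    return math.ceil(enclosure_cost / saving_per_server)

# Hypothetical prices, for illustration only.
ENCLOSURE = 9000   # chassis, power, interconnects (fixed cost)
BLADE = 3500       # one blade
RACK_1U = 5000     # one comparable 1U server

print(break_even_blades(ENCLOSURE, BLADE, RACK_1U))
```

Below the returned count the enclosure cost dominates and plain 1U servers are cheaper; at or above it the blades win on purchase price. Power conversion and cooling would shift the fixed cost and are not modeled here.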

We may get there in the not too distant future, as a number of our current servers are in the 3-5 year range for age. Thus, they need replacing. A lot of them can be replaced with blades, so long as their I/O needs aren't hefty, or can be serviced from the SAN easily enough. The bigger database servers may need their own stuff, but we'll burn that bridge when we come to it. These bitty things can't have PCI slots in 'em, so no direct attached backup for these puppies.

The ASUS P3V4X just plain doesn't like NW6SP4. I think something in the IDE drivers and the VIA chipset. But this morning I found the newly rebuilt server abended, and the SYS volume continually VREPAIRed. Not good.

Rebuilt from the SP3-overlay CD, and so far all is well.

For future reference, to install a SYS volume for a NetWare 6 install as Traditional not NSS, the F5 key is to be used to switch. This is not a screen-documented feature, but it does show up in the NetWare documentation.

Found it. It seems the ASUS P3V4X motherboard doesn't like things in NW6SP4. We have another server like this one that'll need rearranging. It occurred to me that if we reformat and use a traditional volume for SYS and not NSS we might get around the problem. These two servers don't really need an NSS volume for SYS.

SP4 hates this server. There is some obscure NSS something or other somewhere in the partition tables that is tripping up the NSS update. It'll install fine from the SP3 overlay CD, but as soon as I put in SP4 and reboot... wham.

...or not. SP4 seems to have either caused, or catalyzed, a problem in our SYS: volume on a homebrew server. Calling Novell to help figure it out. "invalid super block headers" is the magic key, and none of the TID's are helping. And modules from NW6NSS3c aren't working either. Next step is to try and move files from the backup (thankfully saved to a traditional volume) to the C:\NWSERVER directory and see if we can get up THEN. As this server is our ArcServe backup server, rebuilding will be a !$~%!@#.

Service pack 4 for NetWare 6 has been successfully applied to the root-ring, and all is well. In addition, this brought in a new version of dsrepair, which allowed a continual-sync problem we had to go away. The problem wasn't significant, just annoying to those purists among us who don't like strange things happening to our directory-services. The fix to our problem is described here.

Almost a year ago, we were doing some consolidations of replicas in our NDS tree. One replica got merged with the root replica. Unfortunately, somehow a transitive vector was still flying around with the replica-number of the now-dead replica. Thus, updates to the object in question have been staying on the transitive vector for almost a year. The object worked just fine, it was just that the changes were chasing themselves around the root-ring. A side effect of this is that the root-ring was never more than 3 seconds out of sync from each other due to the continual sync. That ain't normal.

Abend on one server in the cluster, following the disturbing pattern of past abends. Something screwy in how NDPS is reacting to interrupt requests. Printing is now on the server with SP4 on it, so we'll see how well things stand up.

NW6SP4 has been applied to a cluster node, and it seems to be working just fine. There is a slight problem with the NW-portal, but nothing that can't be worked around.

The hacks of a few weeks ago have certainly increased vigilance, and suspicion of unusual activity. This is a good thing.

Now if only I could figure out why the apache logs aren't rotating like they used to. Hmmm.

All is quiet. No fires.

*taps meter* tha' heck? What happened to our bandwidth? Weird.

Anyway... SP4 seems to have stuck, and not caused any (immediate) problems.

Applying NW6SP4 to a server this morning to see what kinds of issues we come up with. A read of the readme doesn't show anything we have to be worried over. We'll see how this turns out.

Looks like we may be doing NW6SP4 soon. Depending on if it looks stable so far.

Quote-file update.

Unfortunately, nothing like a pattern is emerging from our abends since 12/1/03. WSPIP.NLM has several, but we've had far more CPU-HOG abends than Page Fault Processor Exceptions. Perhaps we need to tweak our CPU-Hog detection parameters? Not sure, here.
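
One way to go hunting for a pattern is to tally the abends by type and by module. A minimal sketch in Python, assuming the abend log has already been boiled down by hand to (date, type, module) records. The sample records are invented for illustration and are not our real abend history.

```python
from collections import Counter

# Invented sample records: (date, abend type, module running at the time).
ABENDS = [
    ("2003-12-03", "CPU Hog", "NDPSM.NLM"),
    ("2003-12-17", "Page Fault Processor Exception", "WSPIP.NLM"),
    ("2004-01-08", "CPU Hog", "NDPSM.NLM"),
    ("2004-01-21", "CPU Hog", "SERVER.NLM"),
    ("2004-02-02", "Page Fault Processor Exception", "WSPIP.NLM"),
    ("2004-02-05", "CPU Hog", "NDPSM.NLM"),
]

def tally(records):
    """Count abends by type and by module."""
    by_type = Counter(kind for _, kind, _ in records)
    by_module = Counter(module for _, _, module in records)
    return by_type, by_module

by_type, by_module = tally(ABENDS)
for kind, count in by_type.most_common():
    print(f"{count:3d}  {kind}")
for module, count in by_module.most_common():
    print(f"{count:3d}  {module}")
```

If one module or one type dominates the counts, that is where to start digging. If the counts stay flat, the tally at least confirms there is no obvious pattern yet.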

It abended again. And this time I'll have the opportunity of running the log through Novell's abend analysis engine. Perhaps it'll tell us something useful, like "update your TCP package", or something like that.

This patch is needed for some NetWare sp3 machines in order to generate a correct abend-log. It is sad we need to apply this, but we are.

Here is a pretty thing:

Compaq

and:

Dell

Both are 1U keyboard/video/mouse trays using flat-panels. Probably laptop panels. These can save buckets of space in a rack. We're not that dense here, but at OldJob we were. I've had the Dell kind before, and have been somewhat pleased by it.

You really have to watch the refresh frequency with the panels, and the laptop-descended ones are even pickier than the normal flat-panel ones. If you get stuck in a state where a video card insists on driving things at 85Hz, you really have no choice but to either reformat the server (icky), or haul in an old CRT and run through that. It can be a challenge. Especially if you have to deal with Sun equipment that insists on driving its video signal at 85Hz. Ahem. Fortunately, I don't have to do that at this job.

The problem of the day appears to be NDPS. There is quite a push to get NDPS deployed because the server currently being used for queue-based-printing is an ooooold NW4.11 box. We need to get that out of the way. For one, it'll help us on the road to a network that just runs TCP/IP. For two, it'll get rid of some really creaky hardware.

The problems are many. A couple of HP printers are showing decided recalcitrance when it comes to being set up. The HP LaserJet 4200 is a prime offender, with the HP 5M a close second. Also, NDPS-objects that don't distribute a print driver are having issues sending jobs to printers. Sometimes, it'll print two pages. Sometimes, not at all. Unfortunately, that problem doesn't have a pattern yet, which makes troubleshooting problematic.

In other news, it looks possible to wire the Macintosh labs so they route jobs through NDPS. It won't get audited by PCounter, which they aren't now anyway, but it would centralize that function nicely. In the future, once such accounting can be put in place, it will be a matter of changing things on the servers and not on the clients.

Both servers are installed and ready for the next stage. One was an HP Proliant that I didn't have a SmartStart CD for. Couldn't find it. Fortunately one turned up in an office around these parts, and seemed to have the drivers needed to get it going. The SmartStart OS install for Win2K is pretty nifty. It copied the OS-source-files to disk before it kicked off the install, so I didn't need the CD during OS install. This meant I could use a disk I had created that had SP4 slipstreamed in, but wasn't bootable. Saved a step this way.

Now the challenge is to figure out if they belong in the domain or not. If the customers need regular console access, these servers may end up in their own workgroup. If they don't, they get domainified. However, the computers may still be in the domain, but the access is handled locally. Unknown at this time. Must get clarification before going forward with this one.

Oops! It seems the two servers have been here some time. I was corrected in this mistake late yesterday. I'm now setting the two up. One is going to be a Cisco ACS server, so gets to be cranked down quite hard, security-wise. This'll be fun. But I'll need to know the remote-access needs of the telecom group before I get too deep into things. If they expect console access, that's one thing. If not, that's another.

Found what was wrong with NetStorage. It seems the server it was configured to send authentication queries against wasn't getting queried, because the NetStorage server was unable to resolve the DNS name into an IP address. So I just changed the setting from a DNS name to an IP address, like it is on the Student side, and it works as advertised. Simple, but tedious, change. All is well now.
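
A quick way to confirm this kind of failure is to ask the resolver directly from the affected machine. The sketch below is Python using the standard socket module, not anything NetStorage itself provides. The injectable resolver argument is an assumption of this sketch, there only so the function can be exercised without a working network.

```python
import socket

def resolve(host, resolver=socket.getaddrinfo):
    """Return the sorted, unique addresses a host name resolves to,
    or an empty list if the lookup fails."""
    try:
        infos = resolver(host, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # the address is the first element of sockaddr.
    return sorted({info[4][0] for info in infos})
```

Calling resolve() with the configured authentication server's name (for example a placeholder like "auth-server.example.edu") and getting back an empty list would point at name resolution rather than at the authentication service itself.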

It seems that NetStorage is requiring a full context to log in for the faculty/staff side. Students don't have to. Must figure out why.

Anandtech recently posted an article about a new Intel CPU that was released.

http://www.anandtech.com/cpu/showdoc.html?i=1956

Intel has decreased their process size from .13 micron to .09 micron (90 nanometers). But in order to do that they had to make some fundamental changes in order to provide a faster CPU. A very interesting read. One of the key things I learned was that existing P4 stocks based on Northwood will remain faster than the new stuff (Prescott) until the new stuff hits 4GHz.
This may impact us when it comes to server purchases, as OEM's are already in possession of these new units. We shall see where this takes us.

I've been tasked with the installation and base setup of two servers for our telecom section. One is a Ciscoworks server. The other is a Cisco Secure Access Control Server that we'll be using for something or other. The second has some fairly interesting requirements for what the base OS looks like. It will be interesting to see how much of it I'll be able to do and still maintain base standards for server installs. As I understand it, the hardware hasn't even been ordered yet, so it'll be a few weeks before I get my hands dirty on this one.

We are giving real thought to getting blade-servers for some of our new servers this year. Our vendor has a new generation of blade servers that seem to eliminate some of the road-blocks from previous blades. For one, they have a chassis that allows blades that use P4 CPU's. AND permit Fibre Channel connections to a SAN. The combination of all this makes it possible to have a blade be a head-unit for a SAN-based system. Such as a medium throughput, high-volume database. Or a clustered file-server. Or whatnot. On the down-side, the power requirements are a touch exotic for our existing setup, so we'll have to convert something if we get one (or two).

Stress testing of this web server continues, with very good results. I created a more stressful test designed to work around internal caching, and succeeded. With 500 'users' hammering at an average of 5ms between requests, server load went to about 75%. As this is roughly 60x our expected load, this is good news.
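
The arithmetic behind those numbers can be sketched in Python. Two assumptions are baked in and neither is measured: that every simulated user really does fire a request every 5ms (which ignores the time the server takes to answer, so it is an upper bound), and that server load scales linearly with request rate.

```python
def max_request_rate(users, gap_ms):
    """Upper bound on requests per second if each user waits gap_ms
    between requests and every response comes back instantly."""
    return users * 1000 / gap_ms

def load_multiple_at_saturation(tested_multiple, observed_load):
    """Linear extrapolation: the multiple of expected load at which the
    server would reach 100%, given the load seen at tested_multiple."""
    return tested_multiple / observed_load

# Figures from the test run: 500 users, 5ms between requests,
# about 75% server load at roughly 60x the expected load.
print(max_request_rate(500, 5))
print(load_multiple_at_saturation(60, 0.75))
```

Under the linear assumption the server would top out somewhere around 80x the expected load. Real servers rarely degrade that politely near saturation, so treat it as a rough ceiling rather than a promise.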

I've been playing around with a handy utility called 'hammerhead'. It's a web-server stress tester, and I've been throwing it against this server to see how far I can push it. The answer is pretty far. Considering what the current load is on the existing 'student web' server, we shouldn't have a problem. I threw a load that was around 75x the normal load, as baselined by the existing server, and the server didn't even creak. I'm playing around with some options to get a better stress tester out of it, so we'll see how far it goes.
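
For anyone wondering what a stress tester does under the hood, here is a minimal sketch in Python. It is not how hammerhead is implemented; it only illustrates the general idea of many simulated users requesting pages in parallel and timing the responses. The function names and the injectable fetch argument are inventions of this sketch.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def http_get(url):
    """Fetch a URL and discard the body; raises on HTTP or network errors."""
    with urllib.request.urlopen(url, timeout=10) as response:
        response.read()

def run_user(fetch, url, requests, gap_seconds):
    """One simulated user: time each request, pausing between them."""
    latencies = []
    for _ in range(requests):
        started = time.perf_counter()
        fetch(url)
        latencies.append(time.perf_counter() - started)
        time.sleep(gap_seconds)
    return latencies

def run_load(url, users, requests, gap_seconds=0.005, fetch=http_get):
    """Run `users` simulated users in parallel, each making `requests`
    requests, and return every observed latency in seconds."""
    with ThreadPoolExecutor(max_workers=users) as pool:
        futures = [
            pool.submit(run_user, fetch, url, requests, gap_seconds)
            for _ in range(users)
        ]
        return [latency for future in futures for latency in future.result()]
```

Something like run_load("http://test-server.example/", users=50, requests=20) would return 1,000 latencies to average or plot; the URL is a placeholder. Only point this sort of thing at a server you own.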

The honeypot worked! I snagged a hacker Saturday night. Full net-sniff of the traffic in question. They tried to turn it into a warez site, but ran out of disk-space. Seems to be a different group than the last group that hacked us.