September 2005 Archives

Server-side includes

Turns out I had set them to be allowed (IncludesNOEXEC), but hadn't actually set things up so they could be used. Erm. That's fixed now. We'll see how they actually work out, eh?

NXCreatePathContext

The saga continues.

Last week, on advice from one of the folk in the Developer forums, I did some network captures of the problem in progress. This is the SFTP-not-working thing for those obsessive followers of this blog, just so you know. The entries in the logs were saying that SSHD was unable to create a user identity to connect to the remote directory, and was reporting an ENCP (Generic NCP Error) error. Not much with the useful.

However, on the wire there was a strange conversation that seemed to lead to the error being thrown:

SSHD-Server: Get Addresses for Resource, buffer size=1024
ResourceServer: -649 Insufficient Buffer
S: Get Addresses for Resource, buffer size=9129
R: [addresses]
S: [connection tear-down]
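
For comparison, here is a minimal Python sketch of how a grow-and-retry exchange like this would normally be handled: ask with a small buffer, and on "insufficient buffer" ask again with a bigger one, then use the answer rather than tearing the connection down. Everything here is hypothetical illustration: FakeResourceServer, get_addresses, and fetch_addresses are made-up names, not LibC or NCP calls, and the assumption that the server reports the size it needs is a guess based on the 9129 in the second request.

```python
INSUFFICIENT_BUFFER = -649  # error code seen in the capture


class FakeResourceServer:
    """Hypothetical stand-in for the resource server (not a real NCP API)."""

    def __init__(self, addresses, required_size):
        self.addresses = addresses
        self.required_size = required_size
        self.requests = []  # buffer sizes the client asked with

    def get_addresses(self, buffer_size):
        self.requests.append(buffer_size)
        if buffer_size < self.required_size:
            # Assumption: the reply tells the client how much room it needs.
            return INSUFFICIENT_BUFFER, self.required_size, None
        return 0, self.required_size, self.addresses


def fetch_addresses(server, buffer_size=1024, max_attempts=3):
    """Retry with a larger buffer until the server returns the addresses."""
    for _ in range(max_attempts):
        status, needed, addresses = server.get_addresses(buffer_size)
        if status == 0:
            return addresses  # success: keep the connection and use the data
        if status != INSUFFICIENT_BUFFER:
            raise RuntimeError(f"unexpected status {status}")
        # Grow to what the server asked for, or at least double.
        buffer_size = max(needed, buffer_size * 2)
    raise RuntimeError("gave up after repeated insufficient-buffer replies")


server = FakeResourceServer(["10.0.0.1", "10.0.0.2"], required_size=9129)
addresses = fetch_addresses(server)
print(server.requests, addresses)
```

The sketch only models the happy path; in the capture, the second request succeeds and the connection is torn down anyway, which is the part that looks wrong.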

When I reported this, it apparently jibed with a known bug in LibC. So I got a brand new LibC.NLM to try out. And since this is a fault on the SSHD-Server side of it, I don't need to get it into the resource server, with the resultant reboots. Yay! So I've thrown it onto the SSHD/MyWeb server to see how bulletproof it is. The date of the NLM is newer than the SP4 release, which worries me a bit.

Yummy power outages!

We're having one right now! It was a windy day today, so some line probably blew down. No idea how widespread this particular outage is. But the generator fired, and the transfer-switch transferred, so our machine room is on generator right now. UPS load is at about 82%. Unfortunately, I know neither how many hours of gas we have in that thing, nor how much runtime an 82% load gives us once that generator runs dry.

Hopefully, we won't have to worry about that.

Here is a funny thing

Of the rootkits I've pulled off of servers, none of them seem to have been as nasty as what seems to be coming down in spyware these days. I wonder why that is? The spyware stuff is all about polymorphic naming, strange services, and outright pervasiveness in everything in the system. The stuff I've cleaned up manually on servers has been relatively easy, and most of it has been some variant of HackerDefender.

Interesting.

More cache buffers

Ok, now that I noticed the problem, we seem to be having a problem with our NDS servers bouncing themselves due to lack of memory. This is, as they say, Not Good. I understand that NW65SP4 contains significant memory management enhancements, but we're not going to throw that in just because. I want to try and stamp this one out without having to upgrade our eDir.

School continues

Due to a bug I've talked about a lot, here, here, and here to name three, sftp isn't working on the faculty side of the cluster. Which means that blog updates aren't happening. So you'll be getting these in a big flood once I get the problem worked around (node failover).

Printing continues apace. The hourly average between 11:00am and now seems to be around 1000 jobs. That's a bit up from last year, but it is also the time of the quarter when syllabi and class materials are being printed up. And there was a pcounter update, which makes some of the UI iPrint aware (not that it does anything with it, but aware).

Mystery Reboots

Two of our three NDS servers rebooted themselves last night. From the logs it looks like they ran out of cache-allocator memory and somehow managed to reboot themselves without generating an abend.log entry. Weird.



As you can see from the chart (thank you Intermapper and a custom probe), free cache buffers had been decreasing steadily over time before running out completely. The thing is, I'm not sure where the leak is. The eDir cache has been tuned, and I looked at it yesterday afternoon and didn't see anything out of the ordinary. The filesystem cache had been getting large, but one of the fixes in SP3 was to introduce the ability to scavenge cache buffers out of the NSS cache if needed. It wasn't a leak in TSAFS, since no backups were running at the time the messages started showing up.
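
A steady decline like that can be turned into a rough "time until empty" estimate by fitting a straight line to the probe samples. The Python sketch below is only an illustration under loud assumptions: the sample values are invented, not read off the chart, and hours_until_empty is a hypothetical helper, not part of Intermapper or the custom probe.

```python
def hours_until_empty(samples):
    """Least-squares line through (hour, free_buffers) samples.

    Returns the hours remaining after the last sample until the fitted
    line reaches zero, or None if the trend is flat or rising.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    covariance = sum((t - mean_t) * (y - mean_y) for t, y in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    if slope >= 0:
        return None  # not shrinking, so no projected exhaustion
    intercept = mean_y - slope * mean_t
    zero_at = -intercept / slope
    return zero_at - samples[-1][0]


# Made-up readings: hour of sample, free cache buffers (arbitrary units).
readings = [(0, 1000), (1, 900), (2, 800), (3, 700), (4, 600)]
remaining = hours_until_empty(readings)
print(f"projected hours until empty: {remaining:.1f}")
```

An alert on the projected time remaining, rather than on a fixed threshold, would flag a slow leak well before the servers bounce.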

Funny quote

On a more personal note, if there's anything that inspires so much
fear as a server that doesn't mount its volumes, I haven't seen it...
Skydiving is tame in comparison. (The ground can only kill you once.
600 angry users can do much more damage.)

I'm wide awake after a 21 hour day... "no volumes" works better than
caffeine.

Oh, I've had that happen to me before. That sudden, dropping sensation in the stomach that lets you know that a world of hurt just opened up before you.

More school stuff

This morning we did the P-Counter printer-quota reset. We do this before every quarter. Since this touches every single student object we have, whether real or memorex, this is a pretty intensive NDS operation. Since this is an attribute that isn't in an index, it's that much more expensive in terms of time. Thanks to all of the updates, some password changes this morning didn't happen at their usual 'right now' pace, and took a few minutes to get everywhere.

The print-rate is picking up as people print off schedules and class materials.

We're doing a brisk business in account activate/deactivate, which also has NDS impacts.

We're getting a lot of password changes right now.

Ahh, the beginning of school. The busiest time of year for some systems.

Oh really?

https://wuf-stuns.wuf.wwu.edu/[yada]/New%20Folder%20(2)/hot%20showers%205-1.wmv

R-i-i-i-i-g-h-t. At least it wasn't in a "myweb" folder.

BitTorrent usage

This is old hat to some of you, but it is still interesting. Below, we have a graphic from our bandwidth monitor:



Here you can see very clearly what one lone BitTorrent user can do to an internet connection. That was one station, and it took up about a quarter of our total available bandwidth once it started permitting uploads, and that percentage was scaling upwards once the East Coast started waking up. From the looks of it, his "share percentage" probably was in the 8.00-10.00 range. It got killed about 9am.

We'll be seeing more of this during the year.

The start of quarter is nigh

Classes start on Wednesday, and move-in starts processing tomorrow. Monday and Tuesday students will be registering for final classes, buying lots of stuff, and setting up accounts if they haven't already. The PCounter reset will happen Monday, giving everyone new quota to play with.

A couple of projects that we'd have liked to have gotten done before students arrive have slipped. The migration of our FrontPage server to SharePoint has hit a solid rough patch and probably won't get done until early October. Upgrades to our home-built web-forms have also hit a snag and will probably get pushed out another couple of weeks. The new Titan isn't fully in place yet, or so I hear.

But things are looking solid elsewhere. Disk-space is in good shape, though that doesn't really start stressing until about 4 weeks into the quarter anyway. The anti-spam solution for Titan is present, though I'm not sure where it sits; we need trainers for the Bayesian filters. And Veritas hasn't had any vulnerabilities announced against it recently, so we should be hack-free for the critical period. Ahem.

So. We're in good shape.

SFTP is slow

Especially on the cluster. I've noticed this on a couple of ssh installs I've run into over the last few years. Some SSHDs are just plain poky. I just republished this blog by way of SFTP, as a way to clean up some template changes on the archived pages. And it took 17 minutes to get all the stuff transferred. That's 450-odd files, and 6.6MB of data. Pathetic transfer rate, but I suspect part of that is due to file-open/file-close operations.
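
To put numbers on "pathetic", here is the arithmetic as a small Python sketch. It uses only the figures above, with two assumptions: "450-odd files" is treated as exactly 450, and 6.6MB is read as decimal megabytes.

```python
# Figures from the republish run; the file count is rounded to 450.
TOTAL_BYTES = 6.6e6          # 6.6 MB, assuming decimal megabytes
ELAPSED_SECONDS = 17 * 60    # 17 minutes
FILE_COUNT = 450

throughput_kb_s = TOTAL_BYTES / ELAPSED_SECONDS / 1000
seconds_per_file = ELAPSED_SECONDS / FILE_COUNT
avg_file_kb = TOTAL_BYTES / FILE_COUNT / 1000

print(f"throughput: {throughput_kb_s:.1f} KB/s")
print(f"time per file: {seconds_per_file:.2f} s")
print(f"average file size: {avg_file_kb:.1f} KB")
```

That works out to roughly 6.5 KB/s overall and a bit over two seconds per file, for files averaging under 15 KB. Numbers like that are at least consistent with per-file overhead dominating, rather than raw link speed.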

Performance of that caliber I've seen on some Linux/Solaris boxes, so this isn't limited to NetWare.

CA resolution

Turns out there are some bugs in the CA install.

Our environment:
  • Empty root domain
  • Domains all initially installed as 2000-Server
  • All Domain Controllers now 2003 Server SP1
It turns out that when you install the CA into such an environment, it creates a new group, but does not add the CA server to it. This is what happened to us. The group is CERTSVC_DCOM_ACCESS in the Users container. Adding the "Domain Controllers" group to that particular group allows auto-enrollment to work for that domain. I'm still getting the child domain, where the rest of us are, up and running, but at least it's working to spec right now.

This is actually documented in the SP1 release notes:

Note that if the certification authority is installed on a domain controller, and the enterprise is made up of more than one domain, Certificate Services cannot automatically update the DCOM security settings for enrollees from outside the certification authority’s domain. Therefore, these enrollees will be denied enroll access to the certification authority.

To resolve this issue, you must manually add the users to the CERTSVC_DCOM_ACCESS security group. Because the CERTSVC_DCOM_ACCESS security group is a domain local group, you can add only domain groups to it. For example, if users and computers from another domain, a domain named Contoso, have to enroll with the certification authority, you must manually add the Contoso\Domain Users group and the Contoso\Domain Computers group to the CERTSVC_DCOM_ACCESS security group.
Which says so right there. But SP1 hasn't been out long enough for any KB articles to be out on this subject.

PKI woes

It turns out that when we replaced all the DCs this summer, we nuked our AD-based CA. Oops. Still, it took us this long to notice it, so we're clearly not using AD-PKI all that much. But getting it back into place is proving challenging. Very challenging.

Disk space

In the prep for Fall, we've been assessing where we're sitting for disk space. In pretty fair condition, as it turns out. The Student user volumes were all hovering around 30% free, which is a touch low, so we increased the sizes to get the free-percent to about 50%. That should get us through fall quarter and a good chunk of winter quarter as well. We hope.
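
The sizing arithmetic behind that bump can be sketched in a few lines of Python. This assumes the amount of data on the volume stays put while the volume grows; growth_factor is a hypothetical helper for illustration, not a NetWare or NSS tool.

```python
def growth_factor(free_now, free_target):
    """Multiplier for the volume size so the same data leaves free_target free.

    Both arguments are fractions of the volume (0.30 means 30% free).
    """
    if not (0 <= free_now < 1 and 0 <= free_target < 1):
        raise ValueError("free fractions must be in the range [0, 1)")
    used_fraction = 1 - free_now
    return used_fraction / (1 - free_target)


factor = growth_factor(0.30, 0.50)
print(f"grow each volume by about {(factor - 1) * 100:.0f}%")
```

Going from 30% free to 50% free means each volume ends up about 40% larger than it started.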

On the FacStaff side, we're still in good shape. Our volumes aren't balanced, though, which will need to be addressed at some point. User3 shouldn't be double the size of User1 for instance. Good candidate for weekend migrations once we have a list and process built up.

MORE publishing!

Novell has seen fit to publish an AppNote I submitted back in May. This is on installing OES-NetWare to HP Blades. Most of the data in it is still good, but HP had a software release this summer that changed some of the menu options in their Altiris stuff. When the quarter starts and I have time again, I'll probably work up a rewrite to include the newer stuff. Until HP comes out with a formal installer for NetWare onto their blades, I'll try and keep this one somewhat up to date.