Plodding on

Things are more stable this morning, but we did have some issues. First and most worrying, two of the replicas on Hera were not in a good state. One, happily a small one, never left the "new" state. The other just plain wasn't syncing completely.

First, the replica that never left 'new'. I haven't seen that one before, so it took a LOT of digging until I found the fix for it. All the dstracing showed that attempts to sync that particular replica were throwing a -673 error (FFFFFD5F, replica not on). What ultimately fixed it was doing a "network address repair" on the other two main eDir servers. That seemed to kick clear whatever blockage had built up.
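For anyone matching the hex and decimal forms of these error codes: the hex value is the 32-bit two's-complement form of the negative decimal number. A minimal sketch of the conversion follows; the helper name `to_signed32` is made up for illustration and is not part of any eDirectory tool.

```python
def to_signed32(hex_text: str) -> int:
    """Interpret a hex string as a signed 32-bit two's-complement integer."""
    value = int(hex_text, 16)
    # Values with the top bit set represent negative numbers.
    if value >= 0x80000000:
        value -= 0x100000000
    return value


if __name__ == "__main__":
    # The code from the trace above.
    print(to_signed32("FFFFFD5F"))  # -673
```

The same arithmetic works in reverse: `format(-673 & 0xFFFFFFFF, "08X")` gives back `FFFFFD5F`.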

The second one was easier. I just removed the replica from Hera while I worked the other problem, and put it back once the other replica was working fine. In the process I noticed that some of the servers in that replica (but not in the ring) were showing 'unlocatable' errors in the network address repair process. Not critical. But once the replica was back on, it showed no signs of going the way it did at first.

As a side effect, I also identified a handful of servers that weren't correctly advertising their presence in SLP. In every case the SLP discovery options were set to 2, or DHCP-only. In that state the server ignores slp.cfg completely. Changing it to 4 immediately caused these servers to find the DAs and report their services, and thus permit their network addresses to be repaired.
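The discovery options value reads as a bitmask, which is why 2 and 4 behave so differently. Here is a rough sketch of decoding it. The meanings of 2 (DHCP) and 4 (static slp.cfg) come from the behavior described above; treating 1 as multicast discovery is my assumption from memory of the NetWare setting, so check it against the documentation for your version. The names `DISCOVERY_BITS` and `decode_discovery` are hypothetical.

```python
# Bit meanings: 2 and 4 match the behavior seen on these servers.
# Bit 1 (multicast) is an assumption and should be verified.
DISCOVERY_BITS = {
    1: "multicast",
    2: "DHCP",
    4: "static (slp.cfg)",
}


def decode_discovery(value: int) -> list[str]:
    """Return the DA discovery methods enabled by a bitmask value."""
    return [name for bit, name in sorted(DISCOVERY_BITS.items()) if value & bit]


if __name__ == "__main__":
    for setting in (2, 4, 6):
        print(setting, decode_discovery(setting))
```

Under those assumptions, a value of 6 would enable both DHCP and the static file, which may be a safer choice on networks where DHCP does hand out DA addresses.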

SLP on this server in general is a bit confusing. I'm not sure what services an OES-Linux server is supposed to advertise, so I'm not sure if SLP is completely healthy.

I also managed to get LUM set up right. With that in place, My Fellow Admins can log in to the server without me having to create accounts! I'm so proud.

In terms of server health, things are in very good shape. CPU usage is still a bit worrying, but now that I have a day's worth of data to look at, it appears to be about the same as it was before. On the other hand, the "outstanding requests" figure in the iMonitor agent health check consistently shows lower numbers, like 3-5 instead of the 7-9 it was before. Peanuts, but progress.

And this morning we heard that the first parts of the router replacement have started. A couple of buildings were moved to the new cloud around 6am today. No screaming so far.
