An exciting week

Yesterday at about 9:30, I noticed that the master eDir server was throwing memory allocator errors. I've seen a bajillion of those since SP3 a year and a bit ago, so that's nothing new. Neither was the server refusing to shut down gracefully. This has been a common problem with our blade servers, and I haven't figured out where the hang is. It sat stuck for a good long while and CPU1 never shut off, so I hard-booted the server. There was no other way it was coming back, and in its state it was 'sort of up' and causing problems with computer labs and some logins. So it HAD to reboot.

Unfortunately, it didn't come up right. eDir was in some strange state when the reset button was pressed, and it was unrecoverably corrupted. Never seen that before. So we had to reinstall eDir on that machine, following the 'down server' TID. Owie.

It also held our CA, and the backup I thought I had didn't exist.

Double Owie.

It also held the masters for most of the replicas, and was in all the replica rings.

Triple Owie.

It was one of our two SLP DAs, and clients out there had a hard time with the data it was giving, when it was giving it.

Quadruple Owie.

It was a long day yesterday.

In one of those 'silver lining' things, the outage showed us how many LDAPy things rely on that specific server for service. Now that we have LDAP load-balanced by an F5 BigIP, this was a prime chance to get the folk who were pointing at the server directly to point at the BigIP's virtual IP instead. Yay! That IP will stay up so long as at least one of the three eDir servers is up. No repeat of yesterday.
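
For anyone wondering why the virtual IP helps, here is a rough sketch of the idea in Python. It is not how the BigIP is actually configured: the server names (edir1, edir2, edir3) are made up, and the first-healthy pick is a stand-in for whatever balancing method the real device uses. It only shows that a client pointed at the virtual IP gets service while any one backend is alive.

```python
def vip_is_up(backend_health):
    """The virtual IP answers as long as at least one backend is healthy."""
    return any(backend_health.values())


def pick_backend(backend_health):
    """Return the first healthy backend, or None if all of them are down.

    A simplification for illustration; a real load balancer spreads
    connections across the healthy members rather than picking the first.
    """
    for name, healthy in backend_health.items():
        if healthy:
            return name
    return None


# Hypothetical state during an outage like this one: the master is down,
# the other two eDir servers are still answering LDAP.
health = {"edir1": False, "edir2": True, "edir3": True}

print(vip_is_up(health))     # the virtual IP still serves LDAP
print(pick_backend(health))  # requests land on a surviving server
```

A client bound directly to edir1 would have been dead in the water here, which is pretty much what happened to us.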
