Very exciting weekend

I wake up Saturday morning to find 25 text pages on my cell-phone. It seems that at 5:17am the one service on the cluster I have monitoring set up for crashed. No biggie, usually that means just migrating the service to a different node, kicking the web-server and migrating it back. Not today.

BigBrother showed that ALL OF THE STUDENT CLUSTER WAS DOWN

As in, everything was down. Red lights galore. I don't like that.

I also don't like that I can't get into the iLO cards for these servers and reboot at least one of 'em remotely. So I have to go in. Which I do.

I find that all three servers abended during some part of the backup process. Each abend had a TSA error, such as TSA SCAN or TSA READ. Further investigation shows that the servers went down at 11:19pm, 1:28am, and 5:17am. Some services were completely knocked out at the 1:28am crash and the rest came to a halt at 5:17am. This is the first all-down event since 10/18/03.

I do some updating to see if I can patch it for Monday. This morning, while going through the abend.logs to see if I can find anything, I notice that the ArcServe agent was loaded. This is interesting, since it was Veritas BackupExec doing the backup at the time. I don't know if this was a case of unholy reverberations, or just plain luck. At any rate, NWAGENT is not loaded anymore, and I have hopes that the backups tonight won't cause EVIL.

Needless to say, I'll be watching backups like a hawk tonight.