When good power becomes bad power

A good thing is happening. We're replacing the generator backing up the datacenter with a unit large enough to run both HVAC units. When the room was built in the 1999/2000 timeframe, it was presumed that one unit would be enough to keep the room cool. That's true to a point, but it didn't take into account localized hot-spots caused by very hot-running servers like the ones in our ESX cluster. Testing we've done shows that temps fall out of tolerance within 30 to 45 minutes of running on only one HVAC unit. So, we're setting things up to run both HVAC units. Good! It'd be even better if we could get a newer UPS, since this one was nearly EOL when we bought it. But that's something for another capital request.

Because we're replacing a generator, this means some unavoidable periods of time when the room is not fully protected and we'll be running on naked utility power. Like I said, this is unavoidable. Happily, utility power is pretty stable this time of year. We're having a bit of a hot snap right now, so there is some concern about AC-related brown-outs, but it isn't quite that hot yet. That's why the work is scheduled for the cool part of the day.

Murphy did not agree with us. Yesterday, they spliced the new generator transfer switch into the Bypass circuit of the UPS. This should have been a non-event, since the main circuit was just fine and feeding the load. Unfortunately for us, the monitor card on the UPS saw the Bypass circuit failing as a UTILITY FAIL event. What's more, it erroneously fired the ON_BATTERY event even though the UPS was not actually on battery. This started the shutdown timers on the servers with the UPS shutdown-service client on them. This is why things got Very Exciting around 8:57am yesterday, as those servers shut themselves down. On the plus side, things worked as they should. On the negative side, we were trusting a signal source that it turns out we shouldn't trust. Crap.
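For anyone who hasn't run one of these shutdown clients, the logic is simple, which is exactly why a bogus ON_BATTERY event is so dangerous. The sketch below is not our vendor's agent, just an illustration of the event-to-timer-to-shutdown flow; the event source, grace period, and shutdown command are all stand-ins:

```python
#!/usr/bin/env python3
"""Illustration only: the event -> timer -> shutdown logic of a typical
UPS shutdown-service client. This is NOT our vendor's agent; the event
source, grace period, and shutdown command are stand-ins."""

import subprocess
import time

GRACE_PERIOD = 15 * 60   # seconds to stay "on battery" before shutting down
POLL_INTERVAL = 10       # seconds between status polls


def watch(get_event, grace_period=GRACE_PERIOD, poll_interval=POLL_INTERVAL,
          dry_run=True):
    """Poll the UPS and shut the host down if ON_BATTERY persists.

    get_event() is a stand-in for however the real client reads the
    monitor card. Crucially, the client believes whatever it is told,
    which is how a bogus ON_BATTERY event started our shutdown timers.
    """
    on_battery_since = None
    while True:
        event = get_event()
        if event == "ON_BATTERY":
            if on_battery_since is None:
                on_battery_since = time.monotonic()   # start the timer
            elif time.monotonic() - on_battery_since >= grace_period:
                if dry_run:
                    print("would run: shutdown -h now")
                else:
                    subprocess.run(["shutdown", "-h", "now"], check=False)
                return
        else:
            on_battery_since = None   # power reported OK: cancel the timer
        time.sleep(poll_interval)


if __name__ == "__main__":
    # Simulate yesterday: the card reports ON_BATTERY even though the UPS
    # never left the mains, so the timer runs out and the host "shuts down".
    watch(lambda: "ON_BATTERY", grace_period=3, poll_interval=1, dry_run=True)
```

The takeaway: the client does no sanity-checking of its own. Whatever the monitor card reports is treated as truth, so a card that misreports a Bypass-side splice as ON_BATTERY will take your servers down for you.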

Then came this morning. This morning they were splicing the new transfer switch into the mains circuit, and during this work the UPS would be on Bypass, leaving us on naked utility power. Once done, the new generator would be supporting the UPS. The next outage would be similar, putting the UPS on Bypass while they cut over to the new electrical panel downstairs.

Unfortunately for us, when the work was completed and we went through the UPS startup procedure, two things happened. First, we discovered that the Input breaker had tripped at some point between when we shut the UPS down and when we opened the doors to start it back up. We (actually the WWU Facilities electricians; I was just shoulder-surfing at this point) flipped the breaker to the On position, which gave the datacenter a transient power flicker on the order of 50-100ms but didn't bring anything down. Second, when we got to the part of the startup procedure that says 'tell the UPS to turn on,' it failed with an error to the effect of, "incorrect phase rotation, startup aborted." This caused the electricians great concern, and they went about validating their wiring.

Which tested out fine. The phases well and truly are wired in correctly; they have very high confidence in this. Which leaves something in the UPS being wonky. So they called the UPS vendor, who is sending a technician up from Seattle to look things over. He should be here any time now. Meanwhile, we've been on naked utility power since 7am this morning.

The electricians are very concerned about that Input breaker tripping. This is a 50KVA 3-phase UPS, and when one of those shorts out, the arc it generates is more accurately described as an explosion. The breaker caught it, as it should have, but the trip suggests a highly energetic event was narrowly avoided. They do not have confidence that we can bring the UPS back up without at least a blip in power to the main load, if not a full-on surge should it fail the wrong way.

The decision was made to prepare for shutting the whole machine room down. This is not a decision made lightly; this is the week before finals, so uptime is even more critical than usual. The call will have to be made by the Vice Provost or the President, and we haven't had word yet on what they've decided. We hear they're considering having the full shutdown start at 1am. We're still planning for it.

This would mark the first time since the datacenter went production back in 2000 that we've had to gracefully shut the whole thing down. The closest we've come was last September, when we had to shut down the EVA3000 in order to upgrade it to an EVA6100, and all servers connected to it had to be shut down as well. We're guessing that it'll take 45 minutes to get everything down, and close to 90 minutes to bring it all back up in the correct order. The electricians will attempt to restart the UPS while the room is down.
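The "correct order" bit is why bringing things up takes twice as long as taking them down: storage and core infrastructure have to be up before anything that depends on them boots, and the shutdown runs the same dependencies in the opposite direction. Just to illustrate the idea (the tiers and host names below are made up, not our actual list):

```python
#!/usr/bin/env python3
"""Illustration of dependency-ordered shutdown/startup. The tier names
and hosts below are invented; this is a sketch of the idea, not our
actual procedure."""

import subprocess

# Ordered from "nothing depends on these" down to "everything depends on these".
TIERS = [
    ("application and ESX guest servers", ["app1", "app2", "esx-cluster"]),
    ("servers attached to the EVA", ["fileserv1", "dbserv1"]),
    ("SAN fabric and EVA storage", ["eva6100"]),
    ("core infrastructure (DNS, directory, monitoring)", ["dc1", "dns1"]),
]


def shutdown_all(dry_run=True):
    """Shut down top-down: apps first, then the storage and infrastructure
    they depend on."""
    for name, hosts in TIERS:
        print(f"Shutting down tier: {name}")
        for host in hosts:
            if dry_run:
                print(f"  would run: ssh {host} 'shutdown -h now'")
            else:
                # Assumes key-based SSH and rights to halt the host.
                subprocess.run(["ssh", host, "shutdown -h now"], check=False)


def startup_order():
    """Bring things back bottom-up: the same list, walked in reverse."""
    for name, hosts in reversed(TIERS):
        yield name, hosts


if __name__ == "__main__":
    shutdown_all(dry_run=True)
    print("--- after the electrical work ---")
    for name, hosts in startup_order():
        print(f"Power on tier: {name} -> {', '.join(hosts)}")
```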

This is an all-hands thing, and we'll have to get in contact with the University parties that have servers in there so they can either shut down for the night or be here to shut down in person. We've designated a pair of admins to sleep through the event so they can be fresh for the morning disasters while the rest of us sleep in.

Of course, the powers-that-be may decide to risk another UPS restart with load. Who knows.

Once a decision has been made, I'm fairly certain an all-points email will go out if the full shutdown is needed. This is why we get paid the big money.

EDIT: It is official. We're taking everything down starting at 1am tonight.