We all know it can happen: a BIOS update of some kind bricks whatever just got flashed. It's one of those things you hope happens to other people first so you know not to go there. It happened to me recently, which got me thinking about continuous deployment from a hardware POV. Hardware being what it is, hard, you can't iterate and roll back the way you can with software. There is no such thing as Vagrant for Embedded Systems that I've found!
The problem of "when do I update the firmware for my server" is one that faces anyone with physical infrastructure. There isn't really a globally accepted best practice for this one, though the closest I can find is:
If the vendor lists the update as critical, apply it.
If you're experiencing one of the problems listed in the fixes, apply it.
If vendor tech-support tells you to apply it, apply it.
Otherwise, don't apply it.
In every case, apply it to a test device first to verify it actually fixes the problem. Then roll it out.
Applying updates proactively is kind of risky, and only really useful in repurposing scenarios. Also, this 'best practice' assumes you have identical hardware to test with. A lot of us don't, and often can't, because of slight differences between servers of the same model.
So. For those of us working on infrastructures either too small to afford test hardware, or diverse enough that there is no such thing as a common class of machine, what are we to do?
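For what it's worth, the rule of thumb above is simple enough to write down as code. This is only a sketch of the decision, not anyone's official policy; the FirmwareUpdate class, its field names, and the should_apply and rollout_order functions are all made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class FirmwareUpdate:
    """Hypothetical record of what we know about a firmware release."""
    vendor_critical: bool = False        # vendor lists the update as critical
    fixes_current_problem: bool = False  # release notes list a problem we actually have
    support_requested: bool = False      # vendor tech support told us to apply it


def should_apply(update: FirmwareUpdate) -> bool:
    """Apply only if at least one of the three reasons holds; otherwise skip."""
    return (
        update.vendor_critical
        or update.fixes_current_problem
        or update.support_requested
    )


def rollout_order(test_devices: Sequence[str], production: Sequence[str]) -> List[str]:
    """Test devices always get flashed before anything in production."""
    return list(test_devices) + list(production)
```

A release with none of the three flags set comes back False, which is the "otherwise, don't apply it" case.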
Hope, mostly, and trust in your vendor support contracts to ship you new hardware in case you get a brick.
Or, trust in your redundancies and treat a firmware update like a lost-server outage. If you get a brick, you're still within your failure tolerance and know not to go there for the rest of 'em. This is the approach we ended up taking, and it worked. We ran without our scale-test environment for a few days until we could unbrick the affected machines, but production was unaffected.
In our case I suspect we had a v1.0 hardware revision, and the newest firmware was only backwards compatible with v1.0a and newer or something. I don't have proof of this, but that's what it feels like. Of course, this eventuality was not mentioned anywhere in the release notes. Thus, testing.
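A rough sketch of that approach: flash one machine at a time and stop at the first brick, so you never lose more than the one server you can afford to lose. The staged_rollout function, the flash callback, and the spare_capacity parameter are hypothetical names for illustration; the actual flashing step depends entirely on your vendor's tooling and is not shown.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def staged_rollout(
    hosts: Sequence[str],
    flash: Callable[[str], bool],
    spare_capacity: int = 1,
) -> Tuple[List[str], Optional[str]]:
    """Flash hosts one at a time and stop at the first failure.

    flash(host) is assumed to return True if the host came back healthy
    and False if it is now a brick. spare_capacity is how many servers
    we can lose without an outage.
    """
    if spare_capacity < 1:
        # With no redundancy, a single brick is an outage, so don't start.
        raise ValueError("no failure tolerance: a brick would cause an outage")

    updated: List[str] = []
    for host in hosts:
        if not flash(host):
            # One brick is within tolerance; leave the rest alone.
            return updated, host
        updated.append(host)
    return updated, None
```

The return value is the list of machines that took the update, plus the name of the one that bricked (or None if they all survived).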