Recently in sysadmin Category

The different kinds of money


Joseph Kern posted this gem to Twitter yesterday.

[Tweet screenshot: CapEx.png]

It's one of those things I never thought about, since I kind of instinctively learned what it is, but I'm sure there are those out there who don't know the difference between a Capital Expenditure and an Operational Expenditure, and what that difference means when it comes time to convince the fiduciary Powers That Be to fork over money to upgrade or install something there is a crying need for.

Capital Expenditures

In short, these are (usually) one-time payments for things you buy outright:

  • Server hardware.
  • Large storage arrays.
  • Perpetual licenses.
  • HVAC units.
  • UPS systems (but not batteries, see below).

Operational Expenditures

These are things that come with an ongoing cost of some kind. Could be monthly, could be annual.

  • Your AWS bill.
  • The Power Company bill for your datacenter.
  • Salaries and benefits for staff.
  • Consumables for your hardware (UPS batteries, disk-drives).
  • Support contract costs.
  • Annual renewal licenses.

Savvy vendors have figured out a fundamental truth to budgeting:

OpEx ends up in the 'base-budget' and doesn't have to be justified every year, so is easier to sell.
CapEx has to be fought for every time you go to the well.

This is part of why perpetual licenses are going away.


But you, the sysadmin with a major problem on your hands, have found a solution for it. It is expensive, which means you need to get approval before you go buy it. It is very important that you know how your organization views these two expense categories. Once you know that, you can vet solutions for their likelihood of acceptance by cost-sensitive upper management. Different companies handle things differently.

Take a scrappy, bootstrapped startup. This is a company that does not have a deep bank-account, likely lives month to month on revenue, and a few bad months in a row can be really bad news. This is a company that is very sensitive to costs right now. Large purchases can be planned for and saved for (just like you do with cars). Increases in OpEx can turn a month in the black into one in the red, and we all know what happens after too many red months. For companies like these, pitch towards CapEx. A few very good months means more cash, cash that can be spent on infrastructure upgrades.

Take a VC-fueled startup. They have a large pile of money somewhere and are living off of it until they can reach profitability. Stable OpEx means calculating runway is easier, something investors and prospective employees like to know. Increased non-people CapEx means more assets to liquidate when the startup goes bust (as most do). OpEx (that AWS bill) is an easier pitch.

Take a civil-service job much like one of my old ones. This is big and plugged into the public finance system. CapEx costs over a certain line go before review (or worse, an RFC process), and really big ones may have to go before law-makers for approval. Departmental budget managers know many ways to... massage... things to get projects approved with minimal overhead. One of those ways is increasing OpEx, which becomes part of the annually approved budget. OpEx is treated differently than CapEx, and is often a lot easier to get approved... so long as costs are predictable 12 months in advance.


The dragon in the datacenter


Systems Administrators have a reputation, a bad one, when it comes to people skills. I saw it at WWU when problems went unreported because users were afraid we'd yell at them for being stupid. I see it every time someone speaks with passion about DevOps improving the adversarial relationship between Dev and Ops. Two different groups of people, two different problems, same root cause.

  1. People without formal training who are experiencing problems we're tasked with fixing (a.k.a. "users").
  2. Formally trained engineers trying to build/maintain a complex system (a.k.a. "dev").

Dealing with the untrained

End users are tricky people. They don't think the way we do. Because they don't know how a system works, they develop completely wrong mythologies for why things break the way they do. They share folk remedies with each other rather than calling for trained assistance. Those folk remedies can actually make things worse.

Dealing with the trained

Developers are tricky people. They're supposed to understand this stuff, but for some reason only get part of it. Or they only really see one part of the whole constellation of the problem-space and don't understand how their actions make things difficult for another part of the puzzle. It's forever frustrating because they're supposed to know better.


Cynicism: (1): The firm belief that the person telling you how to do something differently is blowing smoke up your ass because they don't know it doesn't work that way.
(2): The firm belief that a certain class of person will just never, ever, get it.


Sysadmins become jaded cynics because the end users never get any better, and explaining the same thing over and over again gets old. And it never helps. And they keep doing the same stupid things, over, and over, and over. No amount of training helps. No amount of "intuitive" walk-throughs help. No amount of video tours help. The customer support organization helps filter the blithering lunacy, but it just means the extra special stupid escalates to L3 where we live.

Customer Service is an outlook as much as it is a skill. Far too many of us lack that outlook and aren't motivated to get the skill. The 'customer' we're serving most of the time is an abstraction known as "uptime", which is quantifiable and doesn't file complaints with your boss when you get a bit firm with it over the phone. As an industry we're regular consumers of Customer Support in the form of our vendors and the support contracts we hold with them. We know what we like when we get through to the human:

  • They speak our language.
  • They don't get defensive when we blow steam about our frustrations with their product.
  • When we describe in detail what we think the problem is they don't dismiss our concerns and tell us how it really failed.

The jaded cynic sysadmin doesn't do any of that. We use condescending language (very probably unintentionally condescending). We respond to attacks on our systems by getting defensive. We see a chance to myth-bust and jump on it with glee, describing in detail how that failure mode actually occurred.

When users have problems they don't come to the jaded cynic sysadmin with them. This is driven through a combination of fear of being attacked, disgust that such people are allowed to keep working, and a desire to avoid assholes whenever possible.


Corrosive Cynicism: The belief that no one around you knows how it really works, and it's your job to explain why that is.


Sysadmins become jaded cynics after the developers persistently and stubbornly refuse to pick up the little quirks of the platform they're developing the application on. It gets tiring having to continually disabuse them of their assumptions about how the OS/platform works. You wish they'd talk to you sooner rather than waiting until the end, when all the bad assumptions have been baked in and they have to patch around them.

This is not some ever-changing population of end users, these are your coworkers. You see them every day (or, well, at least once or twice a week at meetings). You're both supporting the same overall problem, but your focus areas are different. They're concerned with algorithmic efficiency, you're concerned with system resources and what consumption rates mean for the future. They're concerned with making this one application work, you're concerned with how that application will fit in to the whole ecosystem of apps that share the same resources.

No one understands how it all fits together but you and your fellow sysadmins. If they came to you earlier, they wouldn't have these problems.

Congratulations, you're a BOFH.

The failure-mode here is the same as it was with the end users, a lack of Customer Service skills. Only instead of an ever-changing population of stupid-doers you have a small population of the willfully ignorant. If you become hard to approach, you'll be fixing messes well after they were cheap and easy to fix. They're avoiding you because you're forever telling them 'no', and you're not exactly nice about it.


From the point of view of others

Green Dragon
  • Alignment: Lawful Evil
  • Breath Weapon: Acid Cone
  • Preferred Habitat: Forests and Datacenters

The jaded cynic sysadmin most definitely works within the system. They may even be the system, but that authority is derived from someone who let them have the keys to the kingdom. However, they're very often the last word when it comes to their systems. This makes them lawful.

The jaded cynic sysadmin never seems to care what others think. They have their own goals, and asking them for stuff doesn't seem to do anything. Bribery can work, though. This makes them evil.

The jaded cynic sysadmin is... not someone you want to piss off. And they're easy to piss off, just existing seems to be enough sometimes. When that happens you risk a verbal flaying. It's called a breath weapon for a reason.

It all began with a bit of Twitter snark:


[Tweet screenshot: SmallLAMPStack.png]

Utilities follow a progression. They begin as a small shell script that does exactly what I need it to do in this one instance. Then someone else wants to use it, so I open source it. 10 years of feature-creep pass, and then you can't use my admin suite without a database server, a web front end, and just maybe a worker-node or two. Sometimes bash just isn't enough, you know? It happens.

Anyway...

Back when Microsoft was just pushing out their 2007 iteration of all of their enterprise software, they added PowerShell support to most things. This was loudly hailed by some of us, as it finally gave us easy scriptability into what had always been a black box with funny screws on it to prevent user tampering. One of the design principles they baked in was that they didn't bother building UI elements for things you'd only do a few times, or would do once a year.

That was a nice time to be a script-friendly Microsoft administrator, since most of the tools would give you their PowerShell equivalents on one of the Wizard pages, so you could learn by practical example a lot more easily than you could otherwise. That was a real nice way to pick up some of the 'how to do a complex thing in PowerShell' bits. Of course, you still had to learn variable passing, control loops, and other basic programming stuff, but you could see right there what the one-liner was for that next -> next -> next -> finish wizard.
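For a concrete feel of it, the Exchange 2007 New Mailbox wizard would end with a summary page showing something close to the one-liner below. This is an illustrative sketch, not copied from a real environment; the name, OU, and database path are made up.

  # Roughly what the wizard's summary page would show.
  # Every value here is a placeholder, not from a real environment.
  New-Mailbox -Name 'Jane Doe' `
      -UserPrincipalName 'jdoe@example.edu' `
      -OrganizationalUnit 'example.edu/Staff' `
      -Database 'MAILSRV01\First Storage Group\Mailbox Database' `
      -Password (Read-Host 'Initial password' -AsSecureString)

Copy that into a script, wrap it in a foreach over a CSV of new hires, and you've gone from wizard-clicking to automation with very little ceremony.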

[Tweet screenshot: SmallLAMPStack-2.png]

One thing that a GUI gives you is a much shallower on-ramp to functionality. You don't have to spend an hour or two feeling your way around a new syntax in order to do one simple thing; you just visually assemble your bits, hit next, then finish, then done. You usually have the advantage of a documented UI explaining what each bit means, a list of fields you have to fill out, and syntax checking on those fields, all of which gives you a lot of information about what kinds of data a task requires. If it spits out a blob of scripting at the end, even better.

An IDE, tab-completion, and other such syntactic magic help scripters build what they need; but it all relies upon on-the-fly programmatic interpretation of syntax in a script-builder. It's the CLI version of a GUI, so it doesn't have the stigma of 'graphical' ("if it can't be done through bash, I won't use it," said the Linux admin).

Neat GUIs and scriptability do not need to be diametrically opposed things; ideally a system should have both: a GUI to aid discoverability and teach a bit of scripting, and scripting for site-specific custom workflows. The two interface paradigms come from different places, but as Microsoft has shown, you can definitely make one tool support the other. More things should follow their example.

10 year blog-anniversary


10 years ago today, I had my first post.

This was done as part of the first big project I was given when I started working for WWU: figure out how to serve web-pages from home directories. Which I did, and this blog was a way to make sure it actually worked. It did. Back then I used Blogger and their FTP publish option to maintain this thing; I've since moved on to my own domain and actual blog-software.

10 years later I'm also starting a brand new job, and am all of 3 days into it so far. By now I'm just beginning to get a handle on the complexity of the problem I'm facing.

I'm not posting as often as I used to. In part that's because I've been working for places that have intellectual property they need to protect, and talking about what I'm working on is frequently a violation of that; and in part there are other outlets for the shorter stuff. Twitter, for instance, and even ServerFault.

I'm still here, and still going. Some pointless stats after the cut.

There is a market for this


On-call?

Don't want the person sharing your bedroom to wake up when you get paged?

There's a widget for that.

If you're a deep sleeper and share a bed with a light-sleeper, this just might be the thing you need to let them keep sleeping after that 1:30am call. Or the thing to let you know you need to check your phone in the middle of a meeting. Either way, sysadmins are a market for this thingy!

As I look around the industry with an eye towards further employment, I've noticed a difference of philosophy between startups and the more established players. One easy way to see this difference is on their job postings.

  • If it says RHEL and VMware on it, they believe in support contracts.
  • If it says CentOS and OpenStack on it, they believe in community support.

For the same reason that tech startups almost never use Windows if they can get away with it, they steer clear of other technologies that come with license costs or mandatory support contracts. Why pay the extra support cost when you can get the same service by hiring extremely smart people and using products with a large peer-support community? Startups run lean, and all that extra cost is... cost.

And yet some companies find that they prefer to run with that extra cost. Some, like StackExchange, don't mind the extra licensing costs of their platform (Windows) because they're experts in it and can make it do exactly what they want it to do with a minimum of friction, which means the Minimum Viable Product gets kicked out the door sooner. A quicker MVP means quicker profitability, and that can pay for the added base-cost right there.

Other companies treat support contracts like insurance: something you carry just in case, as a hedge against disaster. Once you grow to a certain size, business continuity insurance investments start making a lot more sense. Running for the brass ring of market dominance without a net makes sense, but once you've grabbed it, keeping it needs investment. Backup vendors love to quote statistics on the percentage of businesses that fail after a major data-loss incident (it's a high percentage), and once you have a business worth protecting it's good to start protecting it.

This is part of why I'm finding that the long established companies tend to use technologies that come with support. Once you've dominated your sector, keeping that dominance means a contract to have technology experts on call 24/7 from the people who wrote it.

"We may not have to call RedHat very often, but when we do they know it'll be a weird one."


So what happens when startups turn into market dominators? All that no-support Open Source stuff is still there...

They start investing in business continuity, just the form may be different from company to company.

  • Some may make the leap from CentOS to RHEL.
  • Some may contract for 3rd party support for their OSS technologies (such as with 10gen for MongoDB).
  • Some may implement more robust backup solutions.
  • Some may extend their existing high-availability systems to handle large-scale local failures (like datacenter or availability-zone outages).
  • Some may acquire actual Business Continuity Insurance.

Investors may drive adoption of some BC investment, or may actively discourage it. I don't know, I haven't been in those board meetings and can argue both ways on it.

Which one do I prefer?

Honestly, I can work for either style. Lean OSS means a steep learning curve and a strong incentive to become a deep-dive troubleshooter of the platform, which I like to be. Insured means someone has my back if I can't figure it out myself, and I'll learn from watching them solve the problem. I'm easy that way.

Anyone taking DevOps to heart should read about Normal Accidents. The book is about failure modes of nuclear power plants; highly automated and extremely instrumented as they are, they still manage to fail in spite of everything we do. The lessons here carry well into the highly automated environments we try to build in our distributed systems.

There are several key lessons to take from this book and its theory:

  • Root cause can be something seemingly completely unrelated to the actual problem.
  • Contributing causes can sneak in and make what would be a well handled event into something that gets you bad press.
  • Monitoring instrumentation failures can be sneaky contributing causes.
  • Single-failure events are easily handled, and may be invisible.
  • Multiple-failure events are much harder to handle.
  • Multiple-failure events can take months to show up if the individual failures happened over the course of months and were invisible.

The book had a failure mode much like this one:

After analysis, it was known that the flow direction of a specific coolant pipe was a critical item. If backflow occurred, hot fluid could enter areas not designed for handling it. As a result, a system was put in place to monitor flow direction, and automation put in place to close a valve on the pipe if backflow was detected.

After analyzing the entire system after a major event, it was discovered that the flow-sensor had correctly identified backflow, and had activated the valve close automation. However, it was also discovered that the valve had frozen open due to corrosion several months prior to the event. Additionally, the actuator had broken when the solenoid moved to close the valve. As a result, the valve was reported closed, and showed as such on the Operator panel, when in fact it was open.

  • The valve had been subjected to manual examination 9 months before the event, and was due to be checked again in 3 more months. However, it had failed between checks.
  • The actuator system was checked monthly and had passed every check. The actuator breakage happened during one of these monthly checks.
  • The sensor on the actuator was monitoring power draw for the actuator. If the valve was frozen, the actuator should notice an above-normal current draw. However, as the actuator arm was disconnected from the valve it experienced a below-normal current draw and did not detect this as an alarm condition.
  • The breaking of the actuator arm was noted in the maintenance report during the monthly check as a "brief flicker of the lamp" and put down as a 'blip'. The arm failed before the current meter triggered its event. As the system passed later tests, the event was disregarded.
  • The backflow sensor actually installed was not directional. It alarmed on zero-flow, not negative-flow.

Remediations:

  • Instrument the valve itself for open/close state.
  • Introduce new logic so that if the backflow sensor continues to detect backflow after the valve reports closed, raise alarms (a sketch of this cross-check follows this list).
  • Replace the backflow sensor with a directional one as originally called for.
  • Add a new flow sensor behind the valve.
  • Change the alerting on the actuator sensor to alarm on too-low current draw as well.
  • Increase the frequency of visual inspection of the physical plant.
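That second remediation is the same cross-check pattern that serves us well in monitoring distributed systems: never trust a component's report about itself when you can compare it against an independent signal. Here's a minimal PowerShell-flavored sketch of the idea, with Get-ValveState, Get-FlowRate, and Send-Alarm as hypothetical stand-ins for whatever your instrumentation actually exposes:

  # Hedged sketch: alarm when the valve's own report and an independent
  # flow sensor disagree. The three functions used are hypothetical stand-ins.
  function Test-ValveCrossCheck {
      param([int]$GraceSeconds = 30)    # time allowed for flow to die down after a close

      $valveState = Get-ValveState      # what the actuator *claims*: 'Open' or 'Closed'
      $flowRate   = Get-FlowRate        # independent measurement; negative means backflow

      if ($valveState -eq 'Closed' -and [math]::Abs($flowRate) -gt 0) {
          Start-Sleep -Seconds $GraceSeconds
          if ([math]::Abs((Get-FlowRate)) -gt 0) {
              # The report and the reality disagree; that disagreement is the alarm.
              Send-Alarm -Severity 'Critical' -Message 'Valve reports closed but flow persists.'
          }
      }
  }

The point isn't the specific check; it's that disagreement between two independent measurements is itself an alarmable condition.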

That valve being open caused Fun Times To Be Had. If that valve system had been operating correctly, the fault that caused the backflow would have been isolated as the system designers intended and the overall damage contained. However, this contributing cause, one that happened months before the triggering event, turned a minor problem into a major one.

So, why did that reactor release radioactive materials into the environment? Well, it's complicated...

And yet, after reading the post-mortem report you look at what actually failed and think, 'and these are the jokers running our nuclear power plants? We're lucky we're not all glowing in the dark!'

We get the same kind of fault-trees in massively automated distributed systems. Take this entirely fictional, but oh-so-plausible failure cascade:

ExampleCorp was notified by their datacenter provider of the need for emergency power maintenance in their primary datacenter. ExampleCorp (EC) operated a backup datacenter and had implemented a hot failover method, tested twice a year, for moving production to the backup facility. EC elected to perform a hot failover to the backup facility prior to the power work in their primary facility.

Shortly after the failover completed the backup facility crashed hard. Automation attempted to fail back to the primary facility, but technicians at the primary facility had already begun, but not yet completed, safe-shutdown procedures. As a result, the fail-back was interrupted part way through, and production stopped cold.

Service recovery happened at the primary site after the power maintenance completed. However, the cold-start script was out of date by over a year, so restoration was hampered by differences that came up during the startup process.

Analysis after the fact isolated several causes of the extensive downtime:

  • In the time since the last hot-failover test, EC had deployed a new three-node management cluster for their network switch configuration and software management system, one three-node cluster for each site.
  • The EC-built DNS synchronization script used to keep the backup and primary sites in sync was transaction-oriented. A network fault five weeks before the event meant the transactions related to the DNS update for the cluster deployment were dropped, and nobody noticed.
  • The old three-node clusters were kept online "just in case".
  • The difference in cluster software versions between the two sites was displayed in EC's monitoring panel, but was not alarmed on, and was disregarded as a 'glitch' by Operations. Interviews show that Ops staff are aware that the monitoring system will sometimes hold onto stale data if it isn't part of an alarm.
  • At the time of the cluster migration Operations was testing a new switch firmware image. The image on the old cluster was determined to have a critical loading bug, which required attention from the switch vendor.
  • Two weeks prior to the event EC performed an update of switch firmware using new code that passed validation. The new firmware was replicated to all cluster members in both sites using automation based on the IP addresses of the cluster members. The old cluster members were not updated.
  • The automation driving the switch firmware update relied on the non-synchronized DNS entries, and reported no problems applying updates. The primary site got the known-good firmware, the backup site got the known-bad firmware.
  • The hot-failover network load triggered the fault in the backup site's switch firmware, causing switches to reboot every 5 minutes.
  • Recovery logic in the application attempted to work around the massive network faults and ended up duplicating some database transactions, and losing others. Some corrupted data was transferred to the primary site before it was fully shut down.
  • Lack of technical personnel physically at the backup site hampered recovery from the backup site and extended the outage.
  • Out-of-date documentation hampered efforts to restart services from a cold stop.
  • The inconsistent state of the databases further delayed recovery.

That is a terrible-horrible-no-good-very-bad-day, yes indeed. However, it shows what I'm talking about here. Several small errors crept in and turned what was supposed to be a perfectly handleable fault into something that caused many hours of downtime. The drift would have been discovered during the next routine failover test, but that hadn't happened yet.
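One detail worth pulling out of that list: the DNS synchronization script was transaction-oriented, so a single dropped batch of changes meant silent drift forever after. A state-oriented sync compares what actually exists at both sites on every run and converges the difference, so a missed run only delays convergence instead of losing it. A hedged sketch, with Get-SiteDnsRecords and Set-SiteDnsRecord as hypothetical stand-ins for whatever your DNS platform really exposes:

  # Hedged sketch of state-based reconciliation instead of transaction replay.
  # Get-SiteDnsRecords / Set-SiteDnsRecord are hypothetical stand-ins; assume
  # they return/accept hashtables of record-name -> target.
  $primary = Get-SiteDnsRecords -Site 'primary'
  $backup  = Get-SiteDnsRecords -Site 'backup'

  foreach ($name in $primary.Keys) {
      if ($backup[$name] -ne $primary[$name]) {
          # Missing or different at the backup site: converge it now, whether
          # or not we ever saw the original change go by.
          Set-SiteDnsRecord -Site 'backup' -Name $name -Value $primary[$name]
      }
  }

  foreach ($name in $backup.Keys) {
      if (-not $primary.ContainsKey($name)) {
          # Stale records get reported, not silently deleted, so a human sees the drift.
          Write-Warning "Stale record at backup site: $name"
      }
  }

A script written this way would have healed the drift on its next run instead of carrying it for five weeks.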

Just like the nuke-plant failure, reading this list makes you go "what kind of cowboy outfit allows this kind of thing to happen?"

Or maybe, if it has happened to you, "Oh crimeny, I've so been there. Here's hoping I retire before it happens again."

It happens to us all. Netflix reduces this through the Chaos Monkey, using it to visibly trigger these small failures before they can cascade into big ones. And yet even they fall over when a really big failure happens naturally.

What can you do?

  • Accept that the multiple-failure combinatorics problem is infinite and you won't be able to capture every fail case.
  • Build your system to be as disaster resilient as possible.
  • Test your remediations, and do so regularly.
  • Validate your instrumentation is returning good results, and do so regularly.
  • Cross-check where possible.
  • Investigate glitches, and keep doing it after it gets tediously boring.
  • Cause small failures and force your system to respond to them (see the sketch after this list).
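On that last point: you don't need Netflix-scale tooling to start causing small failures. Something as modest as the sketch below, run on a schedule when people are awake to watch, makes your failover machinery prove itself regularly. The service names are made-up examples; only ever point something like this at components you genuinely believe are redundant.

  # Minimal chaos-style failure injection: stop one supposedly-redundant service
  # at random and verify that monitoring and failover actually notice and recover.
  # The candidate list below is illustrative, not a recommendation.
  $candidates = @('worker-queue', 'cache-node', 'report-generator')
  $victim = Get-Random -InputObject $candidates

  Write-Output "Injecting failure: stopping '$victim' at $(Get-Date)"
  Stop-Service -Name $victim

  # Deliberately no restart here; the whole point is to see whether your
  # automation (or your on-call human) handles it without hand-holding.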

These are all known best-practices, and yet people are lazy, or can't get sufficient management buy-in to do it (a 'minimum viable product' is likely excessively vulnerable to this kind of thing). We do what we can, snark at those who visibly can't, and hope our turn doesn't come up.

Or: the older you get, the stronger the imperative is to automate fault handling in your environment.

Last night I got a text at 2:36am. Being the sysadminly on-call type that I am, I leaped out of bed still half asleep. I keep my phones on a cell phone stand that's really loud when something is vibrating on it, so I've been trained to get up when that racket starts. This was my phone going buzz, which meant I missed the work-phone going buzz (I have two for a reason).

It was when the phone was in my hand that I remembered that I'm not on any on-call schedule right now.

R u male or female?

From a west-coast area code.

An interesting question as it happens, but not my infrastructure crying for parental guidance. Or worse, the actual on-call people being over their heads and needing my help. Nope, just some lost soul hoping for love.

Shut off phone, went back to bed.

Whereupon I stayed awake another two hours.

Had this been an actual emergency, work would have been understanding of me coming in late after a late-night call-out. Like that time I dealt with a major infrastructure failure on 3.5 hours of sleep. They're nice that way.

Sleep disruption like that is getting more common as I get older. I give thanks that my infrastructure doesn't usually cry mommy in the wee hours, and if there is crying it's more likely to happen in the evenings than at 3am. That evening call-out may have me working until 3am, but that's better than getting 2 hours of sleep and getting woken up.

This is a goal that many sysadmins aspire to. On a new person's first day on the job, they have a computing asset upon which they can work, and all of their accounts are accessible and configured so they can do their work. No friction start. Awesome.

How this worked at WWU when I was there:

  1. HR flagged a new employee record as Active.
  2. Nightly batch process notices the state-change for this user and kicks off a Create Accounts event.
  3. CreateAccounts kicks off system-specific account-create events, setting up group memberships based on account type (Student/Faculty/Staff).
    1. Active Directory
    2. Blackboard
    3. Banner
    4. If student: Live@EDU.
    5. Others.
  4. User shows up. Supervisor takes them to the Account Activation page.
  5. User Activates their account, setting a password and security questions.
  6. Automation propagates the password into all connected systems.
  7. User starts their day.

It worked. There was that activation step, but it was just the one, and once that was done they were all in.

This worked because the systems we were trying to connect all had either explicit single-sign-on support, or were sufficiently scriptable that we could write our own.
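In spirit, the nightly batch looked something like the sketch below. This is not the actual WWU code; Get-NewHREmployees and the Blackboard/Banner/Live@EDU functions are hypothetical stand-ins for the site-specific glue, while New-ADUser is the real Active Directory cmdlet.

  # Hedged sketch of the nightly provisioning batch, not the real thing.
  # Get-NewHREmployees, Add-BlackboardAccount, Add-BannerAccount, and
  # Add-LiveAtEduAccount are hypothetical stand-ins for site-specific glue.
  Import-Module ActiveDirectory

  foreach ($person in Get-NewHREmployees) {          # HR records newly flagged Active
      $ou = switch ($person.Type) {
          'Student' { 'OU=Students,DC=example,DC=edu' }
          'Faculty' { 'OU=Faculty,DC=example,DC=edu' }
          default   { 'OU=Staff,DC=example,DC=edu' }
      }

      # Create the directory account disabled; it gets a real password and is
      # enabled when the user walks through the Account Activation page.
      New-ADUser -Name $person.DisplayName `
          -SamAccountName $person.Username `
          -Path $ou `
          -Enabled $false

      Add-BlackboardAccount -User $person
      Add-BannerAccount -User $person
      if ($person.Type -eq 'Student') { Add-LiveAtEduAccount -User $person }
  }

The details matter less than the shape: one trusted source of truth (HR), and everything downstream keyed off it automatically.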

I'm now working for a scrappy startup that is very fond of web-applications, since it means we don't have to maintain the infrastructure ourselves. The above workflow... doesn't exist. It's not even possible. Very roughly, and with the details filed off:

  1. Invite user to join Google Apps domain.
  2. Wait for user to accept invite and log in.
  3. Send invites from 3 separate web-applications that we use daily.
  4. Wait for user to accept all the invites, create accounts, and suchlike.
  5. Add the new accounts to app-internal groups.
  6. Send invites from the 2 web-applications with short "email verification" windows when the user is in the office and can get them.
  7. Add the new accounts to other app-internal groups.

The side-effect of all of this is that the user has an account before their official start-date, they don't get all of their accounts until well after 8am, and admin users have to do things by hand. Of those 5 web-apps, only 2 have anything even remotely resembling an SSO hook.

There is an alternate workflow here, but it has its own problems. That workflow:

  1. Hard-create the new user in Google Apps and put them in the 2-factor authentication bypass group. Write down the password assigned to the user.
  2. Log in to Google Apps as that user.
  3. Invite the user to the 5 web-applications.
  4. Accept the invites and create users.
  5. Add the new account to whatever groups inside those web-apps.
  6. New user shows up.
  7. Give user the username and password set in step 1.
  8. Give user the username and password for everything created in step 4.
  9. Walk the user through installing a Password Manager in their browser of choice.
  10. Walk the user through changing their passwords on everything set in steps 1 and 4.
  11. Walk the user through setting up 2-factor.
  12. Take user out of 2-factor bypass group.

This second flow is much more acceptable to an admin since setup can be done in one sitting, and final setup can be done once the user shows up. However, it does involve written-down passwords. In the case of a remote user, it'll involve passwords passed through IM, SMS, or maybe even email.

That one "Account Activation" page we had at WWU? Pipe-dream.

At some point we'll hit the inflection point between "Scrappy Startup" and "Maturing Market Leader" and the overhead of onboarding new users (and offboarding the old ones) will become onerous enough that we'll spend serious engineering resources coming up with ways to streamline the process. That may mean migrating to web-apps that have SSO hooks.

You know what would make my life a lot easier now?

If more web-apps supported either OpenID or Google Login.

It's one fewer authentication domain I have to manage, and that's always good.

Yes, that happens


We all know it can happen: a BIOS update of some kind bricks whatever just got flashed. But it's one of those things you hope happens to other people first so you know not to go there. It happened to me recently, which got me thinking about continuous deployment from a hardware POV. Hardware being what it is (hard), you can't iterate and roll back the way you can with software. There is no such thing as Vagrant for Embedded Systems that I've found!

The problem of "when do I update the firmware for my server?" is one that faces anyone with a physical infrastructure. There isn't really a globally accepted best-practice for this one, though the closest I can find is:

  • If the vendor lists the update as critical, apply it.
  • If you're experiencing one of the problems listed in the fixes, apply it.
  • If vendor tech-support tells you to apply it, apply it.
  • Otherwise, don't apply it.

But only apply it to a test device first to verify it actually fixes the problem. Then roll it out.

Doing so pro-actively is kind of risky, and only really useful in repurposing scenarios. Also, this 'best practice' assumes you have identical hardware to actually test with. Which a lot of us don't, and often can't due to slight differences between servers of the same model.
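If you want that rule of thumb above in runbook-pasteable form, it boils down to a three-question check. A trivial sketch, nothing more:

  # The firmware rule of thumb as a function: apply only when there's a concrete reason.
  function Test-ShouldApplyFirmware {
      param(
          [bool]$VendorSaysCritical,
          [bool]$ExperiencingListedProblem,
          [bool]$SupportToldUsTo
      )
      # Any one 'yes' justifies the update; otherwise leave working firmware alone.
      return ($VendorSaysCritical -or $ExperiencingListedProblem -or $SupportToldUsTo)
  }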

So. For those of us who are working on infrastructures either small enough to not be able to afford test hardware, or diverse enough that there is no such thing as a common class of machine, what are we to do?

Hope, mostly, and trust in your vendor support contracts to ship you new hardware in case you get a brick.

Or, trust in your redundancies and treat new-firmware-updates like a lost-server outage. If you get a brick, you're still within your failure tolerance and know not to go there for the rest of 'em. This is the approach we ended up taking, and it worked. We were running without our scale-test environment for a few days, but production was unaffected while we worked to unbrick the affected machines.

In our case I suspect we had a v1.0 hardware revision, and the newest firmware was only backwards compatible for v1.0a and newer or something. I don't have proof of this, but that's what it feels like. Of course, this eventuality was not mentioned in the release-notes anywhere. Thus, testing.
