Recently in virtualization Category

Immutable infrastructure

This concept has confused me for years, but I'm beginning to get caught-up on enough of the nuance to hold opinions on it that I'm willing to defend.

This is why I have a blog.

What is immutable infrastructure?

This is a movement of systems design that holds to a few principles:

  • You shouldn't have SSH/Ansible enabled on your machines.
  • Once a box/image/instance is deployed, you don't touch it again until it dies.
  • Yes, even for troubleshooting purposes.

Pretty simple on the face of it. Don't like how an instance is behaving? Build a new one and replace the bad instance. QED.

Refining the definition

The yes, even for troubleshooting purposes concept encodes another concept rolling through the industry right now: observability.

You can't do true immutable infrastructure until after you've already gotten robust observability tools in place. Otherwise, the SSHing and monkey-patching will still be happening.

So, Immutable Infrastructure and Observability. That makes a bit more sense to this old-timer.

Example systems

There are two design-patterns that structurally force you into taking observability principles into account, due to how they're built:

  • Kubernetes/Docker-anything
  • Serverless

Both of these make traditional log-file management somewhat more complex, so if engineering wants their Kibana interface into system telemetry, they're going to have to find ways to get that telemetry off the logic tier and into a central place using something other than log-files. Telemetry is the first step towards observability, and one most companies take instinctively.
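To make that concrete, here's a minimal sketch (names and fields invented) of emitting telemetry as structured JSON lines to stdout, where a log shipper or the container runtime can collect it, instead of appending to a local log file:

```python
import json
import sys
import time

def emit_event(service, event, stream=sys.stdout, **fields):
    """Emit one structured telemetry event as a JSON line.

    A log shipper (or the container runtime itself) picks these up
    from stdout, so the instance's local disk never holds telemetry.
    """
    record = {
        "ts": time.time(),
        "service": service,
        "event": event,
        **fields,
    }
    stream.write(json.dumps(record) + "\n")
    return record

# Example: record a served request instead of appending to a local log.
rec = emit_event("checkout", "request_served", status=200, latency_ms=42)
```

Once everything is a structured event on a stream, centralizing it is the collector's problem, not the instance's.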

Additionally, the (theoretically) rapid iterability of containers/functions means much less temptation to monkey-patch. Slow iteration means more incentive to SSH or monkey-patch, because that's faster than waiting for an AMI or template-image to bake.

The concept so many seem to miss

This is pretty simple.

Immutable infrastructure only applies to the pieces of your infrastructure that hold no state.

And its corollary:

If you want immutable infrastructure, you have to design your logic layers to not assume local state for any reason.

Which is to say, immutable infrastructure needs to be a DevOps thing, not just an Ops thing. Dev needs to care about it as well. If that means in-progress file-transforms get saved to a memcache/gluster/redis cluster instead of the local filesystem, so be it.
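As a sketch of what that Dev-side change looks like, here's a hypothetical file-transform job that checkpoints progress to a shared key/value store (a dict stands in for redis/memcache/gluster here) instead of the local filesystem:

```python
class StateStore:
    """Minimal key/value interface. A real deployment would back this
    with Redis/memcached/etc. so ANY instance can pick up the work."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

def resume_transform(store, job_id, chunks):
    """Process chunks, checkpointing progress to the shared store
    instead of local disk. If this instance dies mid-job, its
    replacement resumes from the last checkpoint."""
    done = store.get(job_id, 0)
    for i in range(done, len(chunks)):
        # ... transform chunks[i] here ...
        store.put(job_id, i + 1)  # checkpoint externally, not locally
    return store.get(job_id)
```

The instance itself holds nothing worth keeping, so killing and replacing it costs you nothing.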

This also means that you will have both immutable and bastion infrastructures in the same overall system. Immutable for your logic, bastion for your databases and data-stores. Serverless for your NodeJS code, maintenance-windows and patching-cycles for your Postgres clusters. Applying immutable patterns to components that take literal hours to recover/re-replicate introduces risk in ways that treating them as what they are, mutable, would not.

Yeahbut, public cloud! I don't run any instances!

So, you've gone full Serverless: all of your state sits in something like AWS RDS, ElastiCache, and DynamoDB, and you're using Workspaces for your 'inside' operations. No SSHing, to be sure. This is about as automated as you can get. Even so, there are still some state operations you are subject to:

  • RDS DB failovers still yield several to many seconds of "The database told me to bugger off" errors.
  • RDS DB version upgrades still require a carefully choreographed dance to ensure your site continues to function, if a bit glitchily, during the window.
  • ElastiCache failovers still cause extensive latency as your underlying SDKs catch up to the new read/write replica location.
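The client-side consequence of those failover blips is that your logic layer needs retry logic. A minimal sketch, with invented names and ConnectionError standing in for whatever transient error your DB driver actually raises:

```python
import time

def with_retries(fn, attempts=5, base_delay=0.01, transient=(ConnectionError,)):
    """Call fn(), retrying transient errors with exponential backoff.

    This is the client-side half of surviving an RDS/ElastiCache
    failover: absorb the seconds of 'the database told me to bugger
    off' errors instead of surfacing them to users."""
    for attempt in range(attempts):
        try:
            return fn()
        except transient:
            if attempt == attempts - 1:
                raise  # out of patience; let it escalate
            time.sleep(base_delay * (2 ** attempt))
```

Most real drivers and SDKs have a knob for this; the point is that somebody has to turn it before the failover happens, not after.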

You're still not purely immutable, but you're as close as you can get in this modern age. Be proud.

I've seen this dynamic happen a couple of times now. It goes kind of like this:

October: We're going all in on AWS! It's the future. Embrace it.
November: IT is working very hard on moving us there, thank you for your patience.
December: We're in! Enjoy the future.
January: This AWS bill is intolerable. Turn off everything we don't need.
February: Stop migrating things to AWS, we'll keep these specific systems on-prem for now.
March: Move these systems out of AWS.
April: Nothing gets moved to AWS unless it produces more revenue than it costs to run.

What's prompting this is a shock that is entirely predictable, but manages to penetrate the reality distortion field of upper management because the shock is to the pocketbook. They notice that kind of thing. To illustrate what I'm talking about, here is a made-up graph showing technology spend over a course of several years.

The AWS line actually results in more money over time, as AWS does a good job of capturing costs that the traditional method generally ignores or assumes are lost in general overhead. But the screaming doesn't happen at the end of four years when they run the numbers; it happens in month four, when the ongoing operational spend, after build-out is done, is w-a-y over what it used to be.

The spikes for traditional on-prem work are for forklifts of machinery showing up. Money is spent, new things show up, and they impact the monthly spend only infrequently. In this case, the base-charge increased only twice over the time-span. Some of those spikes are for things like maintenance-contract renewals, which don't impact base-spend one whit.

The AWS line is much less spiky, as new capabilities are assumed into the base-budget on an ongoing basis. You're no longer dropping $125K in a single go, you're dribbling it out over the course of a year or more. AWS price-drops mean that monthly spend actually goes down a few times.

Pay only for what you use!

Amazon is great at pointing that out, and highlighting the convenience of it. But what they don't mention is that by doing so, you will learn the hard way what it is you really use. The AWS Calculator is an awesome tool, but if you don't know how your current environment works, using it to predict what you'll end up spending is like throwing darts at a wall. You end up obsessing over small line-item charges you've never had to worry about before (how many IOPS do we do? Crap! I don't know! How many thousands will that cost us?), and missing the big items that nail you (Whoa! They meter bandwidth between AZs? Maybe we shouldn't be running our Hadoop cluster in multi-AZ mode).
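As a toy illustration of the kind of line item that sneaks up on you, here's the inter-AZ bandwidth arithmetic. The per-GB rate is invented for the example; check the current AWS pricing pages for real numbers:

```python
def monthly_interaz_cost(gb_per_day, price_per_gb=0.01, days=30):
    """Rough inter-AZ transfer bill.

    price_per_gb is illustrative, NOT a quoted AWS rate. The point is
    that a chatty cluster replicating across AZs pays for every
    gigabyte, every day, forever."""
    return gb_per_day * days * price_per_gb

# A Hadoop cluster shuffling 2 TB/day across AZs:
cost = monthly_interaz_cost(2000)  # $600/month at the assumed rate
```

Nobody budgets for that line item the first time, because on-prem that traffic crossed a switch you already owned.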

There is a reason that third party AWS integrators are a thriving market.

Also, this 'what you use' is not subject to Oops exceptions without a lot of wrangling with Account Management. Have something that downloaded the entire EPEL repo twice a day for a month, and only learned about it when your bandwidth charge was 9x what it should be? Too bad, pay up or we'll turn the account off.

Unlike the forklift model, you pay for it every month without fail. If you have a bad quarter, you can't just not pay the bill for a few months and true-up later. You're spending it, or they're turning your account off. This takes away some of the cost-shifting flexibility the old style had.

Unlike the forklift model, AWS prices its stuff assuming a three-year turnover rate. Many companies assume a 5-to-7-year lifetime for IT assets: three to four years in production, with an afterlife of two to five years in various pre-prod, mirror, staging, and development roles. The cost of those assets therefore amortizes over 5-9 years, not 3.
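The arithmetic is simple enough to sketch. Both figures below use the $125K forklift from earlier as the example; the lifetimes are the ones discussed above:

```python
def monthly_amortized(capex, years):
    """Spread a hardware purchase evenly over its service life."""
    return capex / (years * 12)

# The same $125K forklift of hardware, amortized AWS-style (3 years)
# vs. the 5-9 year life many shops actually get out of it:
aws_style = monthly_amortized(125_000, 3)     # ~$3472/month
onprem_style = monthly_amortized(125_000, 7)  # ~$1488/month
```

That gap is a big part of why lift-and-shift migrations look so much more expensive than the hardware they replaced.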

Predictable spending, at last.


Yes, it is predictable over time, given an accurate understanding of what is under management. But when your initial predictions end up wildly off, it seems like it isn't predictable. It seems like you're being raked over the coals.

And when you get a new system into AWS and the cost forecast is wildly off, it doesn't seem predictable.

And when your system gets the rocket-launch you've been craving and you're scaling like mad; but the scale-costs don't match your cost forecast, it doesn't seem predictable.

It's only predictable if you fully understand the cost items and how your systems interact with them.

Reserved instances will save you money

Yes! They will! Quite a lot of it, in fact. They let a company go back to the forklift-method of cost-accounting, at least for part of it. I need 100 m3.large instances, on a three year up-front model. OK! Monthly charges drop drastically, and the monthly spend chart begins to look like the old model again.


Reserved instances cost a lot of money up front. That's the point, that's the trade-off for getting a cheaper annual spend. But many companies get into AWS because they see it as cheaper than on-prem. Which means they're sensitive to one-month cost-spikes, which in turn means buying reserved instances doesn't happen and they stay on the high cost on-demand model.
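A quick break-even sketch (all prices invented) shows why the up-front spike pays off if you can stomach it:

```python
def breakeven_months(upfront, ondemand_monthly, reserved_monthly):
    """Months until a reserved instance's upfront cost is repaid by
    the lower monthly rate. All prices here are hypothetical, not
    quoted AWS rates."""
    savings = ondemand_monthly - reserved_monthly
    if savings <= 0:
        return None  # reservation never pays for itself
    return upfront / savings

# e.g. $800 upfront, $100/mo on-demand vs. $35/mo reserved:
months = breakeven_months(800, 100, 35)  # ~12.3 months of a 36-month term
```

Everything after the break-even point is pure savings, but you have to survive the up-front hit first, which is exactly what cost-spike-sensitive management won't approve.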

AWS is Elastic!

Elastic in that you can scale up and down at will, without week/month long billing, delivery and integration cycles.

Elastic in that you have choice in your cost accounting methods. On-demand, and various kinds of reserved instances.

It is not elastic in when the bill is due.

It is not elastic with individual asset pricing, no matter how special you are as a company.

All of these things trip up upper, non-technical management. I've seen it happen three times now, and I'm sure I'll see it again at some point.

Maybe this will help you illuminate the issues with your own management.

Redundancy in the Cloud

Strange as it might be to contemplate, imagine what would happen if AWS went into receivership and was shut down to liquidate assets. What would that mean for your infrastructure? Project? Or even startup?

It would be pretty bad.

Startups have been deploying preferentially on AWS or other Cloud services for some time now, in part due to venture-capitalist push to not have physical infrastructure to liquidate should the startup go *pop*, and to scale fast should a much-desired rocket-launch happen. If AWS shut down fully for, say, a week, the impact on pretty much everything would be tremendous.

Or what if it was Azure? Fully debilitating for those that are on it, but the wide impacts would be less.

Cloud vendors are big things. In the old physical days we dealt with the all-our-eggs-in-one-basket problem by putting eggs in multiple places. If you're on AWS, Amazon is very big on making sure you deploy across multiple Availability Zones, and helps you become multi-region if that's important to you. See? More than one basket for your eggs. I have to presume Azure and the others are similar, since I haven't used them.

Do you put your product on multiple cloud-vendors as your more-than-one-basket approach?

It isn't as easy as it was with datacenters, that's for sure.

This approach can work if you treat the Cloud vendors as nothing but Virtualization and block-storage vendors. The multiple-datacenter approach worked in large part because colos sell only a few things that impact the technology (power, space, network connectivity, physical access controls), though pricing and policies may differ wildly. Cloud vendors are not like that, they differentiate in areas that are technically relevant.

Do you deploy your own MySQL servers, or do you use RDS?
Do you deploy your own MongoDB servers, or do you use DynamoDB?
Do you deploy your own CDN, or do you use CloudFront?
Do you deploy your own Redis group, or do you use ElastiCache?
Do you deploy your own Chef, or do you use OpsWorks?

The deeper down the hole of Managed Services you dive (and Amazon is very invested in pushing people to use them), the harder it is to take your toys and go elsewhere, or to run your toys on multiple Cloud infrastructures. Azure and the other vendors are building up their own managed service offerings because AWS is successfully differentiating from everyone else by having the widest offering. The end-game here is to have enough managed services offerings that virtual private servers don't need to be used at all.

Deploying your product on multiple cloud vendors requires either eschewing managed-services entirely, or accepting greater management overhead due to very significant differences in how certain parts of your stack are managed. Cloud vendors are very much Infrastructure-as-Code, and deploying on both AWS and Azure is like deploying the same application in Java and .NET; it takes a lot of work, the dialect differences can be insurmountable, and the expertise required means different people are going to be working on each environment which creates organizational challenges. Deploying on multiple cloud-vendors is far harder than deploying in multiple physical datacenters, and this is very much intentional.
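To give a flavor of the 'eschew managed services' route: every vendor service you'd otherwise consume gets hidden behind an interface you own, with one adapter per cloud. A minimal sketch (the in-memory backend stands in for real SQS or Service Bus adapters, which would wrap the vendor SDKs):

```python
from abc import ABC, abstractmethod

class QueueService(ABC):
    """The price of multi-cloud: every managed service you touch gets
    wrapped in an interface like this, with one adapter per vendor
    (SQS, Azure Service Bus, ...)."""
    @abstractmethod
    def send(self, message): ...

    @abstractmethod
    def receive(self): ...

class InMemoryQueue(QueueService):
    """Stand-in backend for local testing; real adapters would call
    out to the vendor SDK instead of a list."""
    def __init__(self):
        self._items = []

    def send(self, message):
        self._items.append(message)

    def receive(self):
        return self._items.pop(0) if self._items else None
```

Multiply that wrapper by every managed service in your stack and you can see where the management overhead comes from.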

It can be done, it just takes drive.

  • New features will be deployed on one infrastructure before the others, and the others will follow on as the integration teams figure out how to port it.
  • Some features may only ever live on one infrastructure as they're not deemed important enough to go to all of the effort to port to another infrastructure. Even if policy says everything must be multi-infrastructure, because that's how people work.
  • The extra overhead of running in multiple infrastructures is guaranteed to become a target during cost-cutting drives.

The ChannelRegister article's assertion that AWS is now in "too big to fail" territory, and thus would require governmental support to prevent wide-spread industry collapse, is reasonable. It just plain costs too much to plan for that kind of disaster in corporate disaster-response planning.

As I look around the industry with an eye towards further employment, I've noticed a difference of philosophy between startups and the more established players. One easy way to see this difference is on their job postings.

  • If it says RHEL and VMWare on it, they believe in support contracts.
  • If it says CentOS and OpenStack on it, they believe in community support.

For the same reason that tech startups almost never use Windows if they can get away with it, they steer clear of other technologies that come with license costs or mandatory support contracts. Why pay the extra support cost when you can get the same service by hiring extremely smart people and using products with a large peer-support community? Startups run lean, and all that extra cost is... cost.

And yet some companies find that they prefer to run with that extra cost. Some, like StackExchange, don't mind the extra licensing costs of their platform (Windows) because they're experts in it and can make it do exactly what they want it to do with a minimum of friction, which means the Minimum Viable Product gets kicked out the door sooner. A quicker MVP means quicker profitability, and that can pay for the added base-cost right there.

Other companies treat support contracts like insurance: something you carry just in case, as a hedge against disaster. Once you grow to a certain size, business-continuity insurance investments start making a lot more sense. Running for the brass ring of market dominance without a net makes sense, but once you've grabbed it, keeping it needs investment. Backup vendors love to quote statistics on the percentage of businesses that fail after a major data-loss incident (it's a high percentage), and once you have a business worth protecting it's good to start protecting it.

This is part of why I'm finding that the long established companies tend to use technologies that come with support. Once you've dominated your sector, keeping that dominance means a contract to have technology experts on call 24/7 from the people who wrote it.

"We may not have to call RedHat very often, but when we do they know it'll be a weird one."

So what happens when startups turn into market dominators? All that no-support Open Source stuff is still there...

They start investing in business continuity, just the form may be different from company to company.

  • Some may make the leap from CentOS to RHEL.
  • Some may contract for 3rd party support for their OSS technologies (such as with 10gen for MongoDB).
  • Some may implement more robust backup solutions.
  • Some may extend their existing high-availability systems to handle large-scale local failures (like datacenter or availability-zone outages).
  • Some may acquire actual Business Continuity Insurance.

Investors may drive adoption of some BC investment, or may actively discourage it. I don't know, I haven't been in those board meetings and can argue both ways on it.

Which one do I prefer?

Honestly, I can work for either style. Lean OSS means a steep learning curve and a strong incentive to become a deep-dive troubleshooter of the platform, which I like to be. Insured means someone has my back if I can't figure it out myself, and I'll learn from watching them solve the problem. I'm easy that way.

The evil genius of OSv

One of the talks here at LISA13 was about a new cloud-optimized operating system called OSv. This is a new thing, and I hadn't heard of it before. Why do we need yet another OS? And one that doesn't even run a Linux kernel? I was frowning through the talk until I got to this slide:


That's the point when I said:

Holy shit! They've built a 64-bit NetWare!

  • Cooperative multi-tasking? Check!
  • A shared memory space? Check!
  • Everything runs in Ring 0? Check!

There were a few other things that made the parallel even more clear to me, but this is a stunning display of evil genius. Even though Novell tried for ten years to promote NetWare as a perfectly legitimate general purpose server for application serving, it never really took off. There were several reasons for this (not exhaustive):

  • It was a pain to develop for. The NLM model never got anything approaching wide-spread adoption so you had to get everything just right.
  • The shared memory space meant that the OS allowed you to stomp all over other processes running on the system, something that other OSs (Windows, Linux) don't allow.
  • If something did manage to wiggle out of the app and into the kernel, it had free rein (though in practice all it did was abend the server; writing exploits is subject to the first bullet-point problem).
  • It didn't have any concept of forking, just threads. Which changed the multi-processing paradigm from what it was on most other platforms and made porting software to it a pain.
  • There were no significant user-space utilities (grep/sed/awk/bash), though they did get some of that well after they'd lost the battle.

All of these made NetWare a challenging platform to develop for, and challenging platforms don't get developed for. Novell tried to further encourage people to develop for it by getting the Java JVM ported to NetWare so people could run Java apps on it. Few did, though it was quite possible; search for "netstorage" on this blog to get one such application that saw a lot of use.

Have I mentioned that OSv's first release ships with a JVM on it?

The Evil Genius part is that they're not wrong, things really do run faster when you write a kernel like that and run things in the same memory space as the kernel. I got pretty nice scaling with Apache when I was running it on NetWare.

The Evil Genius part is that they're designing this system to be a single-app system, not a general purpose system like NetWare was supposed to be. It runs a JVM, and that's it. The JVM can only stomp on itself and the kernel, and apps can stomp on each other within the limits of the JVM.

The Evil Genius part is that if it does fall over, it's designed to be flushed and a fresh copy spun up in its place. Disposable servers! NetWare servers of old were bastion hosts that Shall Never Go Down. OSv? Not the same thing at all.

The Evil Genius part is that they're doing this in an era where a system like this can actually succeed.

The Evil Genius part is that everyone looks at what they're doing and goes, "...uh HUH. Riiiight. Like that's a good idea." And, like evil geniuses of the past, they will go unrecognized and slink off to some dark corner somewhere to cackle and dream of world domination that will never happen.

Anyone taking DevOps to heart should read about Normal Accidents. The book is about failure modes of nuclear power plants: highly automated and extremely instrumented things that still manage to fail in spite of everything we do. The lessons carry well into the highly automated environments we try to build in our distributed systems.

There are a couple of key lessons to take from this book and theory:

  • Root cause can be something seemingly completely unrelated to the actual problem.
  • Contributing causes can sneak in and make what would be a well handled event into something that gets you bad press.
  • Monitoring instrumentation failures can be sneaky contributing causes.
  • Single-failure events are easily handled, and may be invisible.
  • Multiple-failure events are much harder to handle.
  • Multiple-failure events can take months to show up if the individual failures happened over the course of months and were invisible.

The book had a failure mode much like this one:

After analysis, it was known that the flow direction of a specific coolant pipe was a critical item. If backflow occurred, hot fluid could enter areas not designed for handling it. As a result, a system was put in place to monitor flow direction, and automation put in place to close a valve on the pipe if backflow was detected.

After analyzing the entire system after a major event, it was discovered that the flow-sensor had correctly identified backflow, and had activated the valve close automation. However, it was also discovered that the valve had frozen open due to corrosion several months prior to the event. Additionally, the actuator had broken when the solenoid moved to close the valve. As a result, the valve was reported closed, and showed as such on the Operator panel, when in fact it was open.

  • The valve had been subjected to manual examination 9 months before the event, and was due to be checked again in 3 more months. However, it had failed between checks.
  • The actuator system was checked monthly and had passed every check. The actuator breakage happened during one of these monthly checks.
  • The sensor on the actuator was monitoring power draw for the actuator. If the valve was frozen, the actuator should notice an above-normal current draw. However, as the actuator arm was disconnected from the valve it experienced a below-normal current draw and did not detect this as an alarm condition.
  • The breaking of the actuator arm was noted in the maintenance report during the monthly check as a "brief flicker of the lamp" and put down as a 'blip'. The arm failed before the current meter triggered its event. As the system passed later tests, the event was disregarded.
  • The backflow sensor actually installed was not directional. It alarmed on zero-flow, not negative-flow.
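The sensor mismatch in that last bullet is easy to show in code. This is my own sketch, not anything from the book:

```python
def zero_flow_alarm(flow):
    """The sensor as installed: alarms only when flow stops entirely."""
    return flow == 0

def directional_alarm(flow):
    """The sensor as specified: negative (reversed) flow alarms."""
    return flow < 0

backflow = -3.0  # hot fluid flowing the wrong way through the pipe

installed = zero_flow_alarm(backflow)    # False: no alarm raised
specified = directional_alarm(backflow)  # True: alarm raised
```

Two sensors, both "monitoring flow", with opposite answers to the one question that mattered. That's exactly the kind of quiet contributing cause Normal Accidents is about.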

The remediation plan that came out of the post-mortem:
  • Instrument the valve itself for open/close state.
  • Introduce new logic so that if the backflow sensor continues to detect backflow, raise alarms.
  • Replace the backflow sensor with a directional one as originally called for.
  • Add a new flow sensor behind the valve.
  • Change the alerting on the actuator sensor to alarm on too-low current draw as well.
  • Increase the frequency of visual inspection of the physical plant.

That valve being open caused Fun Times To Be Had. If that valve system had been operating correctly, the fault that caused the backflow would have been isolated as the system designers intended and the overall damage contained. However, this contributing cause, one that happened months before the triggering event, turned a minor problem into a major one.

So, why did that reactor release radioactive materials into the environment? Well, it's complicated...

And yet, after reading the post-mortem report you look at what actually failed and think, 'and these are the jokers running our nuclear power plants? We're lucky we're not all glowing in the dark!'

We get the same kind of fault-trees in massively automated distributed systems. Take this entirely fictional, but oh-so-plausible failure cascade:

ExampleCorp was notified by their datacenter provider of the need for emergency power maintenance in their primary datacenter. ExampleCorp (EC) operated a backup datacenter and had implemented a hot failover method, tested twice a year, for moving production to the backup facility. EC elected to perform a hot failover to the backup facility prior to the power work in their primary facility.

Shortly after the failover completed the backup facility crashed hard. Automation attempted to fail back to the primary facility, but technicians at the primary facility had already begun, but not yet completed, safe-shutdown procedures. As a result, the fail-back was interrupted part way through, and production stopped cold.

Service recovery happened at the primary site after power maintenance completed. However, the cold-start script was out of date by over a year so restoration was hampered by differences that came up during the startup process.

Analysis after the fact isolated several causes of the extensive downtime:

  • In the time since the last hot-failover test, EC had deployed new three-node management clusters for their network switch configuration and software management system, one cluster per site.
  • The EC-built DNS synchronization script used to keep the backup and primary sites in sync was transaction oriented. A network fault 5 weeks ago meant the transactions related to the DNS update for the cluster deployment were dropped and not noticed.
  • The old three-node clusters were kept online "just in case".
  • The difference in cluster software versions between the two sites was displayed in EC's monitoring panel, but was not alarmed on, and was disregarded as a 'glitch' by Operations. Interviews show that Ops staff are aware that the monitoring system will sometimes hold onto stale data if it isn't part of an alarm.
  • At the time of the cluster migration Operations was testing a new switch firmware image. The image on the old cluster was determined to have a critical loading bug, which required attention from the switch vendor.
  • Two weeks prior to the event EC performed an update of switch firmware using new code that passed validation. The new firmware was replicated to all cluster members in both sites using automation based on the IP addresses of the cluster members. The old cluster members were not updated.
  • The automation driving the switch firmware update relied on the non-synchronized DNS entries, and reported no problems applying updates. The primary site got the known-good firmware, the backup site got the known-bad firmware.
  • The hot-failover network load triggered the fault in the backup site's switch firmware, causing switches to reboot every 5 minutes.
  • Recovery logic in the application attempted to work around the massive network faults and ended up duplicating some database transactions, and losing others. Some corrupted data was transferred to the primary site before it was fully shut down.
  • Lack of technical personnel physically at the backup site hampered recovery from the backup site and extended the outage.
  • Out-of-date documentation hampered efforts to restart services from a cold stop.
  • The inconsistent state of the databases further delayed recovery.

That is a terrible-horrible-no-good-very-bad day, yes indeed. But it shows what I'm talking about here. Several small errors crept in and turned what was supposed to be a perfectly handleable fault into many hours of downtime. This fault would have been discovered during the next routine test, but that hadn't happened yet.

Just like the nuke-plant failure, reading this list makes you go "what kind of cowboy outfit allows this kind of thing to happen?"

Or maybe, if it has happened to you, "Oh crimeny, I've so been there. Here's hoping I retire before it happens again."

It happens to us all. Netflix reduces this risk with Chaos Monkey, using it to visibly trigger these small failures before they can cascade into big ones. And yet even they fall over when a really big failure happens naturally.

What can you do?

  • Accept that the multiple-failure combinatorics problem is infinite and you won't be able to capture every fail case.
  • Build your system to be as disaster resilient as possible.
  • Test your remediations, and do so regularly.
  • Validate your instrumentation is returning good results, and do so regularly.
  • Cross-check where possible.
  • Investigate glitches, and keep doing it after it gets tediously boring.
  • Cause small failures and force your system to respond to them.
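That last point is the Chaos Monkey idea, and the core of it fits in a few lines. This is an illustrative toy, not Netflix's implementation:

```python
import random

def chaos_call(fn, failure_rate=0.2, rng=None):
    """Wrap a dependency call and randomly refuse some fraction of
    them, Chaos Monkey style, so failure handling gets exercised
    before a real outage does it for you."""
    rng = rng or random.Random()
    if rng.random() < failure_rate:
        raise ConnectionError("injected failure")
    return fn()

def resilient_fetch(fetch, fallback="cached", attempts=3, **kw):
    """A caller that survives injected failures via retry + fallback.
    If it can't, you've found a fail case before production did."""
    for _ in range(attempts):
        try:
            return chaos_call(fetch, **kw)
        except ConnectionError:
            continue
    return fallback
```

The value isn't the injector, which is trivial; it's the organizational habit of running it against live systems and treating every surfaced weakness as a bug.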

These are all known best-practices, and yet people are lazy, or can't get sufficient management buy-in to do it (a 'minimum viable product' is likely excessively vulnerable to this kind of thing). We do what we can, snark at those who visibly can't, and hope our turn doesn't come up.

Perhaps you've seen this error:

Version mismatch with VMCI driver: expecting 11, got 10.

I get this every time I upgrade a kernel, and this is how I fix it.

The cloud will happen

Like many olde tyme sysadmins, I look at 'cloud' and shake my head. It's just virtualization the way we've always been doing it, but with yet another abstraction layer on top to automate deploying certain kinds of instances really fast.

However... it's still new to a lot of entities. The concept of an outsourced virtualization plant is very new. For entities that use compliance audits for certain kinds of vendors it is most definitely causing something of a quandary. How much data-assurance do you mandate for such suppliers? What kind of 3rd party audits do you mandate they pass? Lots of questions.

Over on 3 Geeks and a Law Blog, they recently covered this dynamic in a post titled The Inevitable Cloud as it relates to the legal field. In many ways, the legal field shares information-handling requirements with the health-care field, though we don't have HIPAA. We handle highly sensitive information, and who had access to what, when, and what they did with it can be extremely relevant details (it's called spoliation). Because of this, certain firms are very reluctant to go for cloud solutions.

Some of their concerns:

  • Who at the outsourcer has access to the data?
  • What controls exist to document what such people did with the data?
  • What guarantees are in place to ensure that any modification is both detectable and auditable?

For an entity like Amazon AWS (a.k.a. Faceless Megacorp) the answer to the first may not be answerable without lots of NDAs being signed. The answers to the second may not even be given by Amazon unless the contract is really big. The answers to the third? How about this nice third-party audit report we have...

The pet disaster for such compliance officers is a user with elevated access deciding to get curious and exploiting a maintenance-only access method to directly access data files or network streams. The ability of an entity to respond to such fears to satisfaction means they can win some big contracts.

However, the costs of such systems are rather high; and as the 3 Geeks point out, not all revenue is profit-making. Firms that insist on end-to-end transport-mode IPSec and universally encrypted local storage all with end-user-only key storage are going to find fewer and fewer entities willing to play ball. A compromise will be made.

However, at the other end of the spectrum you have the 3 person law offices of the world and there are a lot more of them out there. These are offices who don't have enough people to bother with a Compliance Officer. They may very well be using dropbox to share files with each other (though possibly TrueCrypted), and are practically guaranteed to be using outsourced email of some kind. These are the firms that are going into the cloud first, pretty much by default. The rest of the market will follow along, though at a remove of some years.

Exciting times.

Charging by the hour, a story of clouds

Question: When an IaaS cloud provider charges per hour for a machine, what's it an hour of? Do I get charged when it's doing nothing? If so, why is that fair?

All the IaaS cloud providers I've run into (which isn't all of them by any stretch) charge by the running hour. If that micro-mini instance is doing nothing but emailing the contents of a single file once a day, it'll still get charged for 24 hours of activity if left on. The same goes for a gargantuGPU instance doing the same work, it'll just cost more to do nothing.
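The by-the-running-hour model can be sketched in a few lines. The rates and instance sizes below are made-up illustrative numbers, not any provider's actual pricing:

```python
# Hedged sketch of per-running-hour IaaS billing: an always-on instance
# is billed for every hour it runs, busy or idle. Rates are hypothetical.
HOURS_PER_DAY = 24

def monthly_cost(rate_per_hour: float, days: int = 30) -> float:
    """Cost of leaving an instance running for a whole month."""
    return rate_per_hour * HOURS_PER_DAY * days

print(monthly_cost(0.02))  # a cheap micro instance, always on
print(monthly_cost(2.10))  # a big GPU instance doing the same nothing
```

Note that the work the instance does never enters the formula; only the size of the instance (via its hourly rate) and the hours it exists.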

Why is that fair?

Because of resources.

The host machine running all of these virtual machines has many resources: CPU, memory, disk, network, the usual suspects. These resources have to be shared between all the virtual machines. Let's take a look at each and see how easy that is.


CPU

To share CPU between VMs, the host has to be able to share execution between them. Much like we do... well, practically everywhere now. We've been doing multiprocess operating systems for a while. Sharing CPU cycles is dead easy: if a process needs a lot, it gets what's available; if it needs none, it gets none. A thousand processes all doing nothing causes... nothing to happen! It's perhaps the easiest thing to share. But, we'll see.


Memory

We've been sharing RAM between processes, with good isolation even, for some time now. Even Apple has joined that game to great effect. Unlike CPU, processes sit on RAM the entire time they're running. It may be swapped out by the OS, but it's still accounted for.


Disk

Disk? Disk is easy. It's just files. Each file takes what it needs, and grows as needed until you run out, at which point you have problems. Each VM uses disk to store its files, as you'd expect.


Network

To share network, the host machine has to proxy network connections from a VM. Which... it kinda already does for normal OS processes, like, say, Apache or MySQL. If a process doesn't need any network resources, none get used. If it needs some, it uses up to what's available. A thousand processes all doing nothing uses no network resources. Same for VMs, really. It's right up there with CPU for ease of sharing.

Now ask yourself. Of these four major resources, which of them are always consumed when a VM (or if you rather, a process) is running?

If you said "memory and disk" you've been paying attention.

If you said "all but network, and maybe even that too", you've been auditing this answer for technical accuracy and probably noticed a few (gross) simplifications so far. Please bear with me!

Now of the two always-consumed resources, memory and disk, which is going to be the more constrained one?

If you look at it from the old memory hierarchy chart based on "how long does the CPU have to wait if it needs to get data from a specific location", you can begin to see a glimmer of the answer here. This is usually measured in CPU cycles spent waiting for data. The lower down the chart you get (faster) the more expensive the storage. A 2.5GHz CPU will have 2.5 billion cycles in a second. Remember that number.

A 7.2K RPM hard-drive, the type you can get in 1TB sizes for cheap, has a retrieval latency of 8.9 milliseconds. Which means that, best-case, the 2.5GHz CPU will wait 22,250,000 cycles before it gets the data it needs. That's... really slow, actually.

The RAM in that 2.5GHz server can be fetched in 10 nanoseconds. Which means that best-case the 2.5GHz CPU will wait only... 25 cycles.
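The arithmetic behind those two numbers is just latency times clock rate, using the post's figures (a 2.5GHz CPU, 8.9ms disk retrieval, 10ns RAM fetch):

```python
# Back-of-envelope: CPU cycles burned waiting on each storage tier.
CPU_HZ = 2.5e9  # a 2.5GHz CPU: 2.5 billion cycles per second

def wait_cycles(latency_seconds: float) -> int:
    """Cycles a CPU at CPU_HZ spends waiting for data at this latency."""
    return round(CPU_HZ * latency_seconds)

print(wait_cycles(8.9e-3))  # 7.2K RPM disk: 22250000 cycles
print(wait_cycles(10e-9))   # RAM fetch: 25 cycles
```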

Biiiiiiig differences there! RAM is vastly faster, which means it's also vastly more expensive[1]. Which in turn means that RAM is going to be the more constrained resource.

So we have determined that of the four resource types, RAM is the most expensive always-on resource. Because of that, RAM amount is the biggest driver of cost for cloud-computing providers. It's not CPU. This is why that 64MB-RAM VM is so much cheaper per-hour than something with 1.6GB in it, even if they get the same CPU resources.

Because RAM amount used is the cost-center, and a 1.6GB VM is using that 1.6GB of RAM all the time, the cloud providers charge by hour of run-time. And this is fair. Now you know.

[1]: How much more expensive? A 1TB disk can be had for $90. 1 TB of RAM requires a special machine (higher end servers), and will run you a bit under $12,000 at today's prices.
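Putting that footnote's price gap in per-terabyte terms (the $90 and $12,000 figures are the post's own, era-specific prices):

```python
# Rough price gap between disk and RAM, per terabyte, using the
# footnote's figures: $90 for a 1TB disk, ~$12,000 for 1TB of server RAM.
disk_per_tb = 90
ram_per_tb = 12_000

ratio = ram_per_tb / disk_per_tb
print(f"RAM costs roughly {ratio:.0f}x what disk does per byte")
```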

Change-automation vs. LazyCoder

The lazy coder is someone who sees a need to write code, but doesn't because it's too much work. This describes a lot of sysadmins, as it happens. It also describes software engineers looking at an unfamiliar language. Part of the lazy coder is definitely a disinclination to write something in a language they're not that familiar with; part of it is a plain disinclination to work.

It has been said in DevOps circles (though I can't hunt up the reference):
A good sysadmin can probably earn a living as a software engineer, though they choose not to.
A sentiment close to my heart, as that definitely applies to me. I have that CompSci degree (before software engineering degrees were common, CSci was the degree-of-choice for the enterprising dot-com boom programmer) that says I can code. And yet, when I hit the workplace I tacked as close to systems administration as I could. And like many sysadmins of my age cohort or older, I managed to avoid writing code for a very large part of my career.

I could do it as needed, as proven by a few rather complex scripts I cobbled together over that time. But I didn't go into full-time code-writing because of the side-effects on my quality of life. In my regular day-to-day life, problems came and went generally on the same day or within a couple days of introduction. When I was heads down in front of an IDE, the problem took weeks to smash, and I was angry most of the time. I didn't like being cranky that long, so I avoided long coding projects.

Problems are supposed to be resolved quickly, damnit.

Sysadmins also tend to be rather short of attention-span because there is always something new going on. Variety. It's what keeps some of us going. But being heads down in front of a wall of text? The only thing that changes is what aggravating bit of code is aggravating me right now[1]. Not variety.

So you take someone with that particular background and throw them into a modern-age scaled system. Such a system has a few characteristics:

  • It's likely cloud-based[2], so hardware engineering is no longer on the table.
  • It's likely cloud-based[2], so deploying new machines can be done from a GUI, or an API. And probably won't involve actual OS install tasks, just OS config tasks.
  • There are likely to be a lot of the same kind of machine about.

And they have a problem. This problem becomes glaringly obvious when they're told to apply one specific change to triple-digits of virtual machines. Even the laziest of lazy coders will think to themselves:

Guh, there has got to be a better way than just doing it all by hand. There's only one of me.
If they're a Windows admin and the class of machines are all in AD as they should be, they'll cheer and reach for a Group Policy Object. All done!

But if whatever needs changing isn't really doable via GPO, or requires a reboot to apply? Then... PowerShell starts looming[3].

If they're a *nix admin, the problem will definitely involve rolling some custom scripting.
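That custom scripting usually looks like a loop over a host list. A minimal sketch, in Python for readability (the host names and the remote command are hypothetical, and a real script would want error handling and parallelism):

```python
# Fan one shell command out to a list of hosts over ssh.
# Hosts and the command here are made-up examples.
import subprocess

def build_ssh_command(host: str, remote_cmd: str) -> list[str]:
    """Assemble the argv we'd hand to subprocess for one host."""
    return ["ssh", host, remote_cmd]

def run_everywhere(hosts: list[str], remote_cmd: str, dry_run: bool = True):
    for host in hosts:
        argv = build_ssh_command(host, remote_cmd)
        if dry_run:
            print(" ".join(argv))  # show what would run
        else:
            subprocess.run(argv, check=True)

run_everywhere(["web01", "web02", "web03"],
               "sudo sysctl -w vm.swappiness=10")
```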

Or maybe, instead, a configuration management engine like Puppet, CFEngine, Chef, or the like. Maybe the environment already has something like that, but the admin hasn't gone there since it's new to them and they didn't have time to learn the domain-specific language used by the management engine. Well, with triple digits of machines to update, learning that DSL is starting to look like a good idea.

Code-writing is getting hard to avoid, even for sysadmin hold-outs. Especially now that Microsoft is starting to Strongly Encourage systems engineers to use automation tools to manage their infrastructures.

This changing environment is forcing the lazy coder to overcome the migration threshold needed to actually bother learning a new programming language (or better learning one they already kinda-sorta know). Sysadmins who really don't like to write code will move elsewhere, to jobs where hardware and OS install/config are still a large part of the job.

One of the key things that changes once the idea of a programmable environment starts really setting in is the workflow of applying a fix. For smaller infrastructures that do have some automation, I frequently see this cascade:

  1. Apply the fix.
  2. Automate the fix.

Figure out what you need to do, apply it to a few production systems to make sure it works, then put it into the automation once you're sure of the steps. Or worse, apply the fix everywhere by hand, and automate it only so that new systems have it. However, for a fully programmable environment, this is backwards. It really should be:

  1. Automate the fix
  2. Apply the fix.

Because you'll get a much more consistent application of the fix this way. The older way will leave a few systems with slight differences of application: maybe config-files are ordered differently, or maybe the case used in a config file differs from the others. Small differences, but they can really add up. This transition is a very good thing to have happen.
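The key property of an automated-first fix is idempotence: run it twice and the second run changes nothing, so every machine converges to the same state. A minimal sketch of one such step (the file path and setting are hypothetical, and real config engines do far more than this):

```python
# Idempotent "ensure this config line exists" step. Running it twice
# leaves the file identical to running it once.
import tempfile
from pathlib import Path

def ensure_line(path: Path, line: str) -> bool:
    """Append `line` to `path` only if missing. Returns True if changed."""
    text = path.read_text() if path.exists() else ""
    if line in text.splitlines():
        return False  # already applied: nothing to do
    with path.open("a") as f:
        if text and not text.endswith("\n"):
            f.write("\n")
        f.write(line + "\n")
    return True

# First call writes the line, second is a no-op.
conf = Path(tempfile.mkdtemp()) / "app.conf"
print(ensure_line(conf, "max_conns = 512"))  # True
print(ensure_line(conf, "max_conns = 512"))  # False
```

Hand-applied fixes lack exactly this property, which is where the slight per-machine differences creep in.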

The nice thing about Lazy Coders is that once they've learned the new thing they've been avoiding, they tend to stop being lazy about it. Once that DSL for Puppet has been learned, the idea of amending an existing module to fix a problem becomes something you just do. They've passed the migration threshold, and are now in a new state.

This workflow-transition is beginning to happen in my workplace, and it cheers me.

[1]: As Obi-Wan said, It all depends on your point of view. To an actual Software Engineer, this is not the same problem coming back to thwart me, it's all different problems. Variety! It's what keeps them going.
[2]: Or if you're like that, a heavily virtualized environment that may or may not belong to the company you're working for. So there just might be some hardware engineering going on, but not as much as there used to be. Sixteen big boxes with a half TB of RAM each is a much easier-to-maintain physical fleet than the old infrastructure with 80 physical boxes of mostly different spec.
[3]: Though if they're a certain kind of Windows admin who has had to reach for programming in the past, they'll reach instead for VBScript; Powershell being too new, they haven't bothered to learn it yet.