Terraforming in prod

Terraform from HashiCorp is something we've been using in prod for a while now. Simply ages in terraform-years, which means we have some experience with it.

It also means we have some seriously embedded legacy problems, even though it's less than two years old. That's the problem with rapidly iterating infrastructure projects that don't build in production use-cases from the very start. You see, a screwdriver is useful in production! It turns screws and occasionally opens paint cans. You'd think that would be enough. But production screwdrivers conform to external standards for screwdrivers, are checked in and out of central tooling (because quality control saves everyone from screwdriver-related injuries), and have support for either outright replacement or refacing of the toolface.

Charity.wtf has a great screed on this you should read. But I wanted to share how we use it, and our pains.

In the beginning

I did our initial Terraform work in the Terraform 0.6 days. The move to production happened at about the time 0.7 came out. You can see how painfully long ago that was (or wasn't). It was a different product back then.

Terraform Modules were pretty new back when I did the initial build. I tried them, but couldn't get them to work right. At the time I told my coworkers:

They seem to work like puppet includes, not puppet defines. I need them to be defines, so I'm not using them.

I don't know if I had a fundamental misunderstanding back then or if that's how they really worked. But they're defines now, and all correctly formatted TF infrastructures use them or are seen as terribly unstylish. Not using them means there is a lot of repeating-myself in our infrastructure.
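For the unfamiliar, here is roughly what the define-like pattern looks like; the module and all of its names are hypothetical, not our actual tree:

    # modules/service-sg/main.tf -- a module that behaves like a Puppet define
    variable "name" {}
    variable "vpc_id" {}

    resource "aws_security_group" "this" {
      name   = "${var.name}"
      vpc_id = "${var.vpc_id}"
    }

    # Instantiated once per service, like a define:
    module "gitlab_sg" {
      source = "./modules/service-sg"
      name   = "prod_gitlab"
      vpc_id = "${aws_vpc.prod.id}"
    }

    module "jenkins_sg" {
      source = "./modules/service-sg"
      name   = "prod_jenkins"
      vpc_id = "${aws_vpc.prod.id}"
    }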

Because we already had an AMI-baking pipeline that worked pretty well, we never bothered with Terraform Provisioners. We build ours entirely on making the AWS assets versionable. We tried with CloudFormation, but gave that up due to the terrible horrible no good very bad edge cases that break iterations. Really, if you have to write Support to unfuck an infrastructure because CF can't figure out the backout plan (and that backout is obvious to you), then you have a broken product. When Terraform gets stuck, it just throws up its hands and says HALP! Which is fine by us.
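In practice, "versionable assets, no provisioners" means every release is a freshly baked AMI and Terraform only records which one is live. A hedged sketch, with invented names:

    # The bake pipeline produces a new AMI per release; we bump this variable.
    variable "gitlab_ami" {}

    resource "aws_launch_configuration" "gitlab" {
      name_prefix   = "prod-gitlab-"
      image_id      = "${var.gitlab_ami}"
      instance_type = "m4.large"

      # Launch configurations are immutable in AWS, so build the new one
      # before tearing down the old.
      lifecycle {
        create_before_destroy = true
      }
    }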

Charity asked a question in that blog-post:

Lots of people seem to eventually end up wrapping terraform with a script.  Why?

I wrote it for two big, big reasons.

  1. In TF 0.6, there was zero support for sharing a Terraform statefile between multiple people (without paying for Atlas), and that critically needs to be done. So my wrapper implemented the sharing layer. Terraform now supports several methods for this out of the box; it didn't back then.
  2. Terraform is a fucking foot-gun with a flimsy safety on the commit-career-suicide button. It's called 'terraform destroy', and it has no business being enabled for any reason in a production environment, ever. My wrapper makes getting at this deadly command require a minute or two of intentionally circumventing the safety mechanism (a sketch of which follows this list). Which is a damned sight better than the routine "Are you sure? y/n" prompt we're all conditioned to just click past. Of course I'm in the right directory! Yes!
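To give a flavor of those safety mechanisms, here is a minimal sketch of such a wrapper; the sentinel-file mechanic is illustrative, not our actual script:

    #!/bin/bash
    # Hypothetical guard: 'destroy' is refused unless a sentinel file was
    # touched, by hand, within the last five minutes.
    SENTINEL=/tmp/tf-destroy-i-really-mean-it

    if [ "$1" = "destroy" ]; then
      # find prints the sentinel only if it exists and is under 5 minutes old
      if [ -z "$(find "$SENTINEL" -mmin -5 2>/dev/null)" ]; then
        echo "destroy is disabled. touch $SENTINEL and re-run within 5 minutes." >&2
        exit 1
      fi
      rm -f "$SENTINEL"
    fi

    exec terraform "$@"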

And then there was legacy.

We're still using that wrapper-script. Partly because reimplementing it for the built-in statefile sharing is, like, work, and what we have is working. But also because I need those welded-on fire-stops on the foot-gun.

But we're not using modules, and really should be. However, integrating modules is a long and laborious process, and we haven't seen enough benefit to outweigh the risk. To explain why, I need to explain a bit about how Terraform works.

You define resources in Terraform, like a security group with rules. When you do an 'apply', Terraform checks the statefile to see if the resource has been created yet, and in what state it was last seen. It then compares that last known state with the current state of the infrastructure to determine what changes need to be made. Pretty simple.

The name of a resource in the statefile follows a clear format. For non-module resources the string is "resource_type.resource_name", so our security group example would be "aws_security_group.prod_gitlab". For module resources it gains a prefix, "module.module_name.resource_type.resource_name", so the same group inside a module named "gitlab" becomes "module.gitlab.aws_security_group.prod_gitlab". Definitely not bulk-sed friendly.

If you change the name of a resource, Terraform's diff shows the old resource disappearing and a brand new one appearing, and treats it as such. Sometimes this is what you want. Other times, like when they're your production load-balancers and delete-and-recreate means a multi-minute outage, you don't.
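To make the naming concrete, here is a minimal sketch using that running example:

    # Plain resource -- statefile address: aws_security_group.prod_gitlab
    resource "aws_security_group" "prod_gitlab" {
      name = "prod_gitlab"
    }

    # The same resource declared inside a module named "gitlab" gets the
    # statefile address: module.gitlab.aws_security_group.prod_gitlab
    module "gitlab" {
      source = "./modules/gitlab"
    }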

To do a module conversion, this is the general workflow (a command-level sketch follows the list):

  1. Add the module and make your configuration changes, but don't apply them yet.
  2. Use 'terraform state list' to get a list of the resources in your statefile, and note the names of the resources to be moved into modules.
  3. Use 'terraform state rm' to remove the old resources from the statefile.
  4. Use 'terraform import' to import the existing resources into the statefile under their new module-based names.
  5. Use 'terraform plan' to make sure there are zero changes.
  6. Commit your changes to the terraform repo and apply.
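Here's what steps 2 through 4 look like at the command line for a single security group moving into a hypothetical "gitlab" module; the addresses and the sg- ID are invented for illustration. Newer Terraform can also collapse steps 3 and 4 into a single 'terraform state mv', which skips the re-import.

    # Step 2: find the old address in the statefile
    terraform state list | grep prod_gitlab
    #   aws_security_group.prod_gitlab

    # Step 3: drop the old address from the statefile (the real AWS
    # resource is untouched)
    terraform state rm aws_security_group.prod_gitlab

    # Step 4: re-adopt the same AWS resource under its module-based address
    terraform import module.gitlab.aws_security_group.prod_gitlab sg-0123abcd

    # Step 5: this should now report zero changes
    terraform plan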

Seems easy, except:

  • You need to lock the statefile so no one else can make changes when you're doing this. Critically important if you have automation that does terraform actions.
  • This lock could last a couple of hours depending on how many resources need to be modified.
  • This assumes you know how terraform statefiles work with resource naming, so you need someone experienced with terraform to do this work.
  • Your modules may set some things subtly differently than you did before, so it may not be a completely null change.
  • Some resources, like Application Load Balancers, require a heartbreaking number of resources to define, which makes for a lot of import work.
  • Not all resources even have import support yet. Those resources will have to be deleted and recreated.
  • Step 1 is much larger than you think, due to dependencies from other resources that will need updating for the new names. Which means you may revisit step 1 a few times before you get a passing step 5.
  • This requires a local working Terraform setup, outside of your wrapper scripts. If your wrapper is a chatbot and no one has a local TF setup, this will need to be done on the chatbot instance. The fact of the matter is that you'll have to point the foot-gun at your feet for a while when you do this.
  • This is not a change that can be packaged up in Terraform's change-management-friendly plan-file format for review, so it will be a 'comprehensive' change-request when it comes.

Try coding that into a change-request that will pass audit muster. In theory it is possible to write a bash script that performs the needed statefile changes automatically, but it would be incredibly fragile in the face of other changes to the statefile as the CR works its way through the process. This is why we haven't converted to a more stylish infrastructure; the intellectual purity of being stylish doesn't yet outweigh the need to not break prod.

What it's good for

Charity's opinion is close to my own:

Terraform is fantastic for defining the bones of your infrastructure.  Your networking, your NAT, autoscaling groups, the bits that are robust and rarely change.  Or spinning up replicas of production on every changeset via Travis-CI or Jenkins -- yay!  Do that!

But I would not feel safe making TF changes to production every day.  And you should delegate any kind of reactive scaling to ASGs or containers+scheduler or whatever.  I would never want terraform to interfere with those decisions on some arbitrary future run.

Yes. Terraform is best used in cases where doing an apply won't cause immediate outages or instabilities. Even using it the way we are, without provisioners, means following some rules (a sketch of the ASG pattern follows the list):

  • Only define 'aws_instance' resources if we're fine with those suddenly disappearing and not coming back for a couple of minutes. Because if you change the AMI, or the userdata, or any number of other details, Terraform will terminate the existing one and make a new one.
    • Instead, use autoscaling-groups and a process outside of Terraform to manage the instance rotations.
  • It's fine to encode scheduled-scaling events on autoscaling groups, and even dynamic-scaling triggers on them.
  • Rotating instances in an autoscaling-group is best done in automation outside of terraform.
  • Playing pass-the-IP with Elastic-IP addresses is buggy and may require a few 'applies' before the addresses fully move to the new instances.
  • Cache-invalidation on the global Internet's DNS caches is still buggy as fuck, though getting better. Plan around that.
  • Making some changes may require multiple phases. That's fine, plan for that.
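Here's the ASG pattern those rules add up to, as a hedged sketch with invented names and numbers. The ignore_changes bit is what keeps a future apply from fighting whatever the dynamic scaler decided:

    resource "aws_autoscaling_group" "gitlab" {
      name                 = "prod-gitlab"
      launch_configuration = "${aws_launch_configuration.gitlab.name}"
      vpc_zone_identifier  = ["subnet-aaaa1111", "subnet-bbbb2222"]
      min_size             = 2
      max_size             = 10
      desired_capacity     = 2

      # Let scaling policies own desired_capacity after creation; Terraform
      # won't try to "correct" it on later applies.
      lifecycle {
        ignore_changes = ["desired_capacity"]
      }
    }

    # Scheduled scaling encoded in Terraform is fine:
    resource "aws_autoscaling_schedule" "gitlab_business_hours" {
      scheduled_action_name  = "business-hours-scale-up"
      autoscaling_group_name = "${aws_autoscaling_group.gitlab.name}"
      recurrence             = "0 8 * * MON-FRI"
      min_size               = 4
      max_size               = 10
      desired_capacity       = 4
    }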

The biggest strength of Terraform is that it looks a lot like Puppet, but for your AWS config. Our auditors immediately grasped that concept and embraced it like they'd known about Terraform forever. Because if some engineer cowboys in a change outside of the CR process, Terraform will back it out the next time someone does an apply, much the way Puppet backs out a change to a file it manages. That's incredibly powerful, and something CloudFormation only sort of does.

The next biggest strength is that it is being very actively maintained and tracks AWS API changes pretty closely. When Amazon announces a new service, Terraform will generally have support for it within a month (not always, but most of the time). If the aws-cli can do it, Terraform will also be able to do it; if not now, then very soon.

While there are some patterns it won't let you express directly, like two security-groups pointing to each other in their ingress/egress lists (that's a dependency loop), there is huge scope in what it will let you do.
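For that particular loop, the usual workaround (sketched here with hypothetical names) is to declare both groups without inline rules and attach the cross-references as standalone rule resources, which breaks the cycle:

    resource "aws_security_group" "app" {
      name = "prod_app"
    }

    resource "aws_security_group" "db" {
      name = "prod_db"
    }

    # The cross-reference lives in its own resource, so neither group
    # depends on the other at creation time. The mirror egress rule is
    # declared the same way.
    resource "aws_security_group_rule" "app_to_db" {
      type                     = "ingress"
      from_port                = 5432
      to_port                  = 5432
      protocol                 = "tcp"
      security_group_id        = "${aws_security_group.db.id}"
      source_security_group_id = "${aws_security_group.app.id}"
    }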

This is a good tool and I plan to keep using it. Eventually we'll do a module conversion somewhere, but that may wait until they have a better workflow for it. Which may be in a month, or half a year. This project is moving fast.