Scrapbag of useful Terraform tips

After some kinda ranty posts about terraform and AWS and networking and orchestration life in general, it feels like a good time to braindump some helpful tidbits.

 


Side note — a few people have asked me to open source my terraform config.  I’m actually super open to sharing it, but a) it’s still changing a lot and b) tf modules aren’t really reusable yet.  They just aren’t.  Eventually we’ll reach a maturity point where a tf module library makes sense, but open source tf modules haven’t been super helpful for me and I don’t expect mine will be any better.

I *did* embed a bunch of meaty gists in this post with some of the more interesting configs.  Let me know if you want to see more, I will happily send it to you, I just don’t want to be maintaining an OS repo right now.

So here are a few things that took me minutes, hours, or days to figure out, that hopefully will now take you less time.

ICMP for security groups

If you want your hosts to be pingable, you have to put this stanza in your security group.  The “from_port = 8” isn’t in the security group docs; I found it in this github issue.  For ICMP rules, from_port is actually the ICMP type (8 is echo request) and to_port is the ICMP code — they aren’t real ports at all.  Not being a networking person myself, I literally never would have guessed it.  If you want to read up more, here’s more about why.

  ingress {
    from_port = 8       # ICMP type 8 (echo request)
    to_port = 0         # ICMP code 0
    protocol = "icmp"
    cidr_blocks = ["0.0.0.0/0"]
  }

Here is a gist for my bastion security groups.  Note that all the security groups are in the aws_vpc module, which gets invoked separately by each environment.

Security groups are stackable in VPC, which is glorious.  But most of the time that I thought I was having a problem with VPCs or routes or networking, it turned out to be a security group problem, or an interaction between the two.

Since you have no ability to debug AWS networking via normal linux utilities, my best debugging tip for VPC networking is still: when you get really stuck, open up all the SG ports and see if that fixes it.  (Preferably in, you know, a staging environment, not prod …)
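
For the record, the throwaway “open everything” rule is something like this — a debugging sketch only, with made-up names (bastion_sg standing in for whatever group you’re poking at), and obviously don’t leave it applied:

# temporary wide-open ingress for debugging only -- remove when done
resource "aws_security_group_rule" "debug_allow_all" {
  type = "ingress"
  from_port = 0
  to_port = 0
  protocol = "-1"
  cidr_blocks = ["0.0.0.0/0"]
  security_group_id = "${aws_security_group.bastion_sg.id}"
}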

Resource description fields != comment fields

Do not use your resource description fields as comments about those resources.

It feels like they should be comment strings, doesn’t it?  Well, they aren’t.  If you change your “comment”, terraform will try to destroy and recreate the resource (which may or may not even work, if it’s like a security group that all your environments and other resources happen to inherit.  Hypothetically speaking.)

This isn’t a Hashicorp thing, it’s an AWS thing.  You can’t go edit the description in the console, either — try it!  It’s like a smelly, lingering remnant of the bad old days before we had tags.

So use tags, or use comments in your code.  Don’t use descriptions for documentation.
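
In other words, something like this (resource name and tag are made up):

resource "aws_security_group" "bastion_sg" {
  name = "${var.env}_bastion_sg"
  vpc_id = "${aws_vpc.mod.id}"
  # changing this forces a destroy/recreate, so set it once and leave it alone
  description = "managed by terraform"
  tags {
    # put the human-readable context here (or in code comments), not in description
    Notes = "ssh bastion; see the ops runbook"
  }
}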

Picking VPC ranges

The largest address block you can assign to a VPC is a /16 (around 65k addresses).  Probably don’t start your numbering with 10.0.0.0/16, just in case you ever want to peer with anyone else, who almost certainly started with 10.0.0.0/16 too.
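
Concretely, something like this in the per-environment tfvars files — the CIDRs here are made up, the point is just non-overlapping /16s that aren’t 10.0.0.0/16:

# dev.tfvars
cidr = "10.11.0.0/16"

# staging.tfvars
cidr = "10.12.0.0/16"

# production.tfvars
cidr = "10.13.0.0/16"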

Route tables

Only one route table can be associated with each subnet.  (Again, NONE of your routes will show up in netstat -nr or any of the normal Linux tools, which is fucking infuriating.)

I recommend not using aws_route_table with an inline blob of routes, but instead using aws_route resources.  These are additive resources, so they give you more fine-grained control if you want different environments to have different routing tables.
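
The pattern looks roughly like this (a minimal sketch; it reuses the same resource names as the IGW config further down):

# an empty route table -- no inline route {} blocks
resource "aws_route_table" "public" {
  vpc_id = "${aws_vpc.mod.id}"
  tags {
    Name = "${var.env}_public_route_table"
  }
}

# each route is its own additive resource, so different envs can add different ones
resource "aws_route" "public_gateway_route" {
  route_table_id = "${aws_route_table.public.id}"
  destination_cidr_block = "0.0.0.0/0"
  gateway_id = "${aws_internet_gateway.mod.id}"
}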

Peering VPCs

Peering is so fucking rad.  I’m so, so happy with it.  Peering makes VPC-per-env tractable and flexible and not horribly annoying.

In order to peer VPCs, if you have a separate state file per environment (which you really should), you will need to import remote state.  It’s not very obvious from the documentation, but this is an incredibly powerful feature.  It lets you refer to variables from remote state files just like they were modules.

I use S3 for saving state, with versioning turned on for the bucket.
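
The remote state import looks roughly like this — this is the old resource-style form, and the bucket key is a placeholder — which lets, say, the dev environment read outputs that staging exported:

# pull in the staging environment's state file so we can reference its outputs
resource "terraform_remote_state" "staging" {
  backend = "s3"
  config {
    bucket = "${var.tf_s3_bucket}"
    region = "${var.region}"
    key = "staging/terraform.tfstate"
  }
}

# e.g. peer against the VPC id that staging exposes as an output
# peer_vpc_id = "${terraform_remote_state.staging.output.vpc_id}"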

I have a locked-down dev VPC which is automatically peered with all other VPCs and allowed to ssh into them, but can’t connect to any other ports in those VPCs.  (Using security groups, but also network ACLs for an extra sanity check.)  And none of the other VPCs are peered with each other, so none of the test or staging or prod environments can accidentally connect to each other.

I ran into a few things while setting up peering.  (Relevant context: I have both 4 public subnets and 4 private NAT subnets for each VPC, one subnet per availability zone.)

  • First, like I said, I had to refactor my aws_route_table into a bunch of aws_route resources, because I didn’t want the route tables to look the same for every environment (staging shouldn’t be able to talk to prod but dev should, etc).
  • If you own both VPCs, you can set up auto-accept, which is super rad.  If not, someone has to go to the console and click ok somewhere.
  • You need to include your “owner id” (your 12-digit AWS account ID) in the peering config, which confused me for a bit, but you just have to log in as the root account and look under billing somewhere.  (I don’t remember where, google it.)
  • Peering has to be set up in both directions before connections will actually work.  I naively assumed that if it was set up and auto-accepted from VPC-A to VPC-B, connections from VPC-A to VPC-B would work.  Nope!  You also have to establish the peering from VPC-B to VPC-A before either direction will work.
  • All public subnets share a single route table, but each private subnet has its own (necessary for NAT).  So I had to set up peering routes in every single one of the private route tables that I wanted to be able to connect out from.

Here’s the gist to the networking portion of my aws_vpc module.  (The rest of the module is mostly just security groups.)

And some sample peering configs (you need one for each VPC, like I mentioned, so it’s bidirectional for each pair).  Here’s a gist snippet from the dev side, and the paired snippet from the staging side.
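
Since the gists aren’t embedded here, a stripped-down sketch of the dev-side half might look something like this (variable names are illustrative, not my actual config; the staging side mirrors it with the VPC ids and CIDRs swapped):

# dev side: request + auto-accept the peering connection
resource "aws_vpc_peering_connection" "dev_to_staging" {
  vpc_id = "${var.dev_vpc_id}"
  peer_vpc_id = "${terraform_remote_state.staging.output.vpc_id}"
  peer_owner_id = "${var.aws_account_id}"
  auto_accept = true
  tags {
    Name = "dev_to_staging_peering"
  }
}

# route from dev's public route table into the staging CIDR over the peering connection
resource "aws_route" "dev_to_staging" {
  route_table_id = "${var.dev_public_route_table_id}"
  destination_cidr_block = "${var.staging_cidr}"
  vpc_peering_connection_id = "${aws_vpc_peering_connection.dev_to_staging.id}"
}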

(You can tell how confident I was in these changes by how I named the resources, and added blamey “Author” tags for a coworker who hadn’t actually started working with me yet.  I don’t think he’s noticed yet, lol.)

NAT gateways, IGWs

You probably already set up an IGW resource so your public subnets can talk to the internet.  Attach it to the VPC and add a default route to it in your public route table, easy breezy:


resource "aws_internet_gateway" "mod" {
  vpc_id = "${aws_vpc.mod.id}"
  tags { 
    Name = "${var.env}_igw"
  }
}

# add a public gateway to each public route table
resource "aws_route" "public_gateway_route" {
  route_table_id = "${aws_route_table.public.id}"
  depends_on = ["aws_route_table.public"]
  destination_cidr_block = "0.0.0.0/0"
  gateway_id = "${aws_internet_gateway.mod.id}"
}

Lots of people seem to still be setting up custom Linux boxes to NAT traffic out from private subnets to the internet.  (A prominent internet service provider had an outage a couple weeks ago because they were doing this.)

Use NAT gateways instead, if you can.  They are basically just like ELBs but for natting out to the internet.  They scale out according to throughput in roughly the same way, up to 10 Gbps bursts.

BUT MIND THE FUCKING TRAP.  You do not attach these NAT gateways to your PRIVATE subnets, you attach them to the PUBLIC FUCKING SUBNETS, and then add a route from each private subnet’s route table to that gateway.  Gahhhhhh.

resource "aws_eip" "nat_eip" {
  count    = "${length(split(",", var.public_ranges))}"
  vpc = true
}

resource "aws_nat_gateway" "nat_gw" {
  count = "${length(split(",", var.public_ranges))}"
  allocation_id = "${element(aws_eip.nat_eip.*.id, count.index)}"
  subnet_id = "${element(aws_subnet.public.*.id, count.index)}"
  depends_on = ["aws_internet_gateway.mod"]
}

# for each of the private ranges, create a "private" route table.
resource "aws_route_table" "private" {
  vpc_id = "${aws_vpc.mod.id}"
  count = "${length(compact(split(",", var.private_ranges)))}"
  tags { 
    Name = "${var.env}_private_subnet_route_table_${count.index}"
  }
}
# add a nat gateway to each private subnet's route table
resource "aws_route" "private_nat_gateway_route" {
  count = "${length(compact(split(",", var.private_ranges)))}"
  route_table_id = "${element(aws_route_table.private.*.id, count.index)}"
  destination_cidr_block = "0.0.0.0/0"
  depends_on = ["aws_route_table.private"]
  nat_gateway_id = "${element(aws_nat_gateway.nat_gw.*.id, count.index)}"
}

(Thank you @ebroder, I would probably NEVER have figured this out on my own.  AWS docs are completely unintelligible on the subject.)

A note on ELB SGs

Oh … you probably know this, but your ELBs should be in a separate / more permissive SG than the instances backing those ELBs.  You don’t want people to be able to connect directly to e.g. port 80 or 8080 on an application host, bypassing the ELB.
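
In terraform terms that works out to something like this sketch — names and ports are illustrative (8080 as the app port), the point is that the instance SG only allows ingress from the ELB’s SG:

# permissive SG for the ELB itself
resource "aws_security_group" "web_elb_sg" {
  name = "${var.env}_web_elb_sg"
  vpc_id = "${aws_vpc.mod.id}"
  ingress {
    from_port = 443
    to_port = 443
    protocol = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# instances only accept the app port from the ELB's SG, not from the internet
resource "aws_security_group" "web_instance_sg" {
  name = "${var.env}_web_instance_sg"
  vpc_id = "${aws_vpc.mod.id}"
  ingress {
    from_port = 8080
    to_port = 8080
    protocol = "tcp"
    security_groups = ["${aws_security_group.web_elb_sg.id}"]
  }
}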

ELB certificates

If you live in us-east, use the new AWS certificate manager.  It’s free and you’ll never have to worry about cert expirations ever ever again.

If you don’t — or if you didn’t notice the announcement LITERALLY A FEW DAYS BEFORE you purchased your own Digicert wildcard cert (wahhhh) — you should just add the cert to your ELB in the console and refer to the ARN in your tf configs, because otherwise your private key will be in the state file.
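
Referring to the ARN looks roughly like this (the resource and the variable holding the ARN are placeholders) — only the ARN string ends up in your config and state, not the key material:

resource "aws_elb" "web" {
  name = "${var.env}-web-elb"
  subnets = ["${aws_subnet.public.*.id}"]
  security_groups = ["${aws_security_group.web_elb_sg.id}"]

  listener {
    lb_port = 443
    lb_protocol = "https"
    instance_port = 8080
    instance_protocol = "http"
    # cert uploaded out of band (console / IAM / ACM); only the ARN appears here
    ssl_certificate_id = "${var.wildcard_cert_arn}"
  }
}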

Ok that’s it

Yesterday I spun up another whole new VPC clone by adding about 5 lines and copying a couple files + sed -e’ing the name of the environment.  Took about two minutes, felt like a fucking badass.  ^_^

I will now proceed to forget as much as possible about all the things I have learned about networking over the past two months.



Terraform, VPC, and why you want a tfstate file per env

Hey kids!  If you’ve been following along at home, you may have seen my earlier posts on getting started with terraform and figuring out what AWS network topology to use.  You can think of this one as like if those two posts got drunk and hooked up and had a bastard hell child.

Some context: our terraform config had been pretty stable for a few weeks.  After I got it set up, I hardly ever needed to touch it.  This was an explicit goal of mine.  (I have strong feelings about delegation of authority and not using your orchestration layer for configuration, but that’s for another day.)

And then one day I decided to test drive Aurora in staging, and everything exploded.

Trigger warning: rants and scary stories about computers ahead.  The first half of this is basically a post mortem, plus some tips I learned about debugging terraform outages.  The second half is about why you should care about multiple state files, how to set up and manage multiple state files, and the migration process.

First, the nightmare

This started when I made a simple tf module for Aurora, spun it up in staging, assigned a CNAME, turned on multi-AZ support for RDS, was really just fussing around with minor crap in staging, like you do.  So I can’t even positively identify which change it was that triggered it, but around 11 pm Thursday night the *entire fucking world* blew up.  Any terraform command I tried to run would just crash and dump a giant crash.log.

Terraform crashed! This is always indicative of a bug within Terraform.

So I start debugging, right?  I start backing out changes bit by bit.  I removed the Aurora module completely.  I backed out several hours’ worth of changes.  It became clear that the tfstate file must be “poisoned” in some way, so I disabled remote state storage and started performing surgery on the local tfstate file, extracting all the Aurora references and anything else I could think of, just trying to get tf plan to run without crashing.

What was especially crazy-making was the fact that I could apply *any* of the modules or resources to *any* of the environments independently.  For example:

 $ for n in modules/* ; do terraform plan -target=module.staging_$(basename $n) ; done

… would totally work!  But “terraform plan” in the top level directory took a giant dump.

I stabbed at this a bunch of different ways.  I scrolled through a bunch of 100k line crash logs and grepped around for things like “ERROR”.  I straced the terraform runs, I deconstructed tens of thousands of lines of tfstates.  I spent far too much time investigating the resources that were reporting “errors” at the end of the run, which — spoiler alert — are a total red herring.  Stuff like this —

 module.staging_vpc.aws_eip.nat_eip.0: Refreshing state... (ID: eipalloc-bd6987da)
 Error refreshing state: 34 error(s) occurred:

 * aws_s3_bucket.hound-terraform-state: unexpected EOF
 * aws_rds_cluster_instance.aurora_instances.0: unexpected EOF
 * aws_s3_bucket.hound-deploy-artifacts: unexpected EOF
 * aws_route53_record.aurora_rds: connection is shut down

It’s all totally irrelevant, disregard.

It’s like 5 am by this point which is why I feel only slightly less fucking retarded about the fact that @stack72 had to gently point out that all you have to do is find the “panic” in the crash log, because terraform is written in Go so OF COURSE THERE IS A PANIC buried somewhere in all that dependency graph spew.  God, I felt like such an idiot.

The painful recovery

I spent several hours trying to figure out how to recover gracefully by isolating and removing whatever was poisoned in my state.  Unfortunately, some of the techniques I used to try and systematically validate individual components or try to narrow down the scope of the problem ended up making things much, much worse.

For example: applying individual modules (“terraform apply -target={module}”) can be extremely problematic.  I haven’t been able to get an exact repro yet, but it seems to play out something like this: if you’re applying a module that depends on other resources that get dynamically generated and passed into it, but you aren’t applying the modules that do that work in the same run, terraform will sometimes just create them again.

Like, a whole duplicate set of all your DNS zones with the same domain names but different zone ids, or a duplicate set of VPCs with the same VPC names and subnets, routes, etc, and you only find out when it gets to a point where it tries to create one of those rare resources where AWS actually enforces unique names, like autoscaling groups.

And yes, I do run tf plan.  Religiously.  But when you’re already in a bad state and you’re trying to do unhealthy things with mutated or corrupt tfstate files … shit happens.  And thus, by the time this was all over:

I ended up deleting my entire infrastructure by hand, three times.

I’m literally talking about clicking select-all-delete in the console on fifty different pages, writing grungy shell scripts to cycle through and delete every single AWS resource, purging local cache files, tracking down resources that exist but don’t show up on the console via the CLI, etc.

Of course every time I purged and started from scratch, it had to create a new zone ID for the root domain, so I had to go back and update my registrar with the new nameservers, because each AWS zone ID is associated with a different delegation set of nameservers.  Meanwhile our email, website, API etc were all unresolvable.

If we were in production, this would have been one of the worst outages of my career, and that’s … saying a lot.

So … Hi!  I learned a bunch of things from this.  And I am *SO GLAD* I got to learn them before we have any customers!

The lessons learned

(in order of least important to most important)

1. Beware of accidental duplicates.

It actually really pisses me off how easily you can just nonchalantly and unwarily create a duplicate VPC or Route53 zone with the same name, same delegation, same subnets, etc.  Like why would anyone ever WANT that behavior?  I blame AWS for this, not Hashicorp, but jesus christ.

So that’s going on my “Wrapper Script TODO” list: literally check AWS for any existing VPC or Route53 zone with the same name, and bail on dupes in the plan phase.

(And by the way — this outage was hardly the first time I’ve run into this.  I’ve had tf unexpectedly spin up duplicate VPCs many, many, many times.  It is *not* because I am running apply from the wrong directory, I’ve checked for any telltale stray .terraforms.  I usually don’t even notice for a while, so it’s hard to figure out what exactly I did to cause it, but it definitely seems related to applying modules or subsets of a full run.  Anyway, I am literally setting up a monitoring check for duplicate VPC names, which is ridiculous but whatever, maybe it will help me track this down.)

(Also, I really wish there was a $TF_ROOT.)

2. Tag your TF-owned resources, and have a kill script.

This is one of many great tips from @knuckolls that I am totally stealing.  Every taggable resource that’s managed by terraform, give it a tag like “Terraform: true”, and write some stupid boto script that will just run in a loop until it has successfully terminated everything with that tag + maybe an env tag.  (You probably want to explicitly exclude some resources, like your root DNS zone id, S3 buckets, data backups.)
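
The terraform half of that is just a tag on everything; the boto half is left as an exercise.  Something like this (resource and variable names made up):

resource "aws_instance" "example" {
  ami = "${var.ami_id}"
  instance_type = "t2.small"
  subnet_id = "${element(aws_subnet.private.*.id, 0)}"
  tags {
    Name = "${var.env}_example"
    Environment = "${var.env}"
    # the kill script selects on this tag (plus Environment), and skips the
    # explicitly excluded stuff like root DNS zones, S3 buckets, backups
    Terraform = "true"
  }
}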

But if you get into a state where you *can’t* run terraform destroy, but you *could* spin up from a clean slate, you’re gonna want this prepped.  And I have now been in that situation at least four or five times not counting this one.  Next time it happens, I’ll be ready.

Which brings me to my last and most important point — actually the whole reason I’m even writing this blog post.  (she says sheepishly, 1000 words later.)

3. Use separate state files for each environment.

Sorry, let me try this again in a font that better reflects my feelings on the subject:

USE SEPARATE STATE FILES.

FOR EVERY ENVIRONMENT.

This is about limiting your blast radius.  Remember: this whole thing started when I made some simple changes to staging.

To STAGING.

But all my environments shared a state file, so when something bad happened to that state file they all got equally fucked.

If you can’t safely test your changes in isolation away from prod, you don’t have infrastructure as code.

Look, you all know how I feel about terraform by now.  I love composable infra, terraform is the leader in the space, I love the energy of the community and the responsiveness of the tech leads.  I am hitching my wagon to this horse.

It is still as green as the motherfucking Shire and you should assume that every change you make could destroy the world.  So your job as a responsible engineer is to add guard rails, build a clear promotion path for validating changesets into production, and limit the scope of the world it is capable of destroying.  This means separate state files.

So Monday I sat down and spent like 10 hours pounding out a version of what this could look like.  There aren’t many best practices to refer to, and I’m not claiming my practices are the bestest, I’m just saying I built this thing and it makes sense to me and I feel a lot better about my infra now.  I look forward to seeing how it breaks and what kinds of future problems it causes for me.


HOWTO: Migrating to multiple state files

Previous config layout, single state file

If you want to see some filesystem ascii art about how everything was laid out pre-Monday, here you go.

(figure: terraform-layout-old)

Basically: we used s3 remote storage, in a bucket with versioning turned on.  There was a top-level .tf file for each environment {production,dev,staging}, top-level .tf files for env-independent resources {iam,s3}, and everything else in a module.

Each env.tf file would call out to modules to build its VPC, public/private subnets, IGW, NAT gateway, security groups, public/private route53 subdomains, an auto-scaling group for each service (including launch config and internal or external ELB), bastion hosts, external DNS, tag resources, and so forth.

New config layout, with state file per environment

In the new world there’s one directory per environment and one base directory, each of which has its own remote state file.  (You source the init.sh file to initialize a new environment; after that it just works if you run tf commands from that directory.)  “Base” has a few resources that don’t correspond to any environment — s3 buckets, certain IAM roles and policies, the root route53 zone.

Here’s an ascii representation of the new filesystem layout (figure: terraform-layout-current).

All env subdirectories have a symlink to ../variables.tf and ../initialize.tf.  Variables.tf declares the default variables that are shared by all environments — things like

$var.region, $var.domain, $var.tf_s3_bucket

Initialize.tf contains empty variable declarations for the variables that will be populated in each env’s .tfvars file, things like

$var.cidr, $var.public_subnets, $var.env, $var.subdomain_int_name

Other than that, each environment just invokes the same set of modules the same way they did before.
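
A condensed sketch of those pieces, with made-up values (the real files are obviously longer):

# variables.tf (symlinked into every env dir): shared defaults
variable "region" { default = "us-east-1" }
variable "domain" { default = "example.com" }
variable "tf_s3_bucket" { default = "example-terraform-state" }

# initialize.tf (also symlinked): empty declarations, filled in per environment
variable "env" {}
variable "cidr" {}
variable "public_subnets" {}
variable "subdomain_int_name" {}

# staging.tfvars: the values for this particular environment
# env = "staging"
# cidr = "10.12.0.0/16"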

The thing that makes all this possible?  Is this little sweetheart, terraform_remote_state:

resource "terraform_remote_state" "master_state" {
  backend = "s3"
  config {
    bucket = "${var.tf_s3_bucket}"
    region = "${var.region}"
    key = "${var.master_state_file}"
  }
}

It was not at all apparent to me from the docs that you could not only store your remote state, but also query values from it.  So I can set up my root DNS zones in the base environment, and then ask for the zone identifiers in every other module after that.

module "subdomain_dns" {
  source = "../modules/subdomain_dns"
  root_public_zone_id = "${terraform_remote_state.master_state.output.route53_public_zone}"
  root_private_zone_id = "${terraform_remote_state.master_state.output.route53_internal_zone}"
  subdomain_int = "${var.subdomain_int_name}"
  subdomain_ext = "${var.subdomain_ext_name}"
}

How. Fucking. Magic. is that.

SERIOUSLY.  This feels so much cleaner and better.  I removed hundreds of lines of code in the refactor.

(I hear Atlas doesn’t support symlinks, which is unfortunate, because I am already in love with this model.  If I couldn’t use symlinks, I would probably use a Makefile that copied the file into each env subdir and constructed the tf commands.)

Rolling it out

Switching from single statefile to multiple state files was by far the trickiest part of the refactor.  First, I started by building a new dev environment from scratch just to prove that it would work.

Second, I did a “terraform destroy -target=module.staging”, then recreated it from the env_staging directory by running “./init.sh ; terraform plan -var-file=./staging.tfvars” (and the matching apply).  Super easy, worked on the first try.

For production and base though, I decided to try doing a live migration from the shared state file to separate state files without any production impact.  This was mostly for my own entertainment and to prove that it could be done.  And it WAS doable, and I did it, but it was preeeettty delicate work and took about as long as literally everything else combined. (~5 hours?).  Soooo much careful massaging of tfstate files.

(Incidentally, if you ever have a syntax error in a 35k line JSON file and you want to find what line it’s on, I highly recommend http://jsonlint.com.  Or perhaps just reconsider your life choices.)

Stuff I still don’t like

There’s too much copypasta between environments, even with modules.  Honestly, if I could pass around maps and interpolate $var.env into every resource name, I could get rid of _so_ much code.  But Paul Hinze says that’s a bad idea that would make the graph less predictable and he’s smarter than me so I believe him.

TODO

There is lots left to do, around safety and tooling and sanity checks and helping people not accidentally clobber things.  I haven’t bothered making it safe for multiple engineers because right now it’s just me.

This is already super long so I’m gonna wrap it up.  I still have a grab bag of tf tips and tricks and “things that took me hours/days to figure out that lots of people don’t seem to know either”, so I’ll probably try and dump that out too before I’ve moved on to the next thing.

Hope this is helpful for some of you.  Love to hear your feedback, or any other creative ways that y’all have gotten around the single-statefile hazard!

P.S. Me after the refactor was done =>

 

 


Two weeks with Terraform

I’ve been using terraform regularly for 2-3 weeks now.  I have terraformed in rage, I have terraformed in delight.  I thought it might be helpful to share some of my notes and lessons learned.

Why Terraform?

Because I am fucking sick and tired of not having versioned infrastructure.  Jesus christ, the ways my teams have bent over backwards to fake infra versioning after the fact (nagios checks running ec2 diffs, anyone?).

Because I am starting from scratch on a green field project, so I have the luxury of experimenting without screwing over existing customers.  Because I generally respect Hashicorp and think they’re on the right path more often than not.

If you want versioned infra, you basically get to choose between 1) AWS CloudFormation and its wrappers (sparkleformation, troposphere), 2) chef-provisioner, and 3) Terraform.

The orchestration space is very green, but I think Terraform is the standout option.  (More about why later.)  There is precious little evidence that TF was developed by or for anyone with experience running production systems at scale, but it’s … definitely not as actively hostile as CloudFormation, so it’s got that going for it.

First impressions

Stage one: my terraform experiment started out great.  I read a bunch of stuff and quickly spun up a VPC with public/private subnets, NAT, routes, IAM roles etc in < 2 days.  This would be nontrivial to do in two days *without* learning a new tool, so TOTAL JOY.

Stage two: spinning up services.  This is where I started being like … “huh.  Has anyone ever actually used this thing?  For a real thing?  In production?”  Many of the patterns that seemed obvious and correct to me about how to build robust AWS services were completely absent, like any concept of a subnet tier spanning availability zones.  I did some inexcusably horrible things with variables to get the behavior I wanted.

Stage three: … modules.  Yo, all I wanted to do was refactor a perfectly good working config into modules for VPC, security groups, IAM roles/policies/users/groups/profiles, S3 buckets/configs/policies, autoscaling groups, policies, etc, and my entire fucking world just took a dump for a week.  SURE, I was a TF noob making noob mistakes, but I could not believe how hard it was to debug literally anything.

This is when I started tweeting sad things.

The best (only) way of debugging terraform was just reading really, really carefully, copy-pasting back and forth between multiple files for hours to get all the variables/outputs/interpolation correct.  Many of the error messages lack any context or line numbers to help you track down the problem.  Take this prime specimen:

Error downloading modules: module aws_vpc: Error loading .terraform/modules/77a846c64ead69ab51558f8c5be2cc44/main.tf: Error reading config for aws_route_table[private]: parse error: syntax error

Any guesses?  Turned out to be a stray ‘}’ on line 105 in a different file, which HCL vim syntax highlighting thought was A-OK.  That one took me a couple hours to track down.

Or this:

aws_security_group.zookeeper_sg: cannot parse '' as int: 
strconv.ParseInt: parsing "": invalid syntax

Which *obviously* means you didn’t explicitly define some inherited port as an int, so there’s a string lurking somewhere in your tf tree.  (*Obviously* in retrospect, I mean, after quite a long time poking haplessly about.)

Later on I developed more sophisticated patterns for debugging terraform.  Like, uhhh, bisecting my diffs by commenting out half of the lines I had just added, then gradually re-adding or re-commenting out more lines until the error went away.

Security groups are the worst for this.  SO MANY TIMES I had security group diffs run cleanly with “tf apply”, but then claim to be modifying themselves over and over.  Sometimes I would track this down to having passed in a variable for a port number or range, e.g. cidr_blocks = ["${var.ip_range}"].  Hard-coding it to cidr_blocks = ["10.0.0.0/8"] or setting the type explicitly would resolve the problem.  Or if I accidentally entered a CIDR range that AWS didn’t like, like 10.0.20.0/20 instead of 10.0.16.0/20.  The change would apply and usually it would work, it just didn’t think it had worked, or something.  TF wasn’t aware there was a problem with the run so it would just keep “successfully” reapplying the diff every time it ran.

Some advice for TF noobs

  • As @phinze told me, “modules are basically like functions — a variable is an argument, output is a return value”.  This was helpful, because that was completely unintuitive to me when I started refactoring.  It took a few days of wrestling with profoundly inscrutable error messages before modules really clicked for me.
  • Strings.  Lists.  You can only pass variables around as strings.  Split() and join() are your friends.  Oh my god I would sell so many innocent children for the ability to pass maps back and forth between modules.
  • No interpolation for resource names makes me so sad.  Basically you can either use local variable maps, or multiple lists and just … run those index counters like a boss, I guess.
  • Use AWS termination protection for stateful services or anything risky once you’re in production.  Use create_before_destroy on resources like ASG launch configs (there’s a sketch of that right after this list).  Use prevent_destroy where you must — but as sparingly as possible, because that basically breaks the entire TF model.
  • If you change the launch config for an ASG, like replacing the AMI for example, you might expect TF to kick off an instance recycle.  It will not.  You must manually terminate the instances to pick up the new config.
  • If you’re collaborating with a team — ok, even if you’re not — find a remote place to store the tfstate files.  Try S3 or github, or shell out for Atlas.  Local state on laptops is for losers.
  • TF_LOG=DEBUG has never once been helpful to me.  I can only assume it was written for the Hashicorp developers, not for those of us using the product.
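
On the create_before_destroy / launch config point above, the lifecycle dance looks roughly like this (names and sizes are made up) — and again, swapping the AMI gives you a new launch config but does not recycle the running instances:

resource "aws_launch_configuration" "web_lc" {
  # no explicit name, so terraform can create the replacement before destroying the old one
  image_id = "${var.ami_id}"
  instance_type = "t2.small"
  security_groups = ["${aws_security_group.web_instance_sg.id}"]

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_autoscaling_group" "web_asg" {
  name = "${var.env}_web_asg"
  launch_configuration = "${aws_launch_configuration.web_lc.name}"
  vpc_zone_identifier = ["${aws_subnet.private.*.id}"]
  min_size = 2
  max_size = 4
}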

Errors returned by AWS are completely opaque.  Like “You were not allowed to apply this update”.  Huh?  Ok well if it fails on “tf plan”, it’s probably a bad terraform config.  If it successfully plans but fails on “tf apply”, your AWS logic is probably at fault.

Terraform does not do a great job of surfacing AWS errors.

For example, here is some terraform output:

tf output: "* aws_route_table.private: InvalidNatGatewayID.NotFound
: The natGateway ID 'nat-0e5f4ea507113b423' does not exist"

Oh!  Okay, I go to the AWS console and track down that NAT gateway object and find this:

"Elastic IP address [eipalloc-8583b7e1] is already associated"

Hey, that seems useful!  Seems like TF just timed out bringing up one of the route tables, so it tried assigning the same EIP twice.  It would be nice to surface more of this detail into the terraform output, I hate having to resort to a web console.

Last but not least: one time I changed the comment string on a security group, and “tf plan” went into an infinite dependency loop.  I had to roll back the change, run terraform destroy against all resources in a bash for loop, and create a new security group with all new instances/ASGs just to change the comment string.  You cannot change comment strings or descriptions for resources without the resources being destroyed.  This seems PROFOUNDLY weird to me.

Wrapper scripts

Lots of people seem to eventually end up wrapping terraform with a script.  Why?

  • There is no concept of a $TF_ROOT.  If you run tf from the wrong directory, it will do some seriously confusing and screwed up shit (like duping your config, but only some of it).
  • If you’re running in production, you prob do not want people to be able to accidentally “terraform destroy” the world with the wrong environment
  • You want to enforce test/staging environments, and promotion of changes to production after they are proven good
  • You want to automatically re-run “tf plan” after “tf apply” and make sure your resources have converged cleanly.
  • So you can add slack hooks, or hipchat hooks, or github hooks.
  • Ummm, have I mentioned that TF can feel somewhat undebuggable?  Several people have told me they create rake tasks or YML templates that they then generate .tf files from so they can debug those when things break.  (Erf …)

Okay, so …..

God, it feels like I’ve barely gotten started but I should probably wrap it up.[*]  Like I said, I think terraform is best in class for infra orchestration.  And orchestration is a thing that I desperately want.  Orchestration and composability are the future of infrastructure.

But also terraform is green as fuck and I would not recommend it to anyone who needs a 4-nines platform.

Simply put, there is a lot of shit I don’t want terraform touching.  I want terraform doing as little as possible.  I have already put a bunch of things into terraform that I plan on taking right back out again.  Like, you should never be running a userdata.sh script after TF has bootstrapped a node.  Yuck.  That is a job for your cfg management, or possibly a job for packer or a custom firstboot script, but never your orchestration tool!  I have already stuffed a bunch of Route53 DNS into TF and I will be ripping that right back out soon.  Terraform should not be managing any kind of dynamic data.  Or service registry, or configs, or ….

Terraform is fantastic for defining the bones of your infrastructure.  Your networking, your NAT, autoscaling groups, the bits that are robust and rarely change.  Or spinning up replicas of production on every changeset via Travis-CI or Jenkins — yay!  Do that!

But I would not feel safe making TF changes to production every day.  And you should delegate any kind of reactive scaling to ASGs or containers+scheduler or whatever.  I would never want terraform to interfere with those decisions on some arbitrary future run.

Which is why it is important to note that terraform does not play nicely with others.  It wants to own the whole thing.  Monkeypatching TF onto an existing infra is kind of horrendous.  It would be nice if you could tag certain resources or products as “this is managed by some other system, thx”.

So: why terraform?

Well, it is fairly opinionated.  It’s actively developed by some really smart people.  It’s moving fast and has most of the momentum in the space.  It’s composable and interacts well with other players iff you make some good life choices.  (Packer, for example, is amazing, by far the most unixy utility of the Hashicorp library.)

Just look at the rate of bug fixes and releases for Terraform vs CloudFormation.  Set aside cross-platform compatibility etc, and just look at the energy of the respective communities.  Not even a fair fight.

Want more?  Ok, well I would rather adopt one opinionated philosophy for my infrastructure, supplementing where necessary, than duct tape together fifty different half baked philosophies about how software and infrastructure should work and spend all my time mediating their conflicts.  (This is one of my beefs with CloudFormation: AWS has no opinions, only slobbering, squidlike, directionless-flopping optionalities.  And while we’re on the topic it also has nothing like “tf plan” for previewing changes, so THAT’S PRETTY STUPID TOO.)

I do have some concerns about Hashicorp spreading themselves too thin on too many products.  Some of those products probably shouldn’t exist.  Meh.

Terraform has a ways to go before it feels predictable and debuggable, but I think it’s heading in the right direction.  It’s been a fun couple weeks and I’m excited to start contributing to the ecosystem and integrating with other components, like chef-solo & consul.

 

[*] OMGGGGGGG, I never even got to the glorious horrors of the terraforming gem and how you are most definitely going to end up manually editing your *.tfstate files.  Ahahahahaa.

[**] Major thanks to @phinze, @solarce, @ascendantlogic, @lusis, @progrium and others who helped me limp through my first few weeks.
