Two weeks with Terraform

I’ve been using terraform regularly for 2-3 weeks now.  I have terraformed in rage, I have terraformed in delight.  I thought it might be helpful to share some of my notes and lessons learned.

Why Terraform?

Because I am fucking sick and tired of not having versioned infrastructure.  Jesus christ, the ways my teams have bent over backwards to fake infra versioning after the fact (nagios checks running ec2 diffs, anyone?).

Because I am starting from scratch on a green field project, so I have the luxury of experimenting without screwing over existing customers.  Because I generally respect Hashicorp and think they’re on the right path more often than not.

If you want versioned infra, you basically get to choose between 1) AWS CloudFormation and its wrappers (sparkleformation, troposphere), 2) chef-provisioner, and 3) Terraform.

The orchestration space is very green, but I think Terraform is the standout option.  (More about why later.)  There is precious little evidence that TF was developed by or for anyone with experience running production systems at scale, but it’s … definitely not as actively hostile as CloudFormation, so it’s got that going for it.

First impressions

Stage one: my terraform experiment started out great.  I read a bunch of stuff and quickly spun up a VPC with public/private subnets, NAT, routes, IAM roles etc in < 2 days.  This would be nontrivial to do in two days *without* learning a new tool, so TOTAL JOY.

Stage two: spinning up services.  This is where I started being like … “huh.  Has anyone ever actually used this thing?  For a real thing?  In production?”  Many of the patterns that seemed obvious and correct to me about how to build robust AWS services were completely absent, like any concept of a subnet tier spanning availability zones.  I did some inexcusably horrible things with variables to get the behavior I wanted.

Stage three: … modules.  Yo, all I wanted to do was refactor a perfectly good working config into modules for VPC, security groups, IAM roles/policies/users/groups/profiles, S3 buckets/configs/policies, autoscaling groups, policies, etc, and my entire fucking world just took a dump for a week.  SURE, I was a TF noob making noob mistakes, but I could not believe how hard it was to debug literally anything.

This is when I started tweeting sad things.

The best (only) way of debugging terraform was just reading really, really carefully, copy-pasting back and forth between multiple files for hours to get all the variables/outputs/interpolation correct.  Many of the error messages lack any context or line numbers to help you track down the problem.  Take this prime specimen:

Error downloading modules: module aws_vpc: Error loading
.terraform/modules/77a846c64ead69ab51558f8c5be2cc44/main.tf:
Error reading config for aws_route_table[private]: parse error: syntax error

Any guesses?  Turned out to be a stray ‘}’ on line 105 in a different file, which HCL vim syntax highlighting thought was A-OK.  That one took me a couple hours to track down.

Or this:

aws_security_group.zookeeper_sg: cannot parse '' as int: 
strconv.ParseInt: parsing "": invalid syntax

Which *obviously* means you didn’t explicitly define some inherited port as an int, so there’s a string somewhere there lurking in your tf tree.  (*Obviously* in retrospect, I mean, after quite a long time poking haplessly about.)

Later on I developed more sophisticated patterns for debugging terraform.  Like, uhhh, bisecting my diffs by commenting out half of the lines I had just added, then gradually re-adding or re-commenting out more lines until the error went away.
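The comment-out-half routine is just a binary search, so here is a rough Python sketch of it.  Everything in it is an assumption for illustration: it presumes exactly one bad chunk, and still_fails is a hypothetical callback (in real life it would write the kept chunks to a .tf file and run "tf plan", which only works if the chunks don't depend on each other).

```python
from typing import Callable, List, Sequence


def find_bad_chunk(chunks: Sequence[str],
                   still_fails: Callable[[List[str]], bool]) -> int:
    """Return the index of the one chunk whose presence triggers the failure.

    Assumes exactly one culprit, and that still_fails(kept) is True whenever
    the culprit is among the kept chunks.
    """
    if not chunks:
        raise ValueError("nothing to bisect")
    lo, hi = 0, len(chunks)              # the culprit's index is in [lo, hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # "Comment out" the upper half of the suspect range and retest.
        if still_fails(list(chunks[lo:mid])):
            hi = mid                     # still broken: culprit is in the lower half
        else:
            lo = mid                     # error went away: culprit was commented out
    return lo


# Toy usage: five newly added blocks, the fourth one is broken.
added = ["vpc", "subnets", "routes", "zookeeper_sg", "asg"]
culprit = find_bad_chunk(added, lambda kept: "zookeeper_sg" in kept)
print(added[culprit])
```

With N new chunks this takes about log2(N) plan runs instead of N, which matters when every run is slow.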

Security groups are the worst for this.  SO MANY TIMES I had security group diffs run cleanly with “tf apply”, but then claim to be modifying themselves over and over.  Sometimes I would track this down to having passed in a variable for a port number or IP range, e.g. “cidr_blocks = ["${var.ip_range}"]”.  Hard-coding it to “cidr_blocks = ["10.0.0.0/8"]” or setting the type explicitly would resolve the problem.  Other times it was because I had accidentally entered a CIDR range that AWS didn’t like, like 10.0.20.0/20 instead of 10.0.16.0/20.  The change would apply and usually it would work; TF just didn’t think it had worked, or something.  It wasn’t aware there was a problem with the run, so it would just keep “successfully” reapplying the diff every time it ran.
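The misaligned-CIDR case is at least easy to catch before TF ever sees it.  A minimal sketch using Python's standard ipaddress module; the check_cidr helper is my own invention, not part of any Terraform tooling, and you would have to wire it into whatever generates or lints your variables.

```python
import ipaddress


def check_cidr(cidr: str) -> str:
    """Return cidr unchanged if it sits on a network boundary, else raise."""
    try:
        # strict=True rejects addresses with host bits set, e.g. 10.0.20.0/20.
        ipaddress.ip_network(cidr, strict=True)
    except ValueError:
        # strict=False masks the host bits off, giving the range AWS wants.
        aligned = ipaddress.ip_network(cidr, strict=False)
        raise ValueError(
            f"{cidr} has host bits set; did you mean {aligned}?"
        ) from None
    return cidr


check_cidr("10.0.16.0/20")        # fine
try:
    check_cidr("10.0.20.0/20")
except ValueError as err:
    print(err)
```

A /20 covers sixteen /24s, so the third octet has to be a multiple of 16; 20 is not, which is why the second call complains and suggests 10.0.16.0/20.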

Some advice for TF noobs

  • As @phinze told me, “modules are basically like functions — a variable is an argument, output is a return value”.  This was helpful, because that was completely unintuitive to me when I started refactoring.  It took a few days of wrestling with profoundly inscrutable error messages before modules really clicked for me.
  • Strings.  Lists.  You can only pass variables around as strings.  Split() and join() are your friends.  Oh my god I would sell so many innocent children for the ability to pass maps back and forth between modules.
  • No interpolation for resource names makes me so sad.  Basically you can either use local variable maps, or multiple lists and just … run those index counters like a boss I guess.
  • Use AWS termination protection for stateful services or anything risky once you’re in production.  Use create_before_destroy on resources like ASG launch configs.  Use prevent_destroy (“don’t destroy”) where you must, but as sparingly as possible, because that basically breaks the entire TF model.
  • If you change the launch config for an ASG, like replacing the AMI for example, you might expect TF to kick off an instance recycle.  It will not.  You must manually terminate the instances to pick up the new config.
  • If you’re collaborating with a team — ok, even if you’re not — find a remote place to store the tfstate files.  Try S3 or github, or shell out for Atlas.  Local state on laptops is for losers.
  • TF_LOG=DEBUG has never once been helpful to me.  I can only assume it was written for the Hashicorp developers, not for those of us using the product.
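If the "modules are functions" and "everything is a string" advice above feels abstract, here it is as a Python analogy.  This is not Terraform syntax; vpc_module, its arguments, and the made-up subnet IDs are all hypothetical stand-ins for a module's variables and outputs.

```python
def vpc_module(name: str, azs: str, cidrs: str) -> dict:
    """Stand-in for a module: string variables in, string outputs out."""
    az_list = azs.split(",")              # split() the joined list back apart
    cidr_list = cidrs.split(",")
    if len(az_list) != len(cidr_list):
        raise ValueError("need exactly one CIDR per availability zone")
    # Pretend each availability zone got a subnet with a made-up ID.
    subnet_ids = [f"subnet-{name}-{az}" for az in az_list]
    return {"subnet_ids": ",".join(subnet_ids)}   # join() to hand a list back


# The caller has to join lists on the way in and split them on the way out.
outputs = vpc_module(
    name="prod",
    azs=",".join(["us-east-1a", "us-east-1b"]),
    cidrs=",".join(["10.0.0.0/20", "10.0.16.0/20"]),
)
subnets = outputs["subnet_ids"].split(",")
print(subnets)
```

The two parallel lists (azs and cidrs) are the "run those index counters" pattern: nothing ties them together except position, so a length check is about the only safety net you get.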

Errors returned by AWS are completely opaque.  Like “You were not allowed to apply this update”.  Huh?  Ok well if it fails on “tf plan”, it’s probably a bad terraform config.  If it successfully plans but fails on “tf apply”, your AWS logic is probably at fault.

Terraform does not do a great job of surfacing AWS errors.

For example, here is some terraform output:

tf output: "* aws_route_table.private: InvalidNatGatewayID.NotFound: The natGateway ID 'nat-0e5f4ea507113b423' does not exist"

Oh!~  Okay, I go to the AWS console and track down that NAT gateway object and find this:

"Elastic IP address [eipalloc-8583b7e1] is already associated"

Hey, that seems useful!  Seems like TF just timed out bringing up one of the route tables, so it tried assigning the same EIP twice.  It would be nice to surface more of this detail into the terraform output; I hate having to resort to a web console.

Last but not least: one time I changed the comment string on a security group, and “tf plan” went into an infinite dependency loop.  I had to roll back the change, run terraform destroy against all resources in a bash for loop, and create a new security group with all new instances/ASGs just to change the comment string.  You cannot change comment strings or descriptions for resources without the resources being destroyed.  This seems PROFOUNDLY weird to me.

Wrapper scripts

Lots of people seem to eventually end up wrapping terraform with a script.  Why?

  • There is no concept of a $TF_ROOT.  If you run tf from the wrong directory, it will do some seriously confusing and screwed up shit (like duping your config, but only some of it).
  • If you’re running in production, you prob do not want people to be able to accidentally “terraform destroy” the world with the wrong environment.
  • You want to enforce test/staging environments, and promotion of changes to production after they are proven good
  • You want to automatically re-run “tf plan” after “tf apply” and make sure your resources have converged cleanly.
  • So you can add slack hooks, or hipchat hooks, or github hooks.
  • Ummm, have I mentioned that TF can feel somewhat undebuggable?  Several people have told me they create rake tasks or YML templates that they then generate .tf files from so they can debug those when things break.  (Erf …)
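To make a few of those bullets concrete, here is a hedged sketch of the skeleton such a wrapper might have, in Python.  The .tf-root marker file, the environment names, and every function name are assumptions of mine, not Terraform features.  The one real Terraform detail it leans on is "terraform plan -detailed-exitcode", which exits 0 when there is nothing to change, 2 when changes are pending, and 1 on error.

```python
import subprocess
from pathlib import Path
from typing import Callable, List

ROOT_MARKER = ".tf-root"            # hypothetical marker file, not a Terraform feature
PROTECTED_ENVS = {"production"}     # hypothetical environment names


def find_tf_root(start: Path) -> Path:
    """Walk upward from start until a directory containing ROOT_MARKER is found."""
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / ROOT_MARKER).exists():
            return candidate
    raise RuntimeError(f"no {ROOT_MARKER} found above {start}; refusing to run")


def guard(env: str, subcommand: str) -> None:
    """Refuse 'destroy' against a protected environment."""
    if subcommand == "destroy" and env in PROTECTED_ENVS:
        raise PermissionError(f"'terraform destroy' is blocked for {env}")


def has_converged(plan_exit_code: int) -> bool:
    """Interpret the exit code of 'terraform plan -detailed-exitcode'."""
    if plan_exit_code == 0:
        return True                 # nothing left to change
    if plan_exit_code == 2:
        return False                # the diff is still there after apply
    raise RuntimeError(f"terraform plan failed with exit code {plan_exit_code}")


def run(cmd: List[str], cwd: Path) -> int:
    """Default runner: actually shell out to terraform."""
    return subprocess.run(cmd, cwd=cwd).returncode


def run_terraform(env: str, subcommand: str, extra_args: List[str],
                  start: Path, runner: Callable[[List[str], Path], int] = run) -> int:
    """Check the guard, then always run from the config root, never from cwd."""
    guard(env, subcommand)
    root = find_tf_root(start)
    return runner(["terraform", subcommand, *extra_args], root)


def apply_and_verify(env: str, start: Path,
                     runner: Callable[[List[str], Path], int] = run) -> bool:
    """Apply, then immediately re-plan to see whether resources converged."""
    if run_terraform(env, "apply", [], start, runner) != 0:
        raise RuntimeError("terraform apply failed")
    code = run_terraform(env, "plan", ["-detailed-exitcode"], start, runner)
    return has_converged(code)
```

Calling apply_and_verify("staging", Path.cwd()) would return False for exactly the kind of security group that keeps "successfully" modifying itself.  Picking the right variables per environment, promotion between environments, and chat hooks are all left out of this sketch.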

Okay, so …..

God, it feels like I’ve barely gotten started but I should probably wrap it up.[*]  Like I said, I think terraform is best in class for infra orchestration.  And orchestration is a thing that I desperately want.  Orchestration and composability are the future of infrastructure.

But also terraform is green as fuck and I would not recommend it to anyone who needs a 4-nines platform.

Simply put, there is a lot of shit I don’t want terraform touching.  I want terraform doing as little as possible.  I have already put a bunch of things into terraform that I plan on taking right back out again.  Like, you should never be running a userdata.sh script after TF has bootstrapped a node.  Yuck.  That is a job for your cfg management, or possibly a job for packer or a custom firstboot script, but never your orchestration tool!  I have already stuffed a bunch of Route53 DNS into TF and I will be ripping that right back out soon.  Terraform should not be managing any kind of dynamic data.  Or service registry, or configs, or ….

Terraform is fantastic for defining the bones of your infrastructure.  Your networking, your NAT, autoscaling groups, the bits that are robust and rarely change.  Or spinning up replicas of production on every changeset via Travis-CI or Jenkins — yay!  Do that!

But I would not feel safe making TF changes to production every day.  And you should delegate any kind of reactive scaling to ASGs or containers+scheduler or whatever.  I would never want terraform to interfere with those decisions on some arbitrary future run.

Which is why it is important to note that terraform does not play nicely with others.  It wants to own the whole thing.  Monkeypatching TF onto an existing infra is kind of horrendous.  It would be nice if you could tag certain resources or products as “this is managed by some other system, thx”.

So: why terraform?

Well, it is fairly opinionated.  It’s actively developed by some really smart people.  It’s moving fast and has most of the momentum in the space.  It’s composable and interacts well with other players iff you make some good life choices.  (Packer, for example, is amazing, by far the most unixy utility of the Hashicorp library.)

Just look at the rate of bug fixes and releases for Terraform vs CloudFormation.  Set aside crossplatform compatibility etc, and just look at the energy of the respective communities.  Not even a fair fight.

Want more?  Ok, well I would rather adopt one opinionated philosophy for my infrastructure, supplementing where necessary, than duct tape together fifty different half baked philosophies about how software and infrastructure should work and spend all my time mediating their conflicts.  (This is one of my beefs with CloudFormation: AWS has no opinions, only slobbering, squidlike, directionless-flopping optionalities.  And while we’re on the topic it also has nothing like “tf plan” for previewing changes, so THAT’S PRETTY STUPID TOO.)

I do have some concerns about Hashicorp spreading themselves too thin on too many products.  Some of those products probably shouldn’t exist.  Meh.

Terraform has a ways to go before it feels predictable and debuggable, but I think it’s heading in the right direction.  It’s been a fun couple weeks and I’m excited to start contributing to the ecosystem and integrating with other components, like chef-solo & consul.

 

[*] OMGGGGGGG, I never even got to the glorious horrors of the terraforming gem and how you are most definitely going to end up manually editing your *.tfstate files.  Ahahahahaa.

[**] Major thanks to @phinze, @solarce, @ascendantlogic, @lusis, @progrium and others who helped me limp through my first few weeks.


39 thoughts on “Two weeks with Terraform”

  1. One thing I would add is that there are now resources to manage route and sg entries OUTSIDE of the main security_group and such resources. Start using those today because then you can break out security group rule responsibility to individual terraform “projects” that share a common SG:

    https://www.terraform.io/docs/providers/aws/r/security_group_rule.html (vs. the whole sg as a resource)
    https://www.terraform.io/docs/providers/aws/r/route.html (vs. the whole route table as a resource)

    Awesome post!

  2. Doug says:

    Thanks for writing this up. I haven’t yet played with Terraform, but I’m suspecting I might be using it soon.

    What issues did you run into with managing DNS in Terraform? Rate of change? Externally managed records? What tool do you think is a better fit for managing DNS configuration in your system?

    1. I like to have an external domain (for www, MX etc) which can be managed either by hand or by terraform, and an internal domain which is managed completely by config management software. If you use autoscaling groups, terraform *can’t* manage DNS for those nodes because it has no knowledge of them. So I run chef-zero + a route53 cookbook to register internal DNS on first boot.

  3. pll says:

    Are you ever on #terraform-tool? I hold frequent rant sessions on the very topic of this blogpost. I feel like I could have written this. In fact, I feel like, if this were a movie, this is merely the teaser trailer. You left so many good rant-worthy subjects untouched. I look forward to your many future posts as you uncover more of the thrills of working with terraform. As they say (or maybe it’s just me), “It’s the absolute worst tool for cloud orchestration. Except for all the rest!” 🙂

    Cheers!

    1. no! is this in irc? I do hang out in #hashicorp on the hangops slack, but have basically abandoned irc as all my communities have slowly migrated there. 🙁 Would love some good rantbuddies tho.

      1. pll says:

        Yes, it’s in freenode. Every now and then a hashicorp person will show up, but mostly it’s just us terraformers slugging it out and trying to help each other out. I haven’t gone over to slack yet other than for company internal stuff. I hate the interface, and it doesn’t play nicely with other apps, especially my IRC client of choice, emacs. There are still some pretty strong communities on irc though, #terraform-tool, #puppet, #lopsa (almost exclusively sysadmins, most of us old gray-beards 🙂

        Drop on by sometime, I promise we can help fill your rant quota for the day 🙂

  4. And then the whole provisioner stuff. If you abort a run before the provisioning starts, the machines will not be recorded in the state file. Good luck removing them with destroy. If the provisioner fails, you know, that missing sudo? terraform destroy will completely ignore the fact that the machines are running.

    The security groups: why not create security groups and only after create the rules? Then this would actually work:

    # master and agent each allow ingress from the other, so the two
    # groups reference each other.
    resource "aws_security_group" "master" {
      name        = "master"
      description = "Master"
      ingress {
        from_port       = 5050
        to_port         = 5050
        protocol        = "tcp"
        security_groups = ["${aws_security_group.agent.name}"]
      }
    }

    resource "aws_security_group" "agent" {
      name        = "agent"
      description = "Agent"
      ingress {
        from_port       = 5051
        to_port         = 5051
        protocol        = "tcp"
        security_groups = ["${aws_security_group.master.name}"]
      }
    }

    Instead of having to fiddle with minimum 3 groups to make this work.
    Oh, and not being able to use ${path.module} in a variable…
    My few cents. Heh.

    1. lol yea, i’m staying so far away from the provisioner crap. division of labor && firstboot.sh script built into the AMI with packer, pulls down my chef tarball and runs it to do all that work.

      1. pll says:

        /me smacks head!

        Of course! Why on earth am I passing crap in via user_data and remote-execs ? I should just have that firstboot.sh script pull down my puppet tarball and be done with it! I’m off to fix that right now. Well, maybe not *right* now, but first thing in the morning! 🙂

        Thanks!

  5. dchoquette123 says:

    Great post! if you are interested in composable infrastructure automation you might want to check out http://rebar.digital/. For the reasons you state, we are actively architecting how Digital Rebar TF. Nothing wrong with opinionated config. The problem is that the approach does not work for everyone.

  6. RR says:

    Great post. One thing that astonished me about Terraform, though I haven’t used it yet, is the seeming lack of support for any conditional statements (if/else). Even Cloudformation has that, so it was unbelievable that not only was TF missing it but that no one was talking about it. How do you have one template for dev and prod? (In CFN, you just say if the parameter env is prod, it’s a medium Instance, if the environment is dev, small instance)

    1. pll says:

      @RR – The lack of conditionals is intentional. It’s a DSL, not a programming language. This is the same issue ansible suffers from as well.

      As far as using the same template for dev and prod, those are just different values for environment variable settings. One set for dev, one for prod. Tell terraform which set of variables to use, and it’s perfectly happy to do so.

      The lack of conditionals, though, is a pain. There are work-arounds using boolean logic, but they’re total hacks and make your brain twist in unnatural ways.

  7. rnhurt says:

    I know I’m a bit late to the party but I’ve built a CloudFormation tool that seems to work really well (at least for my purposes) called CFNBuilder (https://github.com/rnhurt/CFNBuilder). It’s still under some pretty heavy-duty construction but it does the job and I have used it in production. 🙂

  8. Márton Boros says:

    I struggled with terraform’s limited language, so here’s an example of how to replace it with a proper (purely functional) language that was in fact designed to assemble complex systems.
    https://github.com/brainrape/nixform
    I haven’t gotten around to adding type checking yet, but it should be pretty easy since nix has support for typesafe and validated config options.

  9. martonboros says:

    I was frustrated by terraform’s limited language too, so here’s an example of how to replace it with a proper (purely functional) language that was actually built to assemble complex systems.
    https://github.com/brainrape/nixform
    I haven’t gotten around to adding type checking yet, but it’s pretty easy using nix’s typesafe and validated config option support.

  10. Cole Bittel says:

    Solid article. Great writing.
    Much appreciated for your expertise and ability to speak like a normal human (and make us laugh!) about tech topics.

  11. Hey, Great post! You have left out at least one of the very capable contenders in the versioned infrastructure space, Cloudify would be another worth looking at.

  12. Great article.
    Like a few others replying, I’ve had some similar pain. I’m surprised you didn’t mention the joy that is manually hacking the tfstate file if, for whatever reason, you’ve had to change an EBS volume on an EC2 instance.

      1. pll says:

        Yes, dramatically. It also helps to remember that Terraform is a DSL and NOT a programming language. It still needs a wrapper for many programmatic things IMO, but it has gotten significantly better in the past 2+ years I’ve been using it. They also have quite an active developer community who are also incredibly responsive!

  13. Those errors you mentioned are straight up poetry. WOW! I thought i had it bad.

    Quick question. If I source a github terraform module. One I don’t have control over. Do I depend on outputs the AUTHOR included in their source? Or can I add new outputs as long as I reference their names properly?

    Because I keep getting this stupid

    Error: output ‘i_hate_terraform’: “aws_subnet.private.id” is not a valid output for module “bastion_vpc_template”

    And when I reference an output that *was* defined I still get crap:

    Error: output ‘i_hate_terraform’: “aws_vpc.id” is not a valid output for module “bastion_vpc_template”

    Please lord god, save my soul!

  14. charles says:

    As it turns out, I found a use for TF_LOG=DEBUG . After spending a few hours reading

    Error: Error finding the specified Linode DomainRecord: [002] unexpected end of JSON input

    over and over, setting TF_LOG revealed the reason — linode API was returning a 429 response to too much parallelism.

  15. Michele says:

    Guys, please, let me remind you this:

    The Abstraction Principle is one of the most fundamental principles in engineering. It states: The interface of a component should be independent of its implementation.

    http://www.cs.sjsu.edu/~pearce/modules/lectures/ood/principles/Abstraction.htm

    Terraform is just a domain language to more-or-less easily script various random services all together. It gives a lot of sense of control, yes, but how the hell can you have benefits for Web architectures, when there’s Openshift out there?
    If you need to script and automate strange architectures so, yes it looks eventually useful, but besides support for many services, what is the difference between terraform and ansible? what do you get from it that you can rely on? it’s just a library of services
