Scrapbag of useful Terraform tips

After some kinda ranty posts about terraform and AWS and networking and orchestration life in general, it feels like a good time to braindump some helpful tidbits.

 


Side note — a few people have asked me to open source my terraform config.  I’m actually super open to sharing it, but a) it’s still changing a lot and b) tf modules aren’t really reusable yet.  They just aren’t.  Eventually we’ll reach a maturity point where a tf module library makes sense, but open source tf modules haven’t been super helpful for me and I don’t expect mine will be any better.

I *did* embed a bunch of meaty gists in this post with some of the more interesting configs.  Let me know if you want to see more, I will happily send it to you, I just don’t want to be maintaining an open source repo right now.

So here are a few things that took me minutes, hours, or days to figure out, that hopefully will now take you less time.

ICMP for security groups

If you want your hosts to be pingable, you have to put this stanza in your security group.  The “from_port = 8” isn’t in the security group docs; I found it in this github issue.  Not being a networking person myself, I literally never would have guessed it.  If you want to read up more, here’s more about why.

  ingress {
    # for ICMP rules, from_port is the ICMP type and to_port is the ICMP code:
    # type 8, code 0 == echo request, aka ping
    from_port   = 8
    to_port     = 0
    protocol    = "icmp"
    cidr_blocks = ["0.0.0.0/0"]
  }

Here is a gist for my bastion security groups.  Note that all the security groups are in the aws_vpc module, which gets invoked separately by each environment.

Security groups are stackable in VPC, which is glorious.  But most of the time I thought I was having a problem with VPCs or routes or networking, it turned out to be a security group problem, or an interaction between the two.

Since you have no ability to debug AWS networking via normal Linux utilities, my best debugging tip for VPC networking is still: when you get really stuck, open up all the SG ports and see if that fixes it.  (Preferably in, you know, a staging environment, not prod …)

Resource description fields != comment fields

Do not use your resource description fields as comments about those resources.

It feels like they should be comment strings, doesn’t it?  Well, they aren’t.  If you change your “comment”, terraform will try to destroy and recreate the resource (which may or may not even work, if it’s, say, a security group that all your environments and other resources happen to inherit.  Hypothetically speaking.)

This isn’t a Hashicorp thing, it’s an AWS thing.  You can’t go edit the description in the console, either — try it!  It’s like a smelly, lingering remnant of the bad old days before we had tags.

So use tags, or use comments in your code.  Don’t use descriptions for documentation.

Picking VPC ranges

The largest range you can assign to any VPC is a /16 (around 65,000 addresses).  Probably don’t start your numbering with 10.0.0.0/16, just in case you ever want to peer with anyone else, who almost certainly started with 10.0.0.0/16 too.
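
For illustration, a bare-bones VPC resource with a range picked out of a less crowded corner of 10/8 (a sketch; the specific CIDR here is just an example, not my actual range):

# the /16 is the biggest block AWS will let you have; just don't make it 10.0.0.0/16
resource "aws_vpc" "mod" {
  cidr_block           = "10.70.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true
  tags {
    Name = "${var.env}_vpc"
  }
}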

Route tables

Only one route table can be associated with each subnet.  (Again, NONE of your routes will show up in netstat -nr or any of the normal Linux tools, which is fucking infuriating.)

I recommend not using aws_route_table with an inline blob of routes, but instead using aws_route resources.  These are additive resources, so it gives you more fine-grained control if you want different environments to have different routing tables.
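
In practice the route table resource itself ends up bare, and the routes get layered on separately.  Something like this sketch (the real version lives in the gists further down):

# a bare route table; aws_route resources attach routes to it additively
resource "aws_route_table" "public" {
  vpc_id = "${aws_vpc.mod.id}"
  tags {
    Name = "${var.env}_public_subnet_route_table"
  }
}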

Peering VPCs

Peering is so fucking rad.  I’m so, so happy with it.  Peering makes VPC-per-env tractable and flexible and not horribly annoying.

In order to peer VPCs, if you have a separate state file per environment (which you really should), you will need to import remote state.  It’s not very obvious from the documentation, but this is an incredibly powerful feature.  It lets you refer to variables from remote state files just like they were modules.

I use S3 for saving state, with versioning turned on for the bucket.
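
Pulling another environment’s state in looks something like this (a sketch: newer Terraform exposes terraform_remote_state as a data source, older versions as a resource, and the bucket/key names here are made up):

data "terraform_remote_state" "dev" {
  backend = "s3"
  config {
    bucket = "my-terraform-state"      # made-up bucket name
    key    = "dev/terraform.tfstate"
    region = "us-east-1"
  }
}

# then refer to that state file's outputs just like a module's outputs, e.g.
#   peer_vpc_id = "${data.terraform_remote_state.dev.vpc_id}"
# (recent Terraform versions nest these under .outputs)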

I have a locked-down dev VPC which is automatically peered with all other VPCs and allowed to ssh into them, but can’t connect to any other ports in those VPCs.  (Using security groups, but also network ACLs for an extra sanity check.)  And none of the other VPCs are peered with each other, so none of the test or staging or prod environments can accidentally connect to each other.

I ran into a few things while setting up peering.  (Relevant context: I have 4 public subnets and 4 private NAT subnets in each VPC, one of each per availability zone.)

  • First, like I said, I had to refactor my aws_route_table into a bunch of aws_route resources, because I didn’t want the route tables to look the same for every environment (staging shouldn’t be able to talk to prod but dev should, etc)
  • If you own both VPCs, you can set up auto-accept, which is super rad.  If not, someone has to go to the console and click ok somewhere.
  • You need to include your “owner id” in the peering config, which confused me for a bit but you just have to log in as the root account and look under billing somewhere.  (I don’t remember where, google it.)
  • Second, peering has to be set up in both directions before connections will actually work.  I naively assumed that if it was set up and auto-accepted from VPC-A to VPC-B, connections from VPC-A to VPC-B would work.  Nope!  You also have to establish the peering from VPC-B to VPC-A before either direction will work.
  • All public subnets share a single route table, but each private subnet has its own (necessary for NAT).  So I had to set up peering from every single one of the private subnets that I wanted to be able to connect out from.

Here’s the gist to the networking portion of my aws_vpc module.  (The rest of the module is mostly just security groups.)

And some sample peering configs (you need one for each VPC, like I mentioned, so it’s bidirectional for each pair).  Here’s a gist snippet from the dev side, and the paired snippet from the staging side.

(You can tell how confident I was in these changes by how I named the resources, and added blamey “Author” tags for a coworker who hadn’t actually started working with me yet.  I don’t think he’s noticed yet, lol.)
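
If it helps, here is roughly what one direction of a peering plus its private-subnet routes looks like.  This is a sketch, not a copy of those gists; variables like var.staging_vpc_id, var.aws_account_id, and var.staging_cidr are made-up names:

resource "aws_vpc_peering_connection" "dev_to_staging" {
  vpc_id        = "${aws_vpc.mod.id}"
  peer_vpc_id   = "${var.staging_vpc_id}"
  peer_owner_id = "${var.aws_account_id}"   # the owner id from the billing page
  auto_accept   = true                      # only works if you own both VPCs
  tags {
    Name = "${var.env}_to_staging_peering"
  }
}

# and a route to the peer's CIDR from each private route table
resource "aws_route" "private_to_staging" {
  count                     = "${length(compact(split(",", var.private_ranges)))}"
  route_table_id            = "${element(aws_route_table.private.*.id, count.index)}"
  destination_cidr_block    = "${var.staging_cidr}"
  vpc_peering_connection_id = "${aws_vpc_peering_connection.dev_to_staging.id}"
}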

NAT gateways, IGWs

You probably already set up an IGW resource so your public subnets can talk to the internet.  Attach it to the VPC and point a default route at it from your public route table, easy breezy:


resource "aws_internet_gateway" "mod" {
  vpc_id = "${aws_vpc.mod.id}"
  tags { 
    Name = "${var.env}_igw"
  }
}

# add a public gateway to each public route table
resource "aws_route" "public_gateway_route" {
  route_table_id = "${aws_route_table.public.id}"
  depends_on = ["aws_route_table.public"]
  destination_cidr_block = "0.0.0.0/0"
  gateway_id = "${aws_internet_gateway.mod.id}"
}

Lots of people seem to still be setting up custom Linux boxes to NAT traffic out from private subnets to the internet.  (A prominent internet service provider had an outage a couple weeks ago because they were doing this.)

Use NAT gateways instead, if you can.  They are basically just like ELBs but for natting out to the internet.  They scale out according to throughput in roughly the same way, up to 10 Gbps bursts.

BUT MIND THE FUCKING TRAP.  You do not attach these NAT gateways to your PRIVATE subnets, you attach them to the PUBLIC FUCKING SUBNETS, and then add a route from the private subnet to that gateway.  Gahhhhhh.

resource "aws_eip" "nat_eip" {
  count    = "${length(split(",", var.public_ranges))}"
  vpc = true
}

resource "aws_nat_gateway" "nat_gw" {
  count = "${length(split(",", var.public_ranges))}"
  allocation_id = "${element(aws_eip.nat_eip.*.id, count.index)}"
  subnet_id = "${element(aws_subnet.public.*.id, count.index)}"
  depends_on = ["aws_internet_gateway.mod"]
}

# for each of the private ranges, create a "private" route table.
resource "aws_route_table" "private" {
  vpc_id = "${aws_vpc.mod.id}"
  count = "${length(compact(split(",", var.private_ranges)))}"
  tags { 
    Name = "${var.env}_private_subnet_route_table_${count.index}"
  }
}
# add a nat gateway to each private subnet's route table
resource "aws_route" "private_nat_gateway_route" {
  count = "${length(compact(split(",", var.private_ranges)))}"
  route_table_id = "${element(aws_route_table.private.*.id, count.index)}"
  destination_cidr_block = "0.0.0.0/0"
  depends_on = ["aws_route_table.private"]
  nat_gateway_id = "${element(aws_nat_gateway.nat_gw.*.id, count.index)}"
}

(Thank you @ebroder, I would probably NEVER have figured this out on my own.  AWS docs are completely unintelligible on the subject.)

A note on ELB SGs

Oh … you probably know this, but your ELBs should be in a separate / more permissive SG than the instances backing those ELBs.  You don’t want people to be able to connect directly to e.g. port 80 or 8080 on an application host, bypassing the ELB.
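
Something along these lines (a sketch; the names and ports are made up): the ELB SG is open to the world on 443, while the app SG only accepts traffic that arrives via the ELB’s SG.

resource "aws_security_group" "elb" {
  name   = "${var.env}_elb"
  vpc_id = "${aws_vpc.mod.id}"
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# app instances only accept traffic that comes through the ELB's security group
resource "aws_security_group" "app" {
  name   = "${var.env}_app"
  vpc_id = "${aws_vpc.mod.id}"
  ingress {
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = ["${aws_security_group.elb.id}"]
  }
}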

ELB certificates

If you live in us-east, use the new AWS certificate manager.  It’s free and you’ll never have to worry about cert expirations ever ever again.

If you don’t — or if you didn’t notice the announcement LITERALLY A FEW DAYS BEFORE you purchased your own Digicert wildcard cert (wahhhh) — you should just add the cert to your ELB in the console and refer to the ARN in your tf configs, because otherwise your private key will be in the state file.
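
Concretely, the ELB listener just refers to the cert by ARN (a sketch; the ARN below is a placeholder for whatever IAM or ACM gives you), so the private key never has to pass through your tf files or your state:

resource "aws_elb" "web" {
  name            = "${var.env}-web"
  subnets         = ["${aws_subnet.public.*.id}"]
  security_groups = ["${aws_security_group.elb.id}"]   # the permissive ELB SG from above

  listener {
    lb_port            = 443
    lb_protocol        = "https"
    instance_port      = 8080
    instance_protocol  = "http"
    ssl_certificate_id = "arn:aws:iam::123456789012:server-certificate/my-wildcard-cert"
  }
}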

Ok that’s it

Yesterday I spun up another whole new VPC clone by adding about 5 lines and copying a couple files + sed -e’ing the name of the environment.  Took about two minutes, felt like a fucking badass.  ^_^

I will now proceed to forget as much as possible about all the things I have learned about networking over the past two months.



22 thoughts on “Scrapbag of useful Terraform tips”

  1. jgoldschrafe says:

    Great post!

    One other useful thing to know about VPC NAT gateways is that compared to running your own NAT infrastructure, they’re potentially really expensive at $0.045 per GB of data processed. If you do a ton of traffic out to/from the Internet, you should look really carefully at your traffic rates before deciding if one route or the other is better for your specific use case.

  2. Awesome article! Thanks Charity, this is the only example I’ve found that uses a nat gateway rather than a nat instance which seemed too much for the project I’m planning out. Now that Terraform has elastic beanstalk support, I’ve just got to find a way to tie my RDS instance to it. I’m guessing through this example you posted I “hope” I can piece something together.

    1. Yeah, I was lucky enough to be writing my terraform config not long after the nat gateway support was released. 🙂 It is sooooo much better!

      You can always use the ARN of your existing RDS instance to refer to it, I think. Or *carefully* backport your existing RDS into your tf config as an aws rds resource, using the terraforming gem maybe.

      1. Thanks for the advice on using the ARN, didn’t even know about that. It’s been awhile since I used AWS so been playing catchup. Although I do see the ARN is an available attribute, there seems to be no way to “tie” it to the EB instance via the API. However, upon more research I learned that you should spin up RDS independently from EB and just use the credentials in your application to communicate with RDS. If you do tie EB with RDS and EB is shutdown, your RDS instance would go bye bye too. Note worthy for anyone else trying to figure this stuff out. Thanks again.

  3. EB could be Elastic Beanstalk. In EB they let you start an RDS as part of that stack, but if that stack dies from accidentally clicking “Terminate”, well your DB is gone too. What people can do instead is just build an RDS instance and then pass the user/url/dbname/pass to EB – somehow.

  4. bolaji says:

    Hi,
    I can see that you are able to use variable interpolation to set tags in your blog but I seem to get issues doing it. All it does is tag the literal strings. Please assist .. thanks

    resource "aws_internet_gateway" "mod" {
      vpc_id = "${aws_vpc.mod.id}"
      tags {
        Name = "${var.env}_igw"   # (i am trying to do something similar to this)
      }
    }

  5. Thank you for sharing your (seemingly painful?) learnings, Charity.

    I’m in a multi-account setup (one for prod, one for everything else), and struggling with how to get Terraform run by CI (without Atlas, anyway) to do the actual changes. Have you gotten to this point yet, or are you folks still running the actual ‘apply’ on your workstation(s)?

    1. We run it from dev boxes on EC2. The routing and peering make it possible for dev boxes to ssh out to other hosts, but not vice versa, and not between any two other environments. No, we do not run terraform apply directly from CI — what a terrifying idea that is. 🙂

    2. Or are you talking about just for testing changes? Yeah, we use the staging environment for testing changes rather than running it from CI. What is so hard about that though?

      1. Jason Harley says:

        Oh, not hard at all! I’m just trying to get an understanding of people’s workflows and trying to understand best practises. I’ve talked to a few folks who are using terraform via CI, but it seems like a bit of a “leap of faith” (pardon the term) at the moment and (as you wrote in another article) ideally you’re in a pretty steady state with a lot of this infrastructure setup.

        Thanks for the speedy reply!

  6. if only I had read your blog a couple of hours earlier… “You do not attach these NAT gateways to your PRIVATE subnets”
    There is a tiny mention in AWS docs at http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/vpc-nat-gateway.html “To create a NAT gateway, you must specify the public subnet in which the NAT gateway will reside” – something you’ll only notice once you already know what you are looking for 🙁

  7. I had the same issue with descriptions, except with Ansible. Imagine some nut putting whitespace in an otherwise immutable spot. I wanted to force choke them into oblivion.

  8. alanwtf says:

    Love your terraform posts, but I see you haven’t really posted much since 2016. Would be great to hear what has happened since, do you still love TF, what you think of the current status and plans for next version?
