Terraform, VPC, and why you want a tfstate file per env

Hey kids!  If you’ve been following along at home, you may have seen my earlier posts on getting started with terraform and figuring out what AWS network topology to use.  You can think of this one as like if those two posts got drunk and hooked up and had a bastard hell child.

Some context: our terraform config had been pretty stable for a few weeks.  After I got it set up, I hardly ever needed to touch it.  This was an explicit goal of mine.  (I have strong feelings about delegation of authority and not using your orchestration layer for configuration, but that’s for another day.)

And then one day I decided to test drive Aurora in staging, and everything exploded.

Trigger warning: rants and scary stories about computers ahead.  The first half of this is basically a post mortem, plus some tips I learned about debugging terraform outages.  The second half is about why you should care about multiple state files, how to set up and manage multiple state files, and the migration process.

First, the nightmare

This started when I made a simple tf module for Aurora, spun it up in staging, assigned a CNAME, turned on multi-AZ support for RDS, was really just fussing around with minor crap in staging, like you do.  So I can’t even positively identify which change it was that triggered it, but around 11 pm Thursday night the *entire fucking world* blew up.  Any terraform command I tried to run would just crash and dump a giant crash.log.

Terraform crashed! This is always indicative of a bug within Terraform.

So I start debugging, right?  I start backing out changes bit by bit.  I removed the Aurora module completely.  I backed out several hours’ worth of changes.  It became clear that the tfstate file must be “poisoned” in some way, so I disabled remote state storage and started performing surgery on the local tfstate file, extracting all the Aurora references and anything else I could think of, just trying to get tf plan to run without crashing.

What was especially crazy-making was the fact that I could apply *any* of the modules or resources to *any* of the environments independently.  For example:

 $ for n in modules/* ; do terraform plan -target=module.staging_$(basename $n) ; done

… would totally work!  But “terraform plan” in the top level directory took a giant dump.

I stabbed at this a bunch of different ways.  I scrolled through a bunch of 100k line crash logs and grepped around for things like “ERROR”.  I straced the terraform runs, I deconstructed tens of thousands of lines of tfstates.  I spent far too much time investigating the resources that were reporting “errors” at the end of the run, which — spoiler alert — are a total red herring.  Stuff like this —

 module.staging_vpc.aws_eip.nat_eip.0: Refreshing state... (ID: eipalloc-bd6987da)
 Error refreshing state: 34 error(s) occurred:

 * aws_s3_bucket.hound-terraform-state: unexpected EOF
 * aws_rds_cluster_instance.aurora_instances.0: unexpected EOF
 * aws_s3_bucket.hound-deploy-artifacts: unexpected EOF
 * aws_route53_record.aurora_rds: connection is shut down

It’s all totally irrelevant, disregard.

It’s like 5 am by this point, which is why I feel only slightly less mortified about the fact that @stack72 had to gently point out that all you have to do is find the “panic” in the crash log, because terraform is written in Go, so OF COURSE THERE IS A PANIC buried somewhere in all that dependency graph spew.  God, I felt like such an idiot.
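(In case it saves someone else a 5 am: the tip really is that simple.  Something like this drops you straight at the interesting part of the crash log; tune the -A context lines to taste.)

 $ grep -n -A 20 'panic:' crash.log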

The painful recovery

I spent several hours trying to figure out how to recover gracefully by isolating and removing whatever was poisoned in my state.  Unfortunately, some of the techniques I used to try and systematically validate individual components or try to narrow down the scope of the problem ended up making things much, much worse.

For example: applying individual modules (“terraform apply -target={module}”) can be extremely problematic.  I haven’t been able to get an exact repro yet, but it seems to play out something like this: if you’re applying a module that depends on other resources that get dynamically generated and passed into it, but you aren’t applying the modules that do that work in the same run, terraform will sometimes just create them again.

Like, a whole duplicate set of all your DNS zones with the same domain names but different zone ids, or a duplicate set of VPCs with the same VPC names and subnets, routes, etc., and you only find out when it gets to the point where it tries to create one of those rare resources where AWS actually enforces unique names, like autoscaling groups.

And yes, I do run tf plan.  Religiously.  But when you’re already in a bad state and you’re trying to do unhealthy things with mutated or corrupt tfstate files … shit happens.  And thus, by the time this was all over:

I ended up deleting my entire infrastructure by hand, three times.

I’m literally talking about clicking select-all-delete in the console on fifty different pages, writing grungy shell scripts to cycle through and delete every single AWS resource, purging local cache files, tracking down resources that exist but don’t show up on the console via the CLI, etc.

Of course, every time I purged and started from scratch, Route53 had to create a new zone ID for the root domain, so I had to go back and update my registrar with the new nameservers, because each new zone gets assigned a different set of nameservers.  Meanwhile our email, website, API etc were all unresolvable.

If we were in production, this would have been one of the worst outages of my career, and that’s … saying a lot.

So … Hi!  I learned a bunch of things from this.  And I am *SO GLAD* I got to learn them before we have any customers!

The lessons learned

(in order of least important to most important)

1. Beware of accidental duplicates.

It actually really pisses me off how easily you can just nonchalantly and unwarily create a duplicate VPC or Route53 zone with the same name, same delegation, same subnets, etc.  Like, why would anyone ever WANT that behavior?  I blame AWS for this, not Hashicorp, but jesus christ.

So that’s going on my “Wrapper Script TODO” list: literally check AWS for any existing VPC or Route53 zone with the same name, and bail on dupes in the plan phase.
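(Here’s a rough sketch of what that check might look like with the plain aws CLI.  The env name and domain are made up for illustration; the real check would presumably live in the wrapper and run before every plan.)

# bail out if a VPC with this Name tag already exists (illustrative; env name assumed)
existing=$(aws ec2 describe-vpcs \
  --filters "Name=tag:Name,Values=staging" \
  --query 'Vpcs[].VpcId' --output text)
if [ -n "$existing" ]; then
  echo "refusing to continue: a VPC named 'staging' already exists ($existing)" >&2
  exit 1
fi

# same idea for Route53: there should be exactly one hosted zone for the domain
aws route53 list-hosted-zones-by-name --dns-name "example.com." \
  --query 'HostedZones[].Id' --output text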

(And by the way — this outage was hardly the first time I’ve run into this.  I’ve had tf unexpectedly spin up duplicate VPCs many, many, many times.  It is *not* because I am running apply from the wrong directory; I’ve checked for any telltale stray .terraforms.  I usually don’t even notice for a while, so it’s hard to figure out what exactly I did to cause it, but it definitely seems related to applying modules or subsets of a full run.  Anyway, I am literally setting up a monitoring check for duplicate VPC names, which is ridiculous but whatever, maybe it will help me track this down.)

(Also, I really wish there was a $TF_ROOT.)

2. Tag your TF-owned resources, and have a kill script.

This is one of many great tips from @knuckolls that I am totally stealing.  Give every taggable resource that’s managed by terraform a tag like “Terraform: true”, and write some stupid boto script that will just run in a loop until it has successfully terminated everything with that tag + maybe an env tag.  (You probably want to explicitly exclude some resources, like your root DNS zone id, S3 buckets, data backups.)
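(The tag half is trivial; it looks something like this on any given resource, in 2016-era syntax, with the AMI variable invented purely for the example:)

resource "aws_instance" "bastion" {
  ami           = "${var.bastion_ami}"   # assumed variable, for illustration only
  instance_type = "t2.micro"

  tags {
    Name      = "bastion-${var.env}"
    Terraform = "true"
    env       = "${var.env}"
  }
}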

But if you get into a state where you *can’t* run terraform destroy, but you *could* spin up from a clean slate, you’re gonna want this prepped.  And I have now been in that situation at least four or five times not counting this one.  Next time it happens, I’ll be ready.
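(Sketched with the bare aws CLI instead of boto, and covering EC2 instances only, the core sweep loop is something like this; the tag values are assumed:)

# keep sweeping until everything tagged as terraform-owned in this env is gone (sketch)
while : ; do
  ids=$(aws ec2 describe-instances \
    --filters "Name=tag:Terraform,Values=true" "Name=tag:env,Values=staging" \
              "Name=instance-state-name,Values=running,stopped" \
    --query 'Reservations[].Instances[].InstanceId' --output text)
  [ -z "$ids" ] && break
  aws ec2 terminate-instances --instance-ids $ids
  sleep 30
done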

Which brings me to my last and most important point — actually the whole reason I’m even writing this blog post.  (she says sheepishly, 1000 words later.)

3. Use separate state files for each environment.

Sorry, let me try this again in a font that better reflects my feelings on the subject:

USE SEPARATE STATE FILES.

FOR EVERY ENVIRONMENT.

This is about limiting your blast radius.  Remember: this whole thing started when I made some simple changes to staging.

To STAGING.

But all my environments shared a state file, so when something bad happened to that state file they all got equally fucked.

If you can’t safely test your changes in isolation away from prod, you don’t have infrastructure as code.

Look, you all know how I feel about terraform by now.  I love composable infra, terraform is the leader in the space, I love the energy of the community and the responsiveness of the tech leads.  I am hitching my wagon to this horse.

It is still as green as the motherfucking Shire and you should assume that every change you make could destroy the world.  So your job as a responsible engineer is to add guard rails, build a clear promotion path for validating changesets into production, and limit the scope of the world it is capable of destroying.  This means separate state files.

So Monday I sat down and spent like 10 hours pounding out a version of what this could look like.  There aren’t many best practices to refer to, and I’m not claiming my practices are the bestest, I’m just saying I built this thing and it makes sense to me and I feel a lot better about my infra now.  I look forward to seeing how it breaks and what kinds of future problems it causes for me.


HOWTO: Migrating to multiple state files

Previous config layout, single state file

If you want to see some filesystem ascii art about how everything was laid out pre-Monday, here you go.

terraform-layout-old

Basically: we used s3 remote storage, in a bucket with versioning turned on.  There was a top-level .tf file for each environment {production,dev,staging}, top-level .tf files for env-independent resources {iam,s3}, and everything else in a module.

Each env.tf file would call out to modules to build its VPC, public/private subnets, IGW, NAT gateway, security groups, public/private route53 subdomains, an auto-scaling group for each service (including launch config, int or ext ELB), bastion hosts, external DNS, tag resources, and so forth.

New config layout, with state file per environment

In the new world there’s one directory per environment plus one base directory, each of which has its own remote state file.  (You source the init.sh file to initialize a new environment; after that it just works as long as you run tf commands from that directory.)  “Base” has a few resources that don’t correspond to any environment — s3 buckets, certain IAM roles and policies, the root route53 zone.
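(I’m not pasting init.sh itself here, but for a single env it boils down to something like this.  The bucket name is the one you can see in the crash log snippet above; the state key and region are assumed for illustration.)

# run (or source) this once from the env directory to point terraform at this env's own remote state
terraform remote config -backend=s3 \
  -backend-config="bucket=hound-terraform-state" \
  -backend-config="key=staging/terraform.tfstate" \
  -backend-config="region=us-east-1"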

Here’s an ascii representation of the new filesystem layout: terraform-layout-current

All env subdirectories have a symlink to ../variables.tf and ../initialize.tf.  Variables.tf declares the default variables that are shared by all environments — things like

$var.region, $var.domain, $var.tf_s3_bucket

Initialize.tf contains empty variable declarations for the variables that will be populated in each env’s .tfvars file, things like

$var.cidr, $var.public_subnets, $var.env, $var.subdomain_int_name
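(Concretely, that pair looks something like the following.  The variable names are the ones above; the values are invented purely for illustration.)

# initialize.tf (symlinked into each env directory): empty declarations only
variable "env" {}
variable "cidr" {}
variable "public_subnets" {}
variable "subdomain_int_name" {}

# staging.tfvars: the per-env values that fill them in
env                = "staging"
cidr               = "10.20.0.0/16"
public_subnets     = "10.20.0.0/24,10.20.1.0/24"
subdomain_int_name = "staging-internal"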

Other than that, each environment just invokes the same set of modules the same way they did before.

The thing that makes all this possible?  Is this little sweetheart, terraform_remote_state:

resource "terraform_remote_state" "master_state" {
  backend = "s3"
  config {
    bucket = "${var.tf_s3_bucket}"
    region = "${var.region}"
    key = "${var.master_state_file}"
  }
}

It was not at all apparent to me from the docs that you could not only store your remote state, but also query values from it.  So I can set up my root DNS zones in the base environment, and then ask for the zone identifiers in every other module after that.
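(For this to work, the base config has to export those values as outputs — something along these lines; the zone resource names here are assumed:)

output "route53_public_zone" {
  value = "${aws_route53_zone.root_public.zone_id}"
}

output "route53_internal_zone" {
  value = "${aws_route53_zone.root_internal.zone_id}"
}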

module "subdomain_dns" {
  source = "../modules/subdomain_dns"
  root_public_zone_id = "${terraform_remote_state.master_state.output.route53_public_zone}"
  root_private_zone_id = "${terraform_remote_state.master_state.output.route53_internal_zone}"
  subdomain_int = "${var.subdomain_int_name}"
  subdomain_ext = "${var.subdomain_ext_name}"
}

How. Fucking. Magic. is that.

SERIOUSLY.  This feels so much cleaner and better.  I removed hundreds of lines of code in the refactor.

(I hear Atlas doesn’t support symlinks, which is unfortunate, because I am already in love with this model.  If I couldn’t use symlinks, I would probably use a Makefile that copied the file into each env subdir and constructed the tf commands.)

Rolling it out

Switching from a single state file to multiple state files was by far the trickiest part of the refactor.  First, I built a new dev environment from scratch, just to prove that it would work.

Second, I did a “terraform destroy -target=module.staging”, then recreated it from the env_staging directory by running “./init.sh ; terraform plan -var-file=./staging.tfvars”.  Super easy, worked on the first try.

For production and base though, I decided to try doing a live migration from the shared state file to separate state files without any production impact.  This was mostly for my own entertainment and to prove that it could be done.  And it WAS doable, and I did it, but it was preeeettty delicate work and took about as long as literally everything else combined. (~5 hours?).  Soooo much careful massaging of tfstate files.

(Incidentally, if you ever have a syntax error in a 35k line JSON file and you want to find what line it’s on, I highly recommend http://jsonlint.com.  Or perhaps just reconsider your life choices.)

Stuff I still don’t like

There’s too much copypasta between environments, even with modules.  Honestly, if I could pass around maps and interpolate $var.env into every resource name, I could get rid of _so_ much code.  But Paul Hinze says that’s a bad idea that would make the graph less predictable, and he’s smarter than me, so I believe him.

TODO

There is lots left to do around safety and tooling and sanity checks and helping people not accidentally clobber things.  I haven’t bothered making it safe for multiple engineers because right now it’s just me.

This is already super long so I’m gonna wrap it up.  I still have a grab bag of tf tips and tricks and “things that took me hours/days to figure out that lots of people don’t seem to know either”, so I’ll probably try and dump that out too before I’ve moved on to the next thing.

Hope this is helpful for some of you.  Love to hear your feedback, or any other creative ways that y’all have gotten around the single-statefile hazard!

P.S. Me after the refactor was done =>

72 thoughts on “Terraform, VPC, and why you want a tfstate file per env”

  1. A few thoughts:

    – Do you have a list of things you have in your terraform wrapper? I knew I needed to validate things like environments etc., though there might be more to it than that.
    – Thoughts on different accounts to completely segregate envs vs VPCs? We’re using VPC as the isolation point, but maybe for infra-wide changes, it might make sense to completely segregate them as IAM permissions don’t always seem to go down to the infra level.
    – Maybe useful for your tfstate wrapper would be to integrate with jsonlint (locally) using something like this package: https://github.com/zaach/jsonlint

    1. Ooooo, thanks for the pointer to jsonlint. That will definitely come in handy!

      My very last blog post was actually about VPCs and environments (http://charity.wtf/2016/03/23/aws-networking-environments-and-you/) — but since then a few people have given me interesting feedback that makes me think I was a little too hard on the per-account model, and not appreciative enough of the things it offers. I’m going to try and post a followup soon with what I’ve learned.

      For most of us though, I think vpc-per-env is a really great model that offers the most isolation and predictability with the least hassle.

      Wrapper script-wise, I haven’t done much yet because it’s just me, so I can take a lot of things for granted. Eventually we will need much better guardrails. Like not allowing you to promote a module change to production until it’s been applied to staging, taking a lockfile for an env if an engineer is working on it, outputting a plan to a file and applying that, linting, sanity checks around anything fragile … have lots of ideas, but not building ahead of our actual needs at the moment. 🙂

  2. Till says:

    What scares me the most is all this automation and RDS/S3. The rest of my resources are throw away, but these two bits scare me because they contain important data. Do you have any thoughts on that yet?

    1. Yes, a few thoughts. S3 is pretty easy; it simply won’t destroy an S3 bucket unless it’s empty. This one has _never_ been a problem for me throughout all my various trials and tribulations.

      RDS … is more suspect. I’m pretty sure one of the problems I had last week was related to turning on multi-AZ support for RDS, which is pretty bullshit. The RDS/Aurora support in TF seems less tested and robust than pretty much anything else. I still haven’t decided if I’m gonna use RDS in production instead of galera or orchestrator. It’s nice to have RDS for staging and dev environments, but if I end up using RDS or Aurora in production, I will probably manage them outside of tfstate and do something like make the production RDS address a variable in my production.tfvars file (rds_primary_address = “prod-mysql.asdfsdferw32.us-east-1.rds.amazonaws.com”) and refer to it that way when creating a CNAME or managing any dependencies.

      1. Till says:

        RE: S3 — good call. I have to investigate how Sparkleformation does this.

        And yeah, as far as databases are concerned, I am leaning towards doing it either manually for now or create a separate “database stack” for RDS/possibly Aurora (per environment), so we don’t screw it up accidentally and still get some automation.

      1. Yeah, I was really excited for that. But if you set prevent_destroy on a resource and your change tries to destroy it, the whole run dies right there. 🙁 I really want a version of this that warns you in the output but continues on to complete the run. Right now it’s just a protection against accidentally deleting that resource; what if you wanted to be able to spin it up and configure it, but not manage the resource from then on unless the flag is removed?

        I imagine the reason they don’t do this boils down to “something something dependency graph”, but the alternatives are pretty sucky too. You can create the resource in the web console and use the ARN to access it forever, with no way to manage it with TF, or you just accept that every time there’s a modification you get notified and go manually clear/resolve the conflict. Much worse.

  3. Gearing up to start experimenting with tf for our base infrastructure, but definitely going to use a new account. 😛 Thought about just living in a separate AWS region but decided definitely too many base things (R53 esp) that I could screw up. Thanks for being out front and taking the slings, arrows and bullshit, and passing on the wisdomz.

  4. […] Terraform, VPC, and why you want a tfstate file per env Charity Majors gives us this awesomely detailed article about a Terraform nightmare. An innocent TF run in staging led to a merry bug-hunt down the rabbit-hole and ended in wiping out production — thankfully on a not-yet-customer-facing service. She continues on with an excellent HOWTO on fixing your Terraform config to avoid this kind of pitfall. If you can’t safely test your changes in isolation away from prod, you don’t have infrastructure as code. […]

  5. Twirrim says:

    Interesting blog post, thank you for posting this. Sorry you had to go through all the hassles.

    Couple of quick observations.
    1) python has the json module natively. Pipe your json file through python ($ cat foo.json | python -m json.tool) and you can establish validity from the exit code. It also cleans up the formatting etc. That’s native with any python install so will be there wherever you need it. The error messages are terrible, however! If you want to understand what is wrong, the simplejson library, installable via pip, gives you meaningful output (same syntax as the native json library). I had lots of stuff handled through json files at my last job, and these were invaluable for staying sane!

    2) I might be missing something but you indicate the S3 bucket your state files were in had versioning enabled. Couldn’t you revert all your state back to a pre-experiment state? (I haven’t tried to use S3 versioning in the past, not sure what the limitations are)

    1. 1) thanks for the python json tip!! someone else pointed me to the npm jsonlint which does a similar thing. Definitely need some cli tools like this on hand when debugging stupid massive json blobs, both for packer and tfstate files.

      2) yes, the s3 bucket has versioning on. But it’s not as simple as “take this old tfstate file and revert to it”, even if you can identically match the state file to the git SHA of the config files that generated it (which you should be). This has to do with ….. dependency graphs, and existing resources, and the ordering of dependencies, and how it converges against existing resources with overlapping unique names, and …. It’s Complicated. You can *definitely* revert to an older known good tfstate if you have a blank slate. But not from a partially divergent state in a complex config.

    2. Daniel Kang says:

      I’d have to put in jq here if you want to talk about json tools. Of course it can’t beat python being batteries included.

  6. Hey, thanks again for your awesome insights.

    After looking at the new ASCII art, I wonder why you don’t have a base.tf and base.tfvars like in the other environments… Where do you set the terraform_remote_state resource for the master state?

    Also, why do you symlink initialize.tf in the base environment? It looks like it has variable declarations for “real” environments (cidr, public_subnets, etc.), i.e. variables that are meant for envs that will have a VPC involved.

    thanks again!

      1. Thanks, I still don’t understand one thing.

        If you are symlinking initialize.tf -> ../initialize.tf and variables.tf -> ../variables.tf inside the base folder, and you don’t have a base/base.tfvars to fill in all the empty variables declared in initialize.tf, then when you run terraform plan from base, do you have to fill in all the variables at the prompt with values that don’t matter?

        And also, just out of curiosity, where do you initialize the “aws” provider?

    1. David says:

      Hi, thanks for all this great info! Could you help me understand the reason for declaring the environment-specific variables in initialize.tf if ultimately you will specify them in the env directory and run terraform plan/apply etc from the env directory?

  7. I have pretty much the same setup as you, but I don’t use different folders for each environment. Instead, in the variables.tf file (in the root dir), I have all the variables defined as a lookup table:
    https://gist.github.com/egarbi/2eb3406ff9eac4543d510db040d3c908
    …and in order to plan and apply the code I use a Makefile to pass the proper env value:
    https://gist.github.com/egarbi/fe4c3ebec073d9b82cb1175227c1dc3b

    **Note: I’m not using remote state in this example

  8. This was great. Thank you.

    I wonder why passing some_file.tf as a CLI argument and some form of ‘import’ directive isn’t a Terraform thing. Can you think of a reason why this would be a bad pattern?

  9. ABK says:

    Hey, interesting post. I believe there’re a couple of improvements that can be made to streamline the code and reduce duplication…

    development.tf, production.tf, and staging.tf can be renamed to main.tf, placed at the same level as variables.tf and initialize.tf, and symlinked too. Why? Because you want all your envs to be consistent, and right now it seems that development.tf, production.tf, and staging.tf contain the same code.

    This way you can make a change to main.tf and apply the plan in env-dev, for example; if you’re happy with it you can then apply the plan to env-staging and env-production respectively.

    You can also parameterize init.sh, symlink it, and use it like the above.

    1. Those are all terrific ideas, though they come with tradeoffs, like everything. 🙂

      Parameterizing init.sh etc was on my todo list for a while, until I decided I didn’t want to do the first part. The environments are not actually identical and will diverge further, like we will move off RDS for production, at some point during beta. A separate web service has been added just to staging and prod, etc. The dev environment is dramatically different than all the other VPCs — it has peering so devs can ssh to any other env but no env can accidentally connect to any other environment’s services, etc.

      Oh, and the actual kicker reason — just remembered — is that you can’t use interpolation for variable names. I played with trying to work around this for a long time. It can be a deal breaker when you’re dealing with variable names that turn into dns names or some of the few AWS resources that enforce uniqueness.

      BUT. Overall that is a terrific approach, your thinking is very much in line with mine about this problem. If I had a few more prod-like VPCs or if there was more dynamism in creating and destroying them, it would totally be worth figuring out how to work through the obstacles I ran into. Thanks for writing it all out for everyone!

      1. ABK says:

        Ahh I see…. weird how your environments have different services/infra/resources in them… As for my use case, my environments are identical…so my approach works perfectly.

        In terms of your actual kicker reason, can you expand on that please? To keep everything unique I do something like this:

        asg_name = "${concat("asg-", var.role, "-", var.project, "-", var.env, ".aws.qwickgrub.com", module.lc.lc)}"

  10. ofer says:

    Last Friday I was able to see your new terraform-layout. Today I tried again and it just redirects me to https://www.honeycomb.io/. Any chance you can upload it again? I really want to see the directory structure you built.

  11. Awesome post. We ran into similar issues and now keep the code for each VPC in a separate set of templates. It not only provides far better isolation/safety, but also makes the folder structure easier to navigate and understand, even if you’ve never seen it before.

    I wrote up some other thoughts about Terraform best practices here:

    http://stackoverflow.com/a/38749508/483528

    One of the most useful practices is a small open source tool we built called Terragrunt, which is a thin wrapper for Terraform that supports locking for .tfstate files:

    https://blog.gruntwork.io/add-automatic-remote-state-locking-and-configuration-to-terraform-with-terragrunt-656a57565a4d.

  12. Alexander.Savchuk says:

    Interesting post. We had some similar challenges, what we ended up doing:
    – use CloudFormation for base / global stack (to create and manage the S3 bucket to store TF state files and some IAM users / roles)
    – use different accounts for different environments (this is really the kind of isolation that we want – VPCs are not enough)
    – we keep test / stage / prod similar (but not same), so we share the same config files for them. We create environment-specific parameter maps like this:
    variable "instance_type" {
      type = "map"
      default = {
        test = "t2.small"
        uat  = "t2.small"
        prod = "t2.medium"
      }
    }

    and then reference this parameter while creating an actual resource:

    instance_type = "${lookup(var.instance_type, var.aws-env)}"

    We also have separate region-specific tfvar files for things that are different across regions (like VPC ids, for example).

    So, for example, to deploy an application to 2 regions, we run Terraform twice, passing the appropriate region tfvar file as a parameter (and also purging the state file between runs!).

    1. Oh, that’s *really* cool.

      The environment-specific parameter map lookup is *exactly* what we did for our autoscaling groups. 🙂 We would rather have standardized on variables passed in by the *.tfvars file for each env, but you couldn’t pass maps around then. Now you can, and I’m just waiting for a day to spend refactoring everything to use 0.7 and maps.

      I really like the idea of bootstrapping with CFN and then using TF though. That first step is basically unpossible otherwise. Very cool, thanks for sharing!!

  13. Curious if you have any VPC peering in your set up. I’m working with a very similar TF organization scheme and feel like I have a chicken and egg scenario. If you have one VPC that needs to be peered into other VPCs and there have to be routes set up on each side of the peer how do you manage that. I don’t think remote state gets you all the way there.

    1. Yes. Check my other tf blog posts, I posted an example of the peering config. We have peering between dev VPC and prod, staging, and dogfood VPCs, but not between other environments.

      1. Excellent thanks. Using individual aws_network_acl_rule and aws_route resources was the crucial piece I was missing.

      2. First thanks so very much for these TF-related posts, they are most informational. I am new to TF and find myself with a circular dependency between environments situation regarding VPC Peering. I have studied in detail your gists and realise that each environment requires a “terraform_remote_state”. More explicitly, “env-dev” needs to pull in “staging_state” to know the peer “vpc_id” and “env-staging” needs to pull in “dev_state” to know the “vpc_peering_connection_id”. If within the same environment I can see how such a dependency situation is resolved within Terraform, however, how does this work from green fields state with the two environments? What the heck am I missing?

  14. Ernest says:

    I’m still testing/learning terraform and not going into modules yet. What is main.tf used for?
    If I don’t use modules yet and have things like:
    vpc.tf, firewall.tf, app1.tf, app2.tf

    and I want to use s3 for remote state store, where would I put that configuration?

    If I simply add resource “terraform_remote_state” “remote_state” into main.tf as the only thing, then everything errors out.

    Thank you

    1. names aren’t special. main.tf is like any other tf. i recommend googling the docs for remote state, or reading in my blog posts where i posted examples. i don’t have time to do personal support, sorry.

  15. I’m working on a tool that cannot have these issues … (only others). The idea is not to have a state file. AWS supports names and tags; I’m using these as “Anchors” in resource definitions. The tool currently supports security groups, ELB, and partially Route 53 record sets. The tool is part of the standard library of my shell.

    Digest of working with the tool (VPC was already created externally):


    vpc = AwsVpc({'aws:cloudformation:stack-name': 'censored-vpc'}).expect(1) # fails if it does not find exactly one such VPC
    elb_sg = AwsSecGroup(["${ENV.ENV}-provision-elb", vpc], { ... ELB SG props here ... })
    elb_sg.converge()
    elb = AwsElb("${ENV.ENV}-provision-2", { ... ELB props here ...})
    elb.converge()

    In “ELB props here” you can specify that existing instances should be added:

    {
        ...
        'Instances': AwsInstance({'env': ENV.ENV, 'role': 'provision'}).expect()
        ...
    }

    Note that all the code above is in a programming language, not a narrow definitions DSL, so you will not end up generating your configuration.

    You are all very welcome to work with me on NGS and the AWS tool it has.
    https://github.com/ilyash/ngs

  16. Charity, thanks for sharing your pain and learning! I love how much you share with the community.

    IMO the problem you had isn’t specific to Terraform and its state files. Even with CloudFormation, putting all of your environments into one big stack is asking for trouble.

    My preferred way to deal with this is different. Rather than having a separate definition file for each environment, I like to have a file which defines one environment, and then re-use it across each environment. Basically, CD for infrastructure. Parameterize the location of the state file, so you can run it once for a test environment. If that works, run it again for staging. Then for prod. Insert more stages to taste.

    Aside from making sure you don’t bork prod, this also avoids little differences/mistakes that you might have across different environment files.

    I use a CI/CD server to manage the process. So I work on my stuff locally, run it and test it on my own sandbox environment. Then I commit to git, and “promote” the terraform files from one environment/stage to the next. The CI server sets environment variables, so I don’t have to worry about screwing that up. If I need differences between environments (e.g. bigger server pools in prod), then I put that into parameters.

    Most of the development teams I work with do this with their application code, so it’s pretty natural to do it with the infrastructure as well.

  17. Mark Casey says:

    Hi Charity,

    Thanks for sharing your results here. I’ve more or less recreated your stack using the Gists you shared as a starting point, and I’m enjoying the workflow.

    I was wondering though if the number and cost of NAT Gateways has been an issue for you, or if I’ve done something differently to make it an issue for me?

    I’m sure it wouldn’t be too hard to get the 5-per-AZ limit raised but they’re also not trivially cheap. An env running in 3 AZs adds $90 so that will add up very quickly.

    Apparently you can route an entire VPC’s NAT traffic to a single NAT Gateway, but if the AZ it’s in goes down you’re hosed for some things. The only real workaround I see right now is to use non-Elastic Public IPs for most instances (so, free) so that they can all use IGWs instead (also free), and then just lock things down with Security Groups.

    Any thoughts?

    1. sorry, my quick answer is that no, it hasn’t been an issue for me. =( am generally operating on a budget of at least thousands a month, so this isn’t a thing i’ve come up against as an issue. and personally i think the simplicity of the NAT gateway config outweighs the pain of any more complicated solution, by a long shot.

      1. Mark Casey says:

        Thanks for your reply! I’m at least glad I’m not doing something wildly different that is causing this.

        It both is and is not about cost. Big bills aren’t a shock in and of themselves though my proof-of-concept account was on the smaller side. It was more the massive percent increase of nat gateways as a cost category for the account (~$30 -> ~$1k) and its total share of this particular bill.

        I can’t argue with your last sentence at all, and it kind of solidifies to me that this isn’t a tech issue. The old env had multiple products in multiple AZs on a single natgw, so I suppose it’ll just be a matter of perception showing why that was bad and this is more correct.

        Thanks again.

  18. Hi Charity,
    Thanks for a nice article. It’s been a bit more than a year since the article was first posted, and I was wondering how your approach has worked out; have you found any drawbacks? An update post would be great! Thanks!

    1. oh man. i am pleased with the direction tf is going, and wish i could find a spare weekend to update my config to use variable interpolation etc. cross your fingers for soon!@!

  19. I am starting with Terraform. I am trying to make a tree structure just like yours. How and from where do I run the different terraform commands to manage the different modules?

    For example, if I go into the staging folder and run terraform apply, will it be able to read the different modules I have defined? And will the state file for that environment be created within that folder only?

  20. You can use the following terraform commands now:

    terraform state pull > local.tfstate (pull the remote state into “local.tfstate”)
    terraform state mv -state=local.tfstate -state-out=new.tfstate module.xxx.resource123 resource123
    … (multiple resources) …
    terraform state push local.tfstate
    cd subdir
    terraform state push ..\new.tfstate

    You need multiple entries for the different remote states to use in both your top-level and subdirectories.

    Using these commands you can safely move resources from one remote state to another.

  21. cracky mule says:

    Hi Charity. Loved this write-up. What’s in the init.sh? And how are you managing multiple credentials so you don’t mix up dev/prod and screw yourself 🙂

  22. […] I recently was setting up a couple of AWS environments for a client. This client had a typical web application which talked to an RDS database. There was DNS, a CDN and other components involved. We wanted to use Terraform to maintain traceability and replicability, and have the same configuration for production and staging, with perhaps small differences like ec2 instance size. We also wanted to separate out the components into their own Terraform workspaces to limit the blast radius (so if one component had changes that caused issues or Terraform corruption, it wouldn’t affect others). Finally, we wanted each environment to have its own Terraform backend, again to separate the environments. […]
