AWS Networking, Environments and You

Last week we hit an important milestone in the life of our baby startup: a functional production environment, with real data flowing from ingestion to storage to serving queries out!  (Of course everything promptly exploded, but whatever.)

This got me thinking about how to safely stage and promote terraform changes, in order to isolate the blast radius of certain destructive changes.  I tweeted some random thing about it,

… and two surprisingly fun things happened.  I learned some new things about AWS that I didn’t know (after YEARS of using it).  And hilariously, I learned that I and lots of other people still believe things about AWS that aren’t actually true anymore.

That's why I decided to write this.  I'm not gonna get super religious on you, but I want to recap the current set of common and/or best practices for managing environments, network segmentation, and security automation in AWS.  I also want to remind you that this post will be out of date as soon as I publish it, because things are changing crazy fast.

Multiple Environments: Why Should I Care?

An environment exists to provide resource isolation.  You might want this for security reasons, or testability, or performance, or just so you can give your developers a place to fuck around and not hurt anything.

Maybe you run a bunch of similar production environments so you can give your customers security guarantees or avoid messy performance cotenancy stuff.  Then you need a template for stamping out lots of these environments, and you need to be *really* sure they can’t leak into each other.

Or maybe you are more concerned about the vast oceans of things you need to care about beyond whatever unit tests or functional tests are running on your laptop.  Like: capacity planning, load testing, validating storage changes against production workloads, exercising failovers, etc.  For any scary change you have to make, you need a production-like env to practice in.

Bottom line: If you can’t spin up a full copy of your infra and test it, you don’t actually have “infrastructure as code”.  You just have … some code, and duct tape.

https://twitter.com/phinze/status/710227204349624321

The basics are simple:

  • Non-production environments must be walled off from production as strongly as possible.  You should NEVER be able to accidentally connect to a prod db from staging (or from one prod env to another).
  • Production and non-production environments (or all other prod envs) should share as much of the same tooling and code paths as possible.  Like, some amount asymptotically approaching 100%.  Any gaps there will inevitably, eventually bite you in the ass.

Managing Multiple Environments in AWS

There are baaaasically three patterns that people use to manage multiple environments in AWS these days:

  1. One AWS billing account and one flat network (VPC or Classic), with isolation performed by routes or security groups.
  2. Many AWS accounts with consolidated billing.  Each environment is a separate account (often maps to one acct per customer).
  3. One AWS billing account and many VPCs, where each environment ~= its own VPC.

Let’s start old school with a flat network.

1:  One Account, One Flattish Network

This is what basically everyone did before VPC.  (And ummm let’s be honest, lots of us kept it up for a while because GOD networking is such a pain.)

In EC2 Classic the best you got was security groups.  And — unlike VPC security groups — you couldn’t stack them, or change the security groups of a running instance without destroying it, and there was a crazy low hard cap on (# of secgroup rules * # of secgroups).  You could kind of gently “suggest” environments with things like DNS subdomains and config management, and sometimes you would see people literally just roll over and resort to $ENV variables.

Most people either a) gave up and just had a flat network, or b) this happened.

At Parse we did a bunch of complicated security groups plus chef environments that let us spin up staging clusters and the occasional horrible silo’d production stack for exceptional customer requirements.  Awful.

VPC has made this better, even if you're still using a flat network model.  You can now manage your route tables, stack security groups and IAM rules, and reapply them to existing nodes without destroying the node or dropping its network connections.  You can define private network subnets with NAT gateways, etc.
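
For a taste of what that buys you, here is a minimal Terraform sketch of the private-subnet-behind-a-NAT-gateway pattern.  (All of the names and CIDRs here are made up for illustration, not lifted from any real stack.)

[code language="javascript"]
// Minimal sketch, not a real stack: a VPC with a public subnet (routed
// through an internet gateway) and a private subnet (routed through a
// NAT gateway). Names and CIDRs are hypothetical.
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_internet_gateway" "gw" {
  vpc_id = "${aws_vpc.main.id}"
}

resource "aws_subnet" "public" {
  vpc_id     = "${aws_vpc.main.id}"
  cidr_block = "10.0.0.0/24"
}

resource "aws_subnet" "private" {
  vpc_id     = "${aws_vpc.main.id}"
  cidr_block = "10.0.1.0/24"
}

resource "aws_eip" "nat" {
  vpc = true
}

resource "aws_nat_gateway" "nat" {
  allocation_id = "${aws_eip.nat.id}"
  subnet_id     = "${aws_subnet.public.id}"
}

// The public subnet routes straight out the internet gateway ...
resource "aws_route_table" "public" {
  vpc_id = "${aws_vpc.main.id}"

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = "${aws_internet_gateway.gw.id}"
  }
}

resource "aws_route_table_association" "public" {
  subnet_id      = "${aws_subnet.public.id}"
  route_table_id = "${aws_route_table.public.id}"
}

// ... while the private subnet can only reach out via the NAT gateway.
resource "aws_route_table" "private" {
  vpc_id = "${aws_vpc.main.id}"

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = "${aws_nat_gateway.nat.id}"
  }
}

resource "aws_route_table_association" "private" {
  subnet_id      = "${aws_subnet.private.id}"
  route_table_id = "${aws_route_table.private.id}"
}
[/code]

Every bit of that is reviewable, versionable code you can reapply at will, which was basically unthinkable in Classic.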

https://twitter.com/sudarkoff/status/710264656367996928

Some people tell me they are using a single VPC with network ACLs to separate environments for ease of use, because you can reuse security groups across environments.  Honestly this seems a bit more like a bug than a feature to me, because a) you give up isolation and b) I don’t see how that helps you have a versioned, tested stack.

https://twitter.com/tomtheguvnor/status/710226892633333762

Ok, moving on to the opposite end of the spectrum: the crazy kids who are doing tens (or TENS OF THOUSANDS) of linked accounts.

2:  One AWS Account Per Environment

A surprising number of people adopted this model in the bad old EC2 Classic days, not because they necessarily wanted it but because they needed a stronger security model and looser resource caps.  This is why AWS released Consolidated Billing way back in 2010.

I actually learned a lot this week about the multi-account model!  Like that you can create IAM roles that span accounts, or that you can share AMIs between accounts.  This model is complicated, but there are some real benefits.  Like real, actual, hardcore security/perf isolation.  And you will run into fewer resource limits than if you jam everything into a single account, and revoking/managing IAM creds is clearer.
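
To give a flavor of the first one: a cross-account role is just a role whose trust policy names the other account, so people over there can call sts:AssumeRole and pick up temporary creds.  A rough Terraform sketch (the account ID and role name are invented for illustration):

[code language="javascript"]
// Rough sketch of a cross-account IAM role: it lives in the target
// account and trusts a second (hypothetical) account to assume it.
resource "aws_iam_role" "cross_account_admin" {
  name = "cross-account-admin"

  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111111111111:root" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
}
[/code]

Revoking that access later is about as dramatic as deleting the role.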

Security nerds love this model, but it’s not clear that …. literally anyone else does.

Some things that make me despise it without even trying it:

  • AWS billing is ALREADY a horrendous hairball coughed up by the world’s ugliest cat, I can’t even imagine trying to wrangle multiple accounts.
  • It's more expensive; you incur extra billing costs.
  • Having to explicitly list any resource you want to share between accounts just makes me want to tear my hair out strand by strand while screaming on a street corner.
  • The account creation API still has manual steps, like getting certs and keypairs.
  • You cannot make bulk changes to accounts, and AWS doesn't like you having thousands of linked accounts.  It also limits your flexibility with Reserved Instances.

Here is a pretty reasonable blog post laying out some of the benefits, though, and as you can see, there are plenty of other crazy people who like it.  Mostly security nerds.

3:  One AWS Account, One VPC Per Environment

I have saved the best for last.  I think this is the best model, and the one I am adopting.  You spin up a production VPC, a staging VPC, dev VPC, Travis-CI VPC.  EVERYBODY GETS A VPC!#@!

One of those things that everybody seems to "know" but isn't actually true is that you can't have lots of VPCs.  Yes, the default limit is 5, and many people have stories about how they couldn't get it raised, and that used to be true.  But the hard cap is now 200, not 10, so VPC awayyyyyyyy my pretties!

Here's another reason to love the VPC <-> env mapping: orchestration is finally coming to the party.  Even recently, people were still trying to make chef-metal a thing, or developing their own coordination software from scratch with Boto, or just using the console and diffing and committing to git.

Dude, stop.  We are well past the point where Terraform or CloudFormation should be your default for the bones of your infrastructure, for the things that rarely change.  And once you've done that, you're most of the way to a reusable, testable stack.

Most of the cotenancy problems that account-per-env solved are a lot less compelling to me now that VPCs exist.

VPCs are here to help you think about infrastructure like regular old code.  Lots of VPCs are approximately as easy to manage as one VPC.  Unlike lots of accounts, which are there to give you headaches and one-offs and custom scripts and pointy-clicky shit and complicated horrible things to work around.

VPCs have some caveats of their own.  Like, the biggest block you can assign to a VPC is a /16.  If you're using 4 availability zones with public subnets + NATted private subnets, that's only ~8k IPs per subnet/AZ pair.  Shrug.
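
To make that math concrete: a /16 is 65,536 addresses, and 4 AZs times {public, private} is 8 subnets, so equal /19 slices work out to 8,192 IPs apiece.  A hypothetical sketch (the AZ names and the VPC reference are invented):

[code language="javascript"]
// Hypothetical sketch: carving a /16 into eight /19 subnets,
// one public + one private per AZ, ~8k IPs apiece.
variable "vpc_cidr" {
  default = "10.0.0.0/16"
}

variable "azs" {
  default = ["us-east-1a", "us-east-1b", "us-east-1c", "us-east-1d"]
}

resource "aws_subnet" "public" {
  count             = 4
  vpc_id            = "${aws_vpc.main.id}"
  availability_zone = "${element(var.azs, count.index)}"
  cidr_block        = "${cidrsubnet(var.vpc_cidr, 3, count.index)}"
}

resource "aws_subnet" "private" {
  count             = 4
  vpc_id            = "${aws_vpc.main.id}"
  availability_zone = "${element(var.azs, count.index)}"
  cidr_block        = "${cidrsubnet(var.vpc_cidr, 3, count.index + 4)}"
}
[/code]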

You can peer security groups across different VPCs, but not across regions (yet).  Also, if you’re a terraform user, be aware that it handles VPC peering fine but doesn’t handle multiple accounts very well.
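
The peering itself is not much code either, for what it's worth.  Here's a rough sketch (VPC names and CIDRs invented) of peering two VPCs in the same account and region and routing between them:

[code language="javascript"]
// Rough sketch: peer two VPCs in the same account/region, then add a
// route in each direction. VPC names and CIDRs are hypothetical.
resource "aws_vpc_peering_connection" "a_to_b" {
  vpc_id      = "${aws_vpc.vpc_a.id}"
  peer_vpc_id = "${aws_vpc.vpc_b.id}"
  auto_accept = true
}

resource "aws_route" "a_to_b" {
  route_table_id            = "${aws_route_table.vpc_a.id}"
  destination_cidr_block    = "10.1.0.0/16"
  vpc_peering_connection_id = "${aws_vpc_peering_connection.a_to_b.id}"
}

resource "aws_route" "b_to_a" {
  route_table_id            = "${aws_route_table.vpc_b.id}"
  destination_cidr_block    = "10.0.0.0/16"
  vpc_peering_connection_id = "${aws_vpc_peering_connection.a_to_b.id}"
}
[/code]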

Lots of people seem to have had issues with security group limits per VPC, even though the limit is 500, which the docs say can be raised by request.  I'm …. not sure what to think of that.  I *feel* like if you're building a thing with > 500 sec group rules on a single VPC, you're probably doing something wrong.

Test my code you piece of shit I dare you

Here's the thing that got me excited about this from the start, though: the ability to do things like test terraform modules on a staging VPC from a branch before promoting the clean changes to master.  If you plan on doing things like running bleeding-edge software in production *cough*, you need allllll the guard rails and test coverage you can possibly scare up.  VPCs help you get this in a massive way.

Super quick example: say you're adding a NAT gateway to your staging cluster.  You would use the remote git source with your changes:

[code language="javascript"]
// staging
module "aws_vpc" {
  source = "git::ssh://git@github.com/houndsh/infra.git//terraform/modules/aws_vpc?ref=charity.adding-nat-gw"
  env    = "${var.env}"
}

And then once you've validated the change, you simply merge your branch to master and run terraform plan/apply against production:

[code language="javascript"]
// production
module "aws_vpc" {
  source = "github.com/houndsh/infra/terraform/modules/aws_vpc"
  env    = "${var.env}"
}
[/code]

And for GOD'S SAKE, USE DIFFERENT STATE FILES FOR EACH VPC / ENVIRONMENT, okayyyyyy?  But that is a different rant, not an AWS rant, so let's move along.
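
(Okay, fine, one parting sketch before we actually move along: with newer versions of Terraform you can point each environment at its own remote state, e.g. an S3 backend with a per-environment key.  The bucket name here is made up.)

[code language="javascript"]
// Hypothetical sketch, assuming a newer Terraform with the S3 backend:
// each environment gets its own state file key in the bucket.
terraform {
  backend "s3" {
    bucket = "houndsh-terraform-state"
    key    = "staging/terraform.tfstate"   // e.g. "production/terraform.tfstate" for prod
    region = "us-east-1"
  }
}
[/code]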

In Conclusion

There are legit reasons to use all three of these models, and infinite variations upon them, and your use case is not my use case, blah blah.

But moving from one VPC to multiple VPCs is really not a big deal.  I know a lot of us bear scars, but it is nothing like the horror that was trying to move from EC2 Classic to VPC.

https://twitter.com/mistofvongola/status/710239882912747524

VPC has a steeper learning curve than Classic, but it is sooooo worth it.  Every day I’m rollin up on some new toy you get with VPC (ICMP for ELBs! composable IAM host roles! new VPC NAT gateway!!!).  The rate at which they’re releasing new shit for VPC is staggering, I can barely keep up.

Alright, said I wasn’t gonna get religious but I totally lied.

VPC is the future and it is awesome, and unless you have some VERY SPECIFIC AND CONVINCING reasons to do otherwise, you should be spinning up a VPC per environment with orchestration and prob doing it from CI on every code commit, almost like it’s just like, you know, code.

That you need to test.

Cause it is.

(thank you to everybody who chatted with me and taught me things and gave awesome feedback!!  knuckle tatts credit to @mosheroperandi)

 

Comments

  1. d.jones@icloud.com says:

    You could also use consolidated billing for multiple environments. This can help with isolation for obvious reasons. Just an idea.

  2. Absolutely awesome article. I’ve been going through the “which way is better” mind game for a couple of months, now, and this article really helped provide clarity.

  3. Like I said on twitter, AMAZING write up, love the writing style and I'm still laughing my ass off after reading "Test my code you piece of shit I dare you" hahaha.

    The only thing I would like to say is that there is no silver bullet indeed, and each use case/company will likely have a custom one out of these 3. One thing that also gets a lot of people (I've seen this in my last few years) is that other limits are often overlooked (instance limits, service component/feature X); for example, a gaming company will likely have an account per game to avoid one famous game ending up stopping another's growth, and so on.

    Another trick is also use Cloudfront and/or Direct Connect as Data transfer is cheaper in some use cases.

    You’ve just got yourself a new follower, Charity – Keep it up with the writing as it definitely cheers people up.

    1. “Another trick is also use Cloudfront and/or Direct Connect as Data transfer is cheaper in some use cases.”

      For anyone who uses Edgecast as a CDN, they have a service called Origin Shield, which has saved my company a bit of money. Anything that gets pulled from origin gets populated throughout CDN for $0.02/GB, instead of the $0.07 pulling directly from origin would cost. It was a hassle to get it going, but def a money saver.

    2. Thank you!! Yessss, there are so many other tips and tricks we can cover — I learned a lot just *today* about more of the benefits of multi-account models. Will post a followup if I can find time. 🙂

  4. I don't get why you would want to use different state files for environments. I would rather have one single source of truth that encompasses all my workloads. Ideally, you want to have a terraform module that you can invoke every time you want a new environment; this way it's easily repeatable.

    1. i totally get where you’re coming from, and had a single state file for all my environments until a couple of innocent changes (adding aurora, making RDS multi-az) literally poisoned my state file and all terraform commands just crashed. I tried doing a bunch of manual surgery on state files to get it back into working order, and ended up having to manually destroy *my entire aws infra* and recreate it once i isolated the bug. Would have been like a 16 hour downtime if we had been live.

      So now, I’m hell bent on separate state files so I can stage tests and promote good changes with confidence. However, this is a 100% completely different question than “should you have reusable modules which you should use across all environments”, to which the answer is “yes” and “has nothing to do with separate state files”. 🙂

  5. xpaul says:

    > You can peer security groups across different VPCs, but not across regions (yet). Also, if you’re a terraform user, be aware that it handles VPC peering fine but doesn’t handle multiple accounts very well.

    I've actually been quite successful using multiple accounts/regions with terraform recently; not sure which version you tried out, but in 0.6.13 and the couple of releases immediately prior, this pattern was immensely helpful:

    provider "aws" {
      alias   = "main-ue1"
      profile = "mainaccount"
      region  = "us-east-1"
    }

    provider "aws" {
      alias   = "team-ue1"
      profile = "teamaccount"
      region  = "us-east-1"
    }

    ...

    resource "aws_resource_thingy" "thingy_teamacct" {
      provider = "aws.team-ue1"
    }

    resource "aws_other_thingy" "grand_main_poobah" {
      provider = "aws.main-ue1"
    }

    In particular, this has been handy for accessing outputs from remote state in an s3 bucket in one account/region to provide inputs to modules used to create resources like security groups in another account, and route53 zones in one or more somewhat arbitrary accounts (depending on who registered the zone and when).

    1. that’s a good point! i just started doing more stuff with remote resources and it’s been incredibly useful. another example of info that everybody seems to “know” but is actually out of date. 🙂

  6. Hi Charity,

    I totally appreciate the article and in general agree with most bits. However I would like to add another scenario where multi-account is a huge win: any sort of regulated or compliance-heavy industry. Those types of industries often need such strict oversight that it is a huge win when I can say "look, Prod for XYZ product LITERALLY only has 3 people in the world that can access it by any means".

    Also, something should be said about the complexity of managing multiple environments/VPCs with IAM. This is, unfortunately, far more of a chore than it should be. I long for them to release the concept of environments so we can more easily restrict users without getting into the hell of IAM policy crafting, where you can end up with policies so long that you think you're writing cloudformation…

    Otherwise, props! People often overlook some of this and we definitely need to get more word out there about the options.

    1. Totally, totally get where you’re coming from on the compliance bits. A few people have pointed out stuff I missed or glossed over about the multi-account option. I will try to find the time to aggregate their feedback and post an update at some point, so the multi-account model gets a fairer representation than I really gave it. 🙂

  7. Awesome summary of the options available at the moment. Currently implementing #2 due to compliance (HIPAA) as well as dealing with 2 different legal/paying entities. The cross account IAM is initially a pain, but now that it is sorted (with Ansible), I have a single yml file with each user and a list of which groups they may belong to. This simplifies the security management immensely.

    I had a look at Terraform, but ended up going with Ansible as it didn’t feel like you had quite enough control for multi-account setups – with the STS calls, you can easily impersonate different accounts and perform actions on them. This is extremely useful when dealing with the cross-account permissions. The previous comment from Arlington touches on this – to try and manage IAM users granularly in a single account is near impossible.

    It feels like the tooling around managing and testing multiple VPCs / accounts is still very lacking; I ended up extending Ansible modules as they didn't have the ability to set the cross-account trust policy in 2.0.1.

    Looking forward to follow up with more multi-account love 🙂

  8. Great article and totally agree with the approach, which is how we do it as well

    Few comments:
    1. Another advantage of VPC is that AWS gives you enough great security/networking tools that you no longer have to do security in AWS old school with expensive firewall appliances. ACLs, SGs, route tables, NAT gateways & WAF cover 90% of what firewall appliances do at a fraction of the cost. Plus you can code this easily & deploy it with all your VPC environments far more cheaply & easily than w/firewall appliances. The only thing missing for now is easy VPN, & if you aren't averse to a proprietary solution, companies like Ocedo have a great product that makes VPN peering a piece of cake & is also programmable. Plus if you do things properly you should never have to ssh into your customer environments anyway, so VPN may be overkill except in very specific use cases.
    2. You haven't mentioned logging, another area where AWS is upping the ante. If you use ELK and want a solution that allows you to consolidate it all across all your multiple VPC environments, LogZ.io offers a good solution, but you can also roll your own w/just open source ELK & AWS Cloudformation + logs-in-S3 capabilities.
    3. One thing to be sure of is that you totally lock down your VPCs w/ACLs. I ran into a NATing issue following AWS doco & they told me it's a bug in their ACL doco that most people don't run into because they leave all Outbound ports open – a scary thought!!

    Be sure to restrict Outbound on subnets to just the services you need going out to the world (usually just 80/443) and add an outbound rule for the NAT subnet w/ the ephemeral range. Plus a similar inbound rule on your private subnets to allow return traffic from the NAT gateway. While yum works fine w/o these rules since it goes directly to AWS rpm repos, you can't add other repos w/o these rules if you lock it down right.

    Also, if you use a Bastion host for ssh, be sure to add accept rules for the NAT ephemeral ports for both inbound and outbound in the Bastion host subnet. The IGW uses the same port range as the NAT gateway and if you use the standard Linux range things might get randomly blocked.
    Just as an aside, we put the NAT gateway and Bastion in separate public subnets for added security. If you have other public-facing servers, put those in separate public subnets as well.

  9. FWIW we use a hybrid of #2 and #3 where we have all non-production environments in one AWS account (each with their own VPC) and production in a different AWS account. Best of both worlds or just extra pain? You decide. It’s been working for us for over a year without too many issues, though.

  10. My Reserved Instances (RIs) pattern for multiple accounts:
    – Use the new regional RIs, rather than the ones that reserve capacity in an Availability Zone. I get the billing discount without the hassle of guessing where to put them
    – Add RIs to the top paying account – these RIs count against all accounts underneath it. It's useful for when I can't foresee usage, like in multiple dev accounts.
    – Add Business-level support to the paying account, so that I can get recommendations on what RIs I need
    – Use the RI billing report to see how I’m doing
    – Use RIs in specific accounts only where I know I’ll have static infra

    It’s not a perfect system, but it gets me by pretty well. Where I work, the multiple account pattern works well for us, though we don’t do a lot of locking down usage on accounts because of hassle and frequent need for expanded privileges.

  11. Just wanted to say a big thank you for this article, and every other post of yours I’ve read. I’m teaching an AWS architecture uni subject, currently grading papers, and by coincidence one of my (top performing) students referenced this post.

    If it’s alright with you, I’d love to put this on our reading list for next session.

    1. Very much alright! Happy to. 🙂 Btw I'm going to be in australia for two weeks in November … melbourne, perth, sydney, all over. Lots of meetups and stuff. Maybe I'll run into you. 🙂
