The Accidental DBA

This morning there was yet another comment thread on hacker news about Yet Another outage involving MongoDB and data loss, this time by some company called “CleverTap”.

Recap

To summarize: the CleverTap engineering team noticed that the WiredTiger storage engine was faster than MMAPv1 for MongoDB.  They decided to … “upgrade the following weekend” (that sentence alone made my eyes bulge).

According to the blog post, they upgraded from 2.6 to 3.0, while simultaneously changing storage engines from MMAPv1 to WiredTiger, while leaving zero secondaries or snapshot nodes with data on MMAPv1.  All over the course of 3 days.

(They are also running sharded mongo, with a mere 300 ops/sec on each primary, which RAISES A LOT OF QUESTIONS but I already feel like I’m beating up on these kids so I won’t pursue that.)

Questions …

(But seriously what the *hell* can you be doing to have such a low request rate, that you need to shard at an infinitesimal volume?  Why did you specify it in req/min instead of req/sec?  What is the breakdown of reads/writes?  What is the lock percentage?  What is the avg object size??  Are these like multi-MB documents????  Why did you pause all incoming traffic and process it after the upgrade?  If the primary can’t take the extra load, why not rs.syncFrom() a secondary?   If that doesn’t work, don’t you have other, bigger problems??)

Most bafflingly of all: why wait only a few minutes after electing a new WiredTiger primary for the first time ever, and then immediately DELETE your only known-good copies of the data on MMAPv1 and re-sync over them with WiredTiger?

Accidental DBAs

Okay.  So here’s the thing: you are clearly a team of accidental DBAs.  You are operations and software engineers who have found yourselves in charge of the data.

It’s cool.  I am too!  It’s a really neat and fun place to be in.  DBAs and network admins are kind of the last remaining priesthoods in our industry.

There’s a lot of powerful and fun stuff to be done for generalists who pick up specialty knowledge in one of those areas, or specialists (like my neteng friend Leslie) who start bringing their skills back to the generalist side and merging the two.

(Oh Right, We Wrote A Book About This!!!)

My friend Laine and I are writing a book for people on the data side, called “Database Reliability Engineering“, which is aimed at generalist engineers who want to learn how to deal with data responsibly and effectively.

(Actually that’s a good point, I am supposed to be pitching this book! — which is really mostly Laine with a smidgen of me but it’s going to be super awesome.  Consider this your sales pitch.)

So first, as an accidental DBA, you should obviously buy this book  :).  Second: stateful services require a different mindset[*].  It’s cool that you are running your own databases!  But reading post mortems like this where the conclusion is “MongoDB sucks” makes me fucking grind my teeth.

Stop treating your databases like stateless services.

There are lots of ways that MongoDB (and every other database on the planet) really sucks.  Mongo set themselves up for special rage by overpromising too much early on, and seeming tone deaf to criticism from real database engineers.

But *I* can criticize Mongo all day long.  You children on hacker news who have never run it don’t get to.😛  If you don’t know what the fuck you’re talking about, if you’re cargo culting other people’s years-old complaints, just shut up already.

Managing stateful services like databases means that you need to be more paranoid than you were with stateless services.  With stateless services the best practices are to roll early, roll fast, roll often, roll back.  When you’re dealing with state, you need to be careful.

With stateful services you can’t play it fast and loose like that.  You’re going to have data loss, corruption, unpredictable results, catastrophic failures that you can’t simply roll back from.  Data loss can be ruinous to your company.  (This can also be true for stateless services that sit close to your data and mutate it a lot.)

But that’s what makes it fun.  :)

Be paranoid.

When we were moving from MMAPv1 to RocksDB at Parse, we ran hybrid replica sets for 6-9 months.  We were paranoid.  It was justified!  We spent half a year capturing production workloads and replaying them, electing Rocks primaries and rolling back, and even then keeping snapshots and secondaries of both storage engines for *months*.

This isn’t because MongoDB sucks.  It’s the nature of the game, it’s the difference between stateful and stateless services.

Do you know that there was a total query engine rewrite in 2.6?  We spent months flushing out tons of crazy bugs.  Do you know about the index intersection changes?  We helped chase down bugs in those too.  (You’re welcome.)

You can’t just go “dudes it’s faster” and jump off a cliff.  This shit is basic.  Test real production workloads. Have a rollback plan.  (Not for *10 days* … try a month or two.)

Lessons

If CleverTap had run their plan past anyone experienced with data, that person would have called out all of these completely predictable failures, and advised them to change the plan:

  1. Make one change at a time.  Do a major version upgrade separately from the storage engine upgrade.
  2. Delay between each change.  Two weeks is absolutely minimal, anything less is careless.  Let them bake.
  3. Storage engine changes are scary.  It takes years to gain confidence in a new way of laying bits down on disk.  (Whenever people bitch and moan about mongo, I remind them that I’ve still lost WAY more data to MyISAM, InnoDB, and MySQL overall than Mongo.)
  4. You can run lots and lots of replicas per replica set in Mongo (up to 7 voting members, and even more non-voting nodes).  This is a killer feature.  Why didn’t you use it?
  5. Keep backups around for months in the new storage engine *and* the old storage engine, just in case.  Have two hidden snapshot nodes.  The only cost is in dollars, which is fucking cheap compared to data or engineering time.

If you are a new accidental DBA, you have to make a point of learning things.  Go to conferences.  Read books.  Buy bottles of whiskey for your data friends and pick their brains.  Remember that they know things you do not.  Don’t blame the vendors when you fucked up.

Network engineering is the same way, but mistakes tend to be a lot less … permanent.  You drop some packets … like grains of sand. ^_^

Remember that you’re in charge of keeping people’s data safe and secure.  You have much to learn.  Learn it.

And get off my fucking lawn.  <3

Some slides from a couple of relevant talks I’ve given on the subject:

 

[*] P.S.:  “Stop treating your stateful services like stateless services” … this is a fact, but it’s not the aspiration.  DB folks should all be leaning in to the model of learning to treat our stateful services like stateless services, with the same casual disregard for individual nodes.  This is hard, and it’s going to take some time, but it’s clearly where the world is heading and it’s definitely a good thing.  :)  The learning goes both ways!

 



DevOps vs SRE: delayed coverage of the dumbest war

Last week was the West Coast Velocity conference.  I had a terrific time — I think it’s the best Velocity I’ve been to yet.  I also slipped in quite late, the evening before last, to catch Gareth’s session on DevOps vs SRE.

I had to catch it, because Gareth Rushgrove (of DevOps Weekly glory) was taunting @lusis and me about it on the Internet.

And it was worth it!   Holy crap, this was such a fun barnburner of a talk, with Gareth schizophrenically arguing both for and against the key premise of the talk, which was about “Google Infrastructure for Everyone Else (GIFEE)” and whether SRE is a) the highest, noblest goal that we should all aspire towards, or b) mostly irrelevant to anyone outside the Google confines.

Which Gareth won?  Check out the slides and judge for yourself.  🙃

 

At some point in his talk, though, Gareth tossed out something like “Charity probably already has a blog post on this drafted up somewhere.”  And I suddenly remembered, “Fuck!  I DO!”  It’s been sitting in my Drafts for months, god dammit.

So this is actually a thing I dashed off back in April, after CraftConf.  Somebody asked me for my opinion on the internet — always a dangerous proposition — and I went off on a bit of a rant about the differences and similarities between DevOps and SRE, as philosophies and practices.

Time passed and I forgot about it, and then decided it was too stale.  I mean who really wants to read a rehash of someone’s tweetstorm from two months ago?

Well Gareth, apparently.

Anyway: enjoy.


SRE vs DevOps: TWO PHILOSOPHIES ENTER, BOTH ARE PHENOMENALLY SUCCESSFUL AND MUTUALLY DUBIOUS OF ONE ANOTHER


 

So in case you haven’t noticed, Google recently published a book about Site Reliability Engineering: How Google Runs Production Systems.  It contains some really terrific wisdom on how to scale both systems and orgs.  It contains chapters written by dear friends of mine.  It’s a great book, and you should buy it and read it!

It also has some really fucking obnoxious blurbs.  Things about how “ONLY GOOGLE COULD HAVE DONE THIS”, and a whiff of snobbery throughout the book as though they actually believe this (which is far worse if true).

You can’t really blame the poor blurb’ers, but you can certainly look askance at a massive systems engineering org when it seems as though they’ve never heard of DevOps, or considered how it relates to SRE practices, and may even be completely unaware of what the rest of the industry has been up to for the past 10-plus years.  It’s just a little weird.

So here, for the record, is what I said about it.

 

[image: the Google SRE Bible]

Google is a great company with lots of terrific engineers, but you can only say they are THE BEST at what they do if you’re defining what they do tautologically, i.e. “they are the best at making Google run.”  Etsyans are THE BEST at running Etsy, Chefs are THE BEST at building Chef, because … that’s what they do with their lives.

Context is everything here.  People who are THE BEST at Googling often flail and flame out in early startups, and vice versa.  People who are THE BEST at early-stage startup engineering are rarely as happy or impactful at large, lumbering, more bureaucratic companies like Google.  People who can operate equally well and be equally happy at startups and behemoths are fairly rare.

And large companies tend to get snobby and forget this.  They stop hiring for unique strengths and start hiring for lack of weaknesses or “Excellence in Whiteboard Coding Techniques,” and congratulate themselves a lot about being The Best.  This becomes harmful when it translates into less innovation, abysmal diversity numbers, and a slow but inexorable drift into dinosaurdom.

Everybody thinks their problems are hard, but to a seasoned engineer, most startup problems are not technically all that hard.  They’re tedious, and they are infinite, but anyone can figure this shit out.  The hard stuff is the rest of it: feverish pace, the need to reevaluate and reprioritize and reorient constantly, the total responsibility, the terror and uncertainty of trying to find product/market fit and perform ten jobs at once and personally deliver on your promises to your customers.

At a large company, most of the hardest problems are bureaucratic.  You have to come to terms with being a very tiny cog in a very large wheel where the org has a huge vested interest in literally making you as replicable and replaceable as possible.  The pace is excruciatingly slow if you’re used to a startup.  The autonomy is … well, did I mention the politics?  If you want autonomy, you have to master the politics.

 

So who is responsible for operational outcomes?  Everyone.  Operational excellence is everyone’s job.  Dude, if you have a candidate come in and they’re a jerk to your office manager or your cleaning person, don’t fucking hire that person, because having jerks on your team is an operational risk (not to mention, you know, like moral issues and stuff).

But the more engineering-focused your role is, the more direct your impact will be on operational outcomes.

As a software engineer, developing strong ops chops makes you powerful.  It makes you better at debugging and instrumentation, building resiliency and observability into your own systems and interdependent systems, and building systems that other people can come along and understand and maintain long after you’re gone.

As an operations engineer, those skills are already your bread and butter.  You can increase your power in other ways, like by leveling up at software engineering skills like test coverage and automation, or DBA stuff like query optimization and storage engine internals, or by helping the other teams around you level up on their skills (communication and persuasion are chronically underrecognized as core operations engineering skills).

This doesn’t mean that everyone can or should be able to do everything.  (I can’t even SAY the words “full stack engineer” without rolling my eyes.)  Generalists are awesome!  But past a certain inflection point, specialization is the only way an org can scale.

It’s the only way you make room for those engineering archetypes who only want to dive deep, or who really really love refactoring, or who will save the world then disappear for weeks.  Those engineers can be incredibly valuable as part of a team … but they are most valuable in a large org where you have enough generalists to keep the oars rowing along in the meantime.

So, back to Google.  They’ve done, ahem, rather  well for themselves.  Made shitbuckets of money, pushed the boundaries of tech, service hardly ever goes down.  They have operational demands that most of us never have seen and never will, and their engineers are definitely to be applauded for doing a lot of hard technical and cultural labor to get there.

So why did this SRE book ruffle a few feathers?

Mostly because it comes off a little tone deaf in places.  I’m not personally pissed off by the google SRE book, actually, just a little bemused at how legitimately unaware they seem to be about … anything else that the industry has been doing over the past 10 years, in terms of cultural transformation, turning sysadmins into better engineers, sharing on-call rotations, developing processes around empathy and cross-functionality, engineering best practices, etc.

[image: DevOps for the rest of us]

If you try and just apply Google SRE principles to your own org according to their prescriptive model, you’re gonna be in for a really, really bad time.

However, it happens that Jen Davis and Katherine Daniels just published a book called Effective DevOps, which covers a lot of the same ground with a much more varied and inclusive approach.  And one of the things they return to over and over again is the power of context, and how one-size-fits-all solutions simply don’t exist, just like unisex OSFA t-shirts are a dirty fucking lie.

Google insularity is … a thing.  On the one hand it’s great that they’re opening up a bit!  On the other hand it’s a little bit like when somebody barges onto a mailing list and starts spouting without skimming any of the archives.  And don’t even get me started on what happens when you hire longterm ex-Googlers back into the real world.

So, so many of us have had this experience of hiring ex-Googlers who automatically assume that the way Google does a thing is CORRECT, not just contextually appropriate.   Not just right for Google, but right for everyone, always.  Which is just obviously untrue.  But the reassimilation process can be quite long and exhausting when the Kool-Aid is so strong.

Because yeah, this is a conversation and a transformation that the industry has been having for a long time now.  Compared with the SRE manifesto, the DevOps philosophy is much more crowd-sourced, more flexible, and adaptable to organizations at all stages of development, with all different requirements and key business differentiators, because it’s benefited from loud, mouthy contributors who weren’t all working in the same bubble all along.

And it’s like Google isn’t even aware this was happening, which is weird.

Orrrrrr, maybe I’m just a wee bit annoyed that I’ve been drawn into this position of having to defend “DevOps”, after many excellent years spent being grumpy about the word and the 10000010101 ways it is used and abused.

(Tell me again about your “DevOps Engineering Team”, I dare you.)

P.S. I highly encourage you to go read the epic hours-long rant by @matthiasr that kicked off the whole thing.  Some of which I definitely endorse and some of which not, but I think we could go drink whiskey and yell about this for a week or two easy breezy  <3

Anyway what the fuck do I know, I’ve never worked in the Google lair, so maybe I am just under-equipped to grasp the true glory, majesty and superiority of their achievements over us all.

Or maybe they should go read Katherine and Jen’s book and interact with the “UnGoogled” once in a while.  ☺️


 


Operational Best Practices #serverless

This post is part two of my recap of last week’s terrific Serverless conference.  If you feel like getting bitchy with me about what serverless means or #NoOps or whatever, please refer back to the prequel post, where I talked about operations engineering in the modern world.

*Then* you can get bitchy with me.  (xoxoxxooxo)

The title of my talk was:

[image: title slide]

The theme of my talk was basically: what should software engineers know and care about when it comes to operations in a world where we are outsourcing more and more core functionality?

If you care about running a quality service or product, or providing your customers with a reasonable level of service, you have to care about operational concerns like design, resiliency, instrumentation and debuggability.  No matter how many abstractions there are between you and the bare metal.

If you chose a provider, you do not get to just point your finger at them in the post mortem and say it’s their fault.  You chose them, it’s on you.  It’s tacky to blame the software or the service, and besides your customers don’t give a shit whose “fault” it is.

So given an infinite number of things to care about, where do you start?

What is your mission, and what are your differentiators?

The first question must always be: what is your mission?  Your mission is not writing software.  Your mission is delivering whatever it is your customers are paying you for, and you use software to get there.  (Code is kind of a liability so you should write as little of it as necessary.  hey!! sounds like a good argument for #serverless!)

Second: what are your core differentiators?  What are the things you are doing that are unique and difficult to replicate, the things where you actually have to be world-class experts?

Those are the things that you will have the hardest time outsourcing, or that you should think about very carefully before outsourcing.


Facts

You can outsource labor, but you can’t outsource caring.  And nobody but you is in the position to think about your core differentiators and your product in a holistic way.

If you’re a typical early startup, you’re probably using somewhere between 5 and 20 SaaS products to get rid of some of the crap work and offload it to dedicated teams who can do it better than you can, much more cheaply, so you are freed up to work on your core value proposition.

GOOD.

But you still have to think about things like reliability, your security model, your persistent storage models, your query performance, how all these lovely services talk to each other, how you’re going to debug them, how you’re going to repro when things go wrong, etc.  You still own these things, even if you don’t run them.

For example, take AWS Lambda.  It’s a pretty great service on many dimensions.  It’s an early version of the future.  It is also INCREDIBLY irritating and challenging to debug in a practically infinite number of insanity-inducing ways.

** Important side note — I’m talking about actual production systems.  Parse, Heroku, Lambda, etc are GREAT for prototyping and can take you a long, long way.  Early stage startups SHOULD optimize for agility and rapid developer iteration, not reliability.  Thx to @joeemison for reminding me that I left that out of the recap.


Focus on the critical path

Your users don’t care if your internal jenkins builds are broken.  They don’t care about a whole lot of things that you have to care about … eventually.  They do care a lot if your product isn’t actually functional.  Which means you have to think through the behavioral and failure characteristics of the providers you’re relying on in any user visible fashion.

Ask lots of questions if you can.  (AWS often won’t tell you much, but smaller providers will.)  Find out as much as you can about their cotenancy model (shared hardware or isolation?), their typical performance variance (run your own tests, don’t trust their claims), and the underlying storage systems.

Think about how you can bake in resiliency from the user’s perspective, that doesn’t rely on provider guarantees.  If you’re on mobile, can you provide a reasonable offline experience?  Like Parse did a lot of magic here in the APIs, where it would back off and retry saves if there were any errors.

Can you fail over to another provider if one is down?  Is it even worth it at your company’s stage of maturity and engineering resources to invest in this?

How willing are you to be locked into a vendor or provider, and what is the story if you find yourself forced to switch?  Or if that service goes away, as so many, many, many of them have done and will do.  (RIP, parse.com.)


Tradeoffs

Listen, outsourcing is awesome.  I do it as much as I can.  I’m literally helping build a service that provides outsourced metrics, I believe in this version of the future!  It’s basically the latest iteration of capitalism in a nutshell: increased complexity –> increased specialization –> you pay other people to do the job better than you –> everybody wins.

But there are tradeoffs, so let’s be real.

The service, if it is smart, will put strong constraints on how you are able to use it, so they are more likely to deliver on their reliability goals.  When users have flexibility and options it creates chaos and unreliability.  If the platform has to choose between your happiness vs thousands of other customers’ happiness, they will choose the many over the one every time — as they should.

Limits may mysteriously change or be invented as they are discovered, esp with fledgling services.  You may be desperate for a particular feature, but you can’t build it.  (This is why I went for Kafka over Kinesis.)

You need to think way more carefully and more deeply about visibility and introspection up front than you would if you were running your own services, because you have no ability to log in and use strace or gdb or tail a logfile or run any system profiling commands when things go dark.

In the best case, you’re giving up some control and quality in exchange for experts doing the work better than you could for cheaper (e.g. i’m never running a fucking physical data center again, jesus.  EC24lyfe).  In a common worse case, it’s less reliable than what you would build AND it’s also opaque AND you can’t tell if it’s down for you or for everyone because frankly it’s just massively harder to build a service that works for thousands/millions of use cases than for any one of them individually.


Stateful services

Ohhhh and let’s just briefly talk about state.

The serverless utopia mostly ignores the problems of stateful services.  If pressed they will usually say DynamoDB, or Firebase, or RDS or Aurora or something.

This is a big, huge, deep, wide lake of crap to wade into, so all I’m going to say is that there is no such thing as having the luxury of not having to understand how your storage systems work.  Queries will get slow, and you’ll need to be able to figure out why and fix them.  You’ll hit scaling cliffs where suddenly a perfectly-usable app just starts timing everything out because of that extra second of latency coming from …

¯\_(ツ)_/¯

The hardware underlying your instance will degrade (there’s a server somewhere under all those abstractions, don’t forget).  The provider will have mysterious failures.  They will be better than you, probably, but less inclined to give you satisfactory progress updates because there are hundreds or thousands or millions of you all clamoring.

The more you understand about your storage system (and the more you stay in the lane of how it was intended to be used), the happier you’ll be.


In conclusion

These trends are both inevitable and, for the most part, very good news for everyone.

Operations engineering is becoming a more fascinating and specialized skill set.  The best engineers are flocking to solve category problems — instead of building the same system at company after company, they are building SaaS solutions to solve it for the internet at large.  Just look at the massive explosion in operational software offerings over the past 5-6 years.

This means that the era of the in-house dedicated ops team, which serves as an absorbent buffer for all the pain of software development, is mostly on its way out the door.  (And good riddance.)

People are waking up to the fact that software quality improves when feedback loops are tighter for software engineers, which means being on call and owning services end to end.  The center of gravity is shifting towards engineering teams owning the services they built.

This is awesome!  You get to rent engineers from Google, AWS, Pagerduty, Pingdom, Heroku, etc for much cheaper than if you hired them in-house — if you could even get them, which you probably can’t because talent is scarce.


But the flip side of this is that application engineers need to get better at thinking in traditionally operations-oriented ways about reliability, architecture, instrumentation, visibility, security, and storage.  Figure out what your core differentiators are, and own the shit out of those.

Nobody but you can care about your mission as much as you can.  Own it, do it.  Have fun.

 


WTF is operations? #serverless

I just got back from the very first ever @serverlessconf in NYC.  I have a soft spot for well-curated single-track conferences, and the organizers did an incredible job.  Major kudos to @iamstan and team for pulling together such a high-caliber mix of attendees as well as presenters.

I’m really honored that they asked me to speak.  And I had a lot of fun delivering my talk!  But in all honesty, I turned it down a few times — and then agreed, and then backed out, and then agreed again at the very last moment.  I just had this feeling like the attendees weren’t going to want to hear what I was gonna say, or like we weren’t gonna be speaking the same language.

Which … turned out to be mmmmostly untrue.  To the organizers’ credit, when I expressed this concern to them, they vigorously argued that they wanted me to talk *because* they wanted a heavy dose of real talk in the mix along with all the airy fairy tales of magic and success.

 

So #serverless is the new cloud or whatever

Hi, I’m grouchy and I work with operations and data and backend stuff.  I spent 3.5 years helping Parse grow from a handful of apps to over a million.  Literally building serverless before it was cool TYVM.

So when I see kids saying “the future is serverless!” and “#NoOps!” I’m like okay, that’s cute.  I’ve lived the other side of this fairytale.  I’ve seen what happens when application developers think they don’t have to care about the skills associated with operations engineering.  When they forget that no matter how pretty the abstractions are, you’re still dealing with dusty old concepts like “persistent state” and “queries” and “unavailability” and so forth, or when they literally just think they can throw money at a service to make it go faster because that’s totally how services work.

I’m going to split this up into two posts.  I’ll write up a recap of my talk in a sec, but first let’s get some things straight.  Like words.  Like operations.

What is operations?

Let’s talk about what “operations” actually means, in the year 2016, assuming a reasonably high-functioning engineering environment.


At a macro level, operational excellence is not a role, it’s an emergent property.  It is how you get shit done.

Operations is the sum of all of the skills, knowledge and values that your company has built up around the practice of shipping and maintaining quality systems and software.  It’s your implicit values as well as your explicit values, habits, tribal knowledge, reward systems.  Everybody from tech support to product people to CEO participates in your operational outcomes, even though some roles are obviously more specialized than others.

Saying you have an ops team who is solely responsible for reliability is about as silly as saying that “HR defines and owns our company culture!”  No.  Managers and HR professionals may have particular skills and responsibilities, but culture is an emergent property and everyone contributes (and it only takes a couple bad actors to spoil the bushel).

Thinking about operational quality in terms of “a thing some other team is responsible for” is just generally not associated with great outcomes.  It leads to software engineers who are less proficient or connected to their outcomes, ops teams who get burned out, and an overall lower quality of software and services that get shipped to customers.

These are the specialized skill sets that I associate with really good operations engineers.  Do these look optional to you?

[slide: operations engineering skill sets]

It depends on your mission, but usually these are not particularly optional.  If you have customers, you need to care about these things.  Whether you have a dedicated ops team or not.  And you need to care about the tax it imposes on your humans too, especially when it comes to the cognitive overhead of complex systems.

So this is my definition of operations.  It doesn’t have to be your definition.  But I think it is a valuable framework for helping us reason about shipping quality software and healthy teams.  Especially given the often invisible nature of operations labor when it’s done really well.  It’s so much easier to notice and reward shipping shiny features than “something didn’t break”.

The inglorious past

Don’t get me wrong — I understand why “operations” has fallen out of favor in a lot of crowds.  I get why Google came up with “SRE” to draw a line between what they needed and what the average “sysadmin” was doing 10+ years ago.

Ops culture has a number of well-known and well-documented pathologies: hero/martyr complexes, risk aversion, burnout, etc.  I understand why this is offputting and we need to fix it.

Also, historically speaking, ops has attracted a greater proportion of nontraditional oddballs who just love debugging and building things — fewer Stanford CS PhDs, more tinkerers and liberal arts majors and college dropouts (hi).  And so they got paid less, and had less engineering discipline, and burned themselves out doing too much ad hoc labor.

But — this is no longer our overwhelming reality, and it is certainly not the reality we are hurtling towards.  Thanks to the SRE movement, and the parallel and even more powerful & diverse open source DevOps movement, operations engineers are … engineers.  Who specialize in infrastructure.  And there’s more value than ever in empathy and fluid skill sets, in engineers who are capable of moving between disciplines and translating between specialties.  This is where the “full-stack developer” buzzword comes from.  It’s annoying, but reflects a real craving for generalist skill sets.

The BOFH stereotype is dead.  Some of the most creative cultural and technical changes in the technical landscape are being driven by the teams most identified with operations and developer tooling.  The best software engineers I know are the ones who consistently value the impact and lifecycle of the code they ship, and value deployment and instrumentation and observability.  In other words, they rock at ops stuff.

The Glorious Future

And so I think it’s time to bring back “operations” as a term of pride.  As a thing that is valued, and rewarded.  As a thing that every single person in an org understands as being critical to success.  Every organization has unique operational needs, and figuring out what they are and delivering on them takes a lot of creativity and ingenuity on both the cultural and technical front.

“Operations” comes with baggage, no doubt.  But I just don’t think that distance and denial are an effective approach for making something better, let alone trash talking and devaluing the skill sets that you need to deliver quality services.

You don’t make operational outcomes magically better by renaming the team “DevOps” or “SRE” or anything else.  You make it better by naming it and claiming it for what it is, and helping everyone understand how their role relates to your operational objectives.

And now that I have written this blog post I can stop arguing with people who want to talk about “DevOps Engineers” and whether “#NoOps” is a thing and maybe I can even stop trolling them back about the nascent “#NoDevs” movement.  (Haha just kidding, that one is too much fun.)

Part 2: Operations in a #Serverless World

 


Scrapbag of useful Terraform tips

After some kinda ranty posts about terraform and AWS and networking and orchestration life in general, it feels like a good time to braindump some helpful tidbits.

 


Side note — a few people have asked me to open source my terraform config.  I’m actually super open to sharing it, but a) it’s still changing a lot and b) tf modules aren’t really reusable yet.  They just aren’t.  Eventually we’ll reach a maturity point where a tf module library makes sense, but open source tf modules haven’t been super helpful for me and I don’t expect mine will be any better.

I *did* embed a bunch of meaty gists in this post with some of the more interesting configs.  Let me know if you want to see more, I will happily send it to you, I just don’t want to be maintaining an OS repo right now.

So here are a few things that took me minutes, hours, or days to figure out, that hopefully will now take you less time.

ICMP for security groups

If you want your hosts to be pingable, you have to put this stanza in your security group.  The “from_port = 8” isn’t in the security group docs; I found it in this github issue.  Not being a networking person myself, I literally never would have guessed it.  If you want to read up more, here’s more about why.

  ingress {
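    # for icmp, "from_port" is the ICMP type and "to_port" is the code: type 8 = echo request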
    from_port = 8 
    to_port = 0 
    protocol = "icmp"
    cidr_blocks = ["0.0.0.0/0"]
  }

Here is a gist for my bastion security groups.  Note that all the security groups are in the aws_vpc module, which gets invoked separately by each environment.

Security groups are stackable in VPC, which is glorious.  But most of the time I thought I was having a problem with VPCs or routes or networking, it turned out to be a security group problem, or an interaction between the two.

Since you have no ability to debug AWS networking via normal linux utilities, my best debugging tip for VPC networking is still: when you get really stuck, open up all the SG ports and see if that fixes it.  (Preferably in, you know, a staging environment, not prod …)
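
For what it’s worth, here’s a rough sketch of what that “open everything and see” hack can look like.  The resource and variable names are made up, and you should rip this out the moment you find the real problem:

# throwaway debugging rule (hypothetical names): allow everything in from your internal ranges
resource "aws_security_group_rule" "debug_allow_all" {
  type              = "ingress"
  from_port         = 0
  to_port           = 0
  protocol          = "-1"
  cidr_blocks       = ["10.0.0.0/8"]
  security_group_id = "${aws_security_group.bastion.id}"
}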

Resource description fields != comment fields

Do not use your resource description fields as comments about those resources.

It feels like they should be comment strings, doesn’t it?  Well, they aren’t.  If you change your “comment” terraform will try to destroy and recreate the resource (which may or may not even work, if it’s like a security group that all your environments and other resources happen to inherit.  Hypothetically speaking.)

This isn’t a Hashicorp thing, it’s an AWS thing.  You can’t go edit the description in the console, either — try it!  It’s like a smelly, lingering remnant of the bad old days before we had tags.

So use tags, or use comments in your code.  Don’t use descriptions for documentation.
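
Something like this (names invented): set the description once to something boring and never touch it, and put anything you might actually want to change in tags:

resource "aws_security_group" "internal_elb" {
  name        = "${var.env}_internal_elb"
  description = "managed by terraform"   # set it once, forget it exists
  vpc_id      = "${aws_vpc.mod.id}"

  tags {
    Name  = "${var.env}_internal_elb"
    Notes = "internal ELB traffic only"   # change this all you want, no destroy/recreate
  }
}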

Picking VPC ranges

The biggest block you can assign to a VPC is a /16 (around 65k addresses).  Probably don’t start your numbering with 10.0.0.0/16, just in case you ever want to peer with anyone else, who almost certainly started with 10.0.0.0/16 too.
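
e.g. something like this, with a totally arbitrary range (pick your own):

resource "aws_vpc" "mod" {
  cidr_block           = "10.23.0.0/16"   # anything but 10.0.0.0/16
  enable_dns_support   = true
  enable_dns_hostnames = true

  tags {
    Name = "${var.env}_vpc"
  }
}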

Route tables

Only one route table can be associated with each subnet.  (Again, NONE of your routes will show up in netstat -nr or any of the normal Linux tools, which is fucking infuriating.)

I recommend not using aws_route_table with an inline blob of routes, but instead using aws_route resources.  These are additive resources, so it gives you more fine-grained control if you want different environments to have different routing tables.

Peering VPCs

Peering is so fucking rad.  I’m so, so happy with it.  Peering makes VPC-per-env tractable and flexible and not horribly annoying.

In order to peer VPCs, if you have a separate state file per environment (which you really should), you will need to import remote state.  It’s not very obvious from the documentation, but this is an incredibly powerful feature.  It lets you refer to variables from remote state files just like they were modules.

I use S3 for saving state, with versioning turned on for the bucket.

I have a locked-down dev VPC which is automatically peered with all other VPCs and allowed to ssh into them, but can’t connect to any other ports in those VPCs.  (Using security groups, but also network ACLs for an extra sanity check.)  And none of the other VPCs are peered with each other, so none of the test or staging or prod environments can accidentally connect to each other.
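
The “ssh only” part is just an ordinary security group rule on the receiving side.  A sketch, with a hypothetical var.dev_cidr:

# in staging/test/prod: allow ssh in from the peered dev VPC, and nothing else
resource "aws_security_group_rule" "ssh_from_dev" {
  type              = "ingress"
  from_port         = 22
  to_port           = 22
  protocol          = "tcp"
  cidr_blocks       = ["${var.dev_cidr}"]
  security_group_id = "${aws_security_group.bastion.id}"
}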

I ran into a few things while setting up peering.  (Relevant context: I have 4 public subnets and 4 private NAT subnets for each VPC, one subnet per availability zone.)

  • First, like I said, I had to refactor my aws_route_table into a bunch of aws_route resources, because I didn’t want the route tables to look the same for every environment (staging shouldn’t be able to talk to prod but dev should, etc)
  • If you own both VPCs, you can set up auto-accept, which is super rad.  If not, someone has to go to the console and click ok somewhere.
  • You need to include your “owner id” in the peering config, which confused me for a bit but you just have to log in as the root account and look under billing somewhere.  (I don’t remember where, google it.)
  • Second, peering has to be set up in both directions before connections will actually work.  I naively assumed that if it was set up and auto-accepted from VPC-A to VPC-B, connections from VPC-A to VPC-B would work.  Nope!  you also have to establish the peering from VPC-B to VPC-A before either direction will work.
  • All public subnets share a single route table, but each private subnet has its own (necessary for NAT).  So I had to set up peering from every single one of the private subnets that I wanted to be able to connect out from.

Here’s the gist to the networking portion of my aws_vpc module.  (The rest of the module is mostly just security groups.)

And some sample peering configs (you need one for each VPC, like I mentioned, so it’s bidirectional for each pair).  Here’s a gist snippet from the dev side, and the paired snippet from the staging side.

(You can tell how confident I was in these changes by how I named the resources, and added blamey “Author” tags for a coworker who hadn’t actually started working with me yet.  I don’t think he’s noticed yet, lol.)
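
If you don’t feel like digging through the gists, the shape of one direction looks roughly like this (resource and variable names invented for illustration):

resource "aws_vpc_peering_connection" "dev_to_staging" {
  vpc_id        = "${aws_vpc.mod.id}"
  peer_vpc_id   = "${terraform_remote_state.staging_state.output.vpc_id}"
  peer_owner_id = "${var.aws_account_id}"
  auto_accept   = true   # only works if you own both VPCs

  tags {
    Name = "dev_to_staging"
  }
}

# every private route table needs its own route to the peered VPC's range
resource "aws_route" "dev_to_staging" {
  count                     = "${length(compact(split(",", var.private_ranges)))}"
  route_table_id            = "${element(aws_route_table.private.*.id, count.index)}"
  destination_cidr_block    = "${var.staging_cidr}"
  vpc_peering_connection_id = "${aws_vpc_peering_connection.dev_to_staging.id}"
}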

NAT gateways, IGWs

You probably already set up an IGW resource for your public subnets to talk to the internet.  Just add it to every public subnet, easy breezy:


resource "aws_internet_gateway" "mod" {
  vpc_id = "${aws_vpc.mod.id}"
  tags { 
    Name = "${var.env}_igw"
  }
}

# add a public gateway to each public route table
resource "aws_route" "public_gateway_route" {
  route_table_id = "${aws_route_table.public.id}"
  depends_on = ["aws_route_table.public"]
  destination_cidr_block = "0.0.0.0/0"
  gateway_id = "${aws_internet_gateway.mod.id}"
}

Lots of people seem to still be setting up custom Linux boxes to NAT traffic out from private subnets to the internet.  (A prominent internet service provider had an outage a couple weeks ago because they were doing this.)

Use NAT gateways instead, if you can.  They are basically just like ELBs but for natting out to the internet.  They scale out according to throughput in roughly the same way, up to 10 Gbps bursts.

BUT MIND THE FUCKING TRAP.  You do not attach these NAT gateways to your PRIVATE subnets, you attach them to the PUBLIC FUCKING SUBNETS, and then add a route from the private subnet to that gateway.  Gahhhhhh.

resource "aws_eip" "nat_eip" {
  count    = "${length(split(",", var.public_ranges))}"
  vpc = true
}

resource "aws_nat_gateway" "nat_gw" {
  count = "${length(split(",", var.public_ranges))}"
  allocation_id = "${element(aws_eip.nat_eip.*.id, count.index)}"
  subnet_id = "${element(aws_subnet.public.*.id, count.index)}"
  depends_on = ["aws_internet_gateway.mod"]
}

# for each of the private ranges, create a "private" route table.
resource "aws_route_table" "private" {
  vpc_id = "${aws_vpc.mod.id}"
  count = "${length(compact(split(",", var.private_ranges)))}"
  tags { 
    Name = "${var.env}_private_subnet_route_table_${count.index}"
  }
}
# add a nat gateway to each private subnet's route table
resource "aws_route" "private_nat_gateway_route" {
  count = "${length(compact(split(",", var.private_ranges)))}"
  route_table_id = "${element(aws_route_table.private.*.id, count.index)}"
  destination_cidr_block = "0.0.0.0/0"
  depends_on = ["aws_route_table.private"]
  nat_gateway_id = "${element(aws_nat_gateway.nat_gw.*.id, count.index)}"
}

(Thank you @ebroder, I would probably NEVER have figured this out on my own.  AWS docs are completely unintelligible on the subject.)

A note on ELB SGs

Oh … you probably know this, but your ELBs should be in a separate / more permissive SG than the instances backing those ELBs.  You don’t want people to be able to connect directly to e.g. port 80 or 8080 on an application host, bypassing the ELB.
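
i.e. something along these lines (hypothetical names and ports):

resource "aws_security_group" "ext_elb" {
  name   = "${var.env}_ext_elb"
  vpc_id = "${aws_vpc.mod.id}"

  # the ELB is the only thing exposed to the world
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "app" {
  name   = "${var.env}_app"
  vpc_id = "${aws_vpc.mod.id}"

  # app instances only accept traffic from the ELB's SG, not from the internet
  ingress {
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = ["${aws_security_group.ext_elb.id}"]
  }
}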

ELB certificates

If you live in us-east, use the new AWS certificate manager.  It’s free and you’ll never have to worry about cert expirations ever ever again.

If you don’t — or if you didn’t notice the announcement LITERALLY A FEW DAYS BEFORE you purchased your own Digicert wildcard cert (wahhhh) — you should just add the cert to your ELB in the console and refer to the ARN in your tf configs, because otherwise your private key will be in the state file.
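
Something like this, where var.ssl_cert_arn is just a string you paste in (an ACM ARN, or the ARN of the cert you uploaded in the console); names here are made up:

resource "aws_elb" "api_ext" {
  name            = "${var.env}-api-ext"
  subnets         = ["${aws_subnet.public.*.id}"]
  security_groups = ["${aws_security_group.ext_elb.id}"]

  listener {
    lb_port            = 443
    lb_protocol        = "https"
    instance_port      = 8080
    instance_protocol  = "http"
    ssl_certificate_id = "${var.ssl_cert_arn}"   # just the ARN, so the private key stays out of your tfstate
  }
}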

Ok that’s it

Yesterday I spun up another whole new VPC clone by adding about 5 lines and copying a couple files + sed -e’ing the name of the environment.  Took about two minutes, felt like a fucking badass.  ^_^

I will now proceed to forget as much as possible about all the things I have learned about networking over the past two months.



Nail polish: the superior paint

Here is a thing that more people need to know: nail polish is the toughest, brightest, sparkliest, most durable paint in the world.  It makes everything prettier and brighter and it lasts forever.

I have a tradition of always painting a new keyboard every time I start a new job.  Please witness exhibit A, my freshly painted Hound keyboard & num pad:

[photo: microsoft sculpt ergonomic keyboard]

Doesn’t that just make you happy to look at?  The idea of typing on a plain beige or black keyboard all day long is … actually one of the more depressing things I can imagine.

Now check out the one I painted when I started at Parse and used nearly every day for almost 4 years:

[photo: old-school microsoft natural keyboard, RIP]

Still pretty cute, right?  It’s a little dingy and a few chips here or there but is there literally any other paint on earth that would have taken this abuse for four years?  (And you can’t even tell how pretty the shimmery holographic silver keys are in this pic.)

Dude, it gets better.  My work laptop:

[photo: work laptop, aluminum MBP with hound swag]

And since I got my work and personal laptops around the same time, I made them fraternal twins.

[photo: personal laptop, dark grey MBA]

Yo, this shit can bang around in my laptop bag for *years* and never chip or wear off or fade.  Nail polish is the most badass paint in the entire world.

(The only thing I will not paint is my fingernails, I hate painted nails.)

I really really wish I had saved pictures of all the laptops and keyboards and monitors and mice and cell phones and other crap I have painted over the last decade or so.  But here are a few of the pics and pieces I still have lying around:

 

People keep asking me where I get my laptop “stickers”, or my awesome keyboards, and now you know.

Nail polish.  It is the shit.

Pro tip #1: if you want to paint bright colors on a black background, you need to lay down a very light base first.  Surprisingly, white doesn’t work well, it’s not opaque enough and ends up looking streaky.  You need something reflective, like silver aluminum foil or silver chrome — maybe this?  Dunno, I have a big stash so I haven’t really bought anything new in years and can’t recommend specific brands.

Pro Tip #2: cheap polish is A-OK.  Those 99 cent bottles are fine, just select for opacity.  Sheer colors don’t work very well, if you have to apply more than 2-3 coats it will get globby and weird as well as being tedious.

Pro Tip #3: sketch your designs out with a pencil before you paint them (easily erasable), and apply a top coat the next day to make everything even shinier and longer-lasting.

Please send me your pics.  :)

Happy painting!!


Terraform, VPC, and why you want a tfstate file per env

Hey kids!  If you’ve been following along at home, you may have seen my earlier posts on getting started with terraform and figuring out what AWS network topology to use.  You can think of this one as like if those two posts got drunk and hooked up and had a bastard hell child.

Some context: our terraform config had been pretty stable for a few weeks.  After I got it set up, I hardly ever needed to touch it.  This was an explicit goal of mine.  (I have strong feelings about delegation of authority and not using your orchestration layer for configuration, but that’s for another day.)

And then one day I decided to test drive Aurora in staging, and everything exploded.

Trigger warning: rants and scary stories about computers ahead.  The first half of this is basically a post mortem, plus some tips I learned about debugging terraform outages.  The second half is about why you should care about multiple state files, how to set up and manage multiple state files, and the migration process.

First, the nightmare

This started when I made a simple tf module for Aurora, spun it up in staging, assigned a CNAME, turned on multi-AZ support for RDS, was really just fussing around with minor crap in staging, like you do.  So I can’t even positively identify which change it was that triggered it, but around 11 pm Thursday night the *entire fucking world* blew up.  Any terraform command I tried to run would just crash and dump a giant crash.log.

Terraform crashed! This is always indicative of a bug within Terraform.

So I start debugging, right?  I start backing out changes bit by bit.  I removed the Aurora module completely.  I backed out several hours worth of changes.  It became clear that the tfstate file must be “poisoned” in some way, so I disabled remote state storage and started performing surgery on the local tfstate file, extracting all the Aurora references and anything else I could think of, just trying to get tf plan to run without crashing.

What was especially crazy-making was the fact that I could apply *any* of the modules or resources to *any* of the environments independently.  For example:

 $ for n in modules/* ; do terraform plan -target=module.staging_$n ; done

… would totally work!  But “terraform plan” in the top level directory took a giant dump.

I stabbed at this a bunch of different ways.  I scrolled through a bunch of 100k line crash logs and grepped around for things like “ERROR”.  I straced the terraform runs, I deconstructed tens of thousands of lines of tfstates.  I spent far too much time investigating the resources that were reporting “errors” at the end of the run, which — spoiler alert — are a total red herring.  Stuff like this —

 15 module.staging_vpc.aws_eip.nat_eip.0: Refreshing state... (ID: eipalloc-bd6987da)
 16 Error refreshing state: 34 error(s) occurred:
 17 
 18 * aws_s3_bucket.hound-terraform-state: unexpected EOF
 19 * aws_rds_cluster_instance.aurora_instances.0: unexpected EOF
 20 * aws_s3_bucket.hound-deploy-artifacts: unexpected EOF
 21 * aws_route53_record.aurora_rds: connection is shut down

It’s all totally irrelevant, disregard.

It’s like 5 am by this point which is why I feel only slightly less fucking retarded about the fact that @stack72 had to gently point out that all you have to do is find the “panic” in the crash log, because terraform is written in Go so OF COURSE THERE IS A PANIC buried somewhere in all that dependency graph spew.  God, I felt like such an idiot.

The painful recovery

I spent several hours trying to figure out how to recover gracefully by isolating and removing whatever was poisoned in my state.  Unfortunately, some of the techniques I used to try and systematically validate individual components or try to narrow down the scope of the problem ended up making things much, much worse.

For example: applying individual modules (“terraform apply -target={module}”) can be extremely problematic.  I haven’t been able to get an exact repro yet, but it seems to play out something like this: if you’re applying a module that depends on other resources that get dynamically generated and passed into it, but you aren’t applying the modules that do that work in the same run, terraform will sometimes just create them again.

Like, a whole duplicate set of all your DNS zones with the same domain names but different zone ids, or a duplicate set of VPCs with the same VPC names and subnets, routes, etc, and you only find out if it gets to a point where it tries to create one of those rare resources where AWS actually enforces unique names, like autoscaling groups.

And yes, I do run tf plan.  Religiously.  But when you’re already in a bad state and you’re trying to do unhealthy things with mutated or corrupt tfstate files … shit happens.  And thus, by the time this was all over:

I ended up deleting my entire infrastructure by hand, three times.

I’m literally talking about clicking select-all-delete in the console on fifty different pages, writing grungy shell scripts to cycle through and delete every single AWS resource, purging local cache files, tracking down resources that exist but don’t show up on the console via the CLI, etc.

Of course every time I purged and started from scratch, it had to create a new zone ID for the root domain, so I had to go back and update my registrar with the new nameservers, because each AWS zone ID is associated with a different set of nameservers.  Meanwhile our email, website, API etc were all unresolvable.

If we were in production, this would have been one of the worst outages of my career, and that’s … saying a lot.

So … Hi!  I learned a bunch of things from this.  And I am *SO GLAD* I got to learn them before we have any customers!

The lessons learned

(in order of least important to most important)

1. Beware of accidental duplicates.

It actually really pisses me off how easily you can just nonchalantly, without any warning, create a duplicate VPC or Route53 zone with the same name, same delegation, same subnets, etc.  Like why would anyone ever WANT that behavior?  I blame AWS for this, not Hashicorp, but jesus christ.

So that’s going on my “Wrapper Script TODO” list: literally check AWS for any existing VPC or Route53 zone with the same name, and bail on dupes in the plan phase.

(And by the way — this outage was hardly the first time I’ve run into this.  I’ve had tf  unexpectedly spin up duplicate VPCs many, many, many times.  It is *not* because I am running apply from the wrong directory, I’ve checked for any telltale stray .terraforms.  I usually don’t even notice for a while so it’s hard to figure out what exactly I did to cause it, but definitely seems related to applying modules or subsets of a full run.  Anyway, I am literally setting up a monitoring check for duplicate VPC names which is ridiculous but whatever, maybe it will help me track this down.)

(Also, I really wish there was a $TF_ROOT.)

2. Tag your TF-owned resources, and have a kill script.

This is one of many great tips from @knuckolls that I am totally stealing.  Every taggable resource that’s managed by terraform, give it a tag like “Terraform: true”, and write some stupid boto script that will just run in a loop until it has successfully terminated everything with that tag + maybe an env tag.  (You probably want to explicitly exclude some resources, like your root DNS zone id, S3 buckets, data backups.)
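
The tagging part is trivial; names here are invented:

resource "aws_instance" "bastion" {
  ami           = "${var.bastion_ami}"
  instance_type = "t2.micro"
  subnet_id     = "${element(aws_subnet.public.*.id, 0)}"

  tags {
    Name      = "${var.env}_bastion"
    Env       = "${var.env}"
    Terraform = "true"   # the kill script sweeps everything with this tag (plus an env tag)
  }
}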

But if you get into a state where you *can’t* run terraform destroy, but you *could* spin up from a clean slate, you’re gonna want this prepped.  And I have now been in that situation at least four or five times not counting this one.  Next time it happens, I’ll be ready.

Which brings me to my last and most important point — actually the whole reason I’m even writing this blog post.  (she says sheepishly, 1000 words later.)

3. Use separate state files for each environment.

Sorry, let me try  this again in a font that better reflects my feelings on the subject:

USE SEPARATE STATE FILES.

FOR EVERY ENVIRONMENT.

This is about limiting your blast radius.  Remember: this whole thing started when I made some simple changes to staging.

To STAGING.

But all my environments shared a state file, so when something bad happened to that state file they all got equally fucked.

If you can’t safely test your changes in isolation away from prod, you don’t have infrastructure as code.

Look, you all know how I feel about terraform by now.  I love composable infra, terraform is the leader in the space, I love the energy of the community and the responsiveness of the tech leads.  I am hitching my wagon to this horse.

It is still as green as the motherfucking Shire and you should assume that every change you make could destroy the world.  So your job as a responsible engineer is to add guard rails, build a clear promotion path for validating changesets into production, and limit the scope of the world it is capable of destroying.  This means separate state files.

So Monday I sat down and spent like 10 hours pounding out a version of what this could look like.  There aren’t many best practices to refer to, and I’m not claiming my practices are the bestest, I’m just saying I built this thing and it makes sense to me and I feel a lot better about my infra now.  I look forward to seeing how it breaks and what kinds of future problems it causes for me.


HOWTO: Migrating to multiple state files

Previous config layout, single state file

If you want to see some filesystem ascii art about how everything was laid out pre-Monday, here you go.

terraform-layout-old

Basically: we used s3 remote storage, in a bucket with versioning turned on.  There was a top-level .tf file for each environment {production,dev,staging}, top-level .tf files for env-independent resources {iam,s3}, and everything else in a module.

Each env.tf file would call out to modules to build its VPC, public/private subnets, IGW, NAT gateway, security groups, public/private route53 subdomains, an auto-scaling group for each service (including launch config and int or ext ELB), bastion hosts, external DNS, tag resources, and so forth.

New config layout, with state file per environment

In the new world there’s one directory per environment and one base directory, each of which has their own remote state file (you source the init.sh file to initialize a new environment, after that it just works if you run tf commands from that directory).  “Base” has a few resources that don’t correspond to any environment — s3 buckets, certain IAM roles and policies, the root route53 zone.

Here’s an ascii representation of the new filesystem layout: terraform-layout-current

All env subdirectories have a symlink to ../variables.tf and ../initialize.tf.  Variables.tf declares the default variables that are shared by all environments — things like

$var.region, $var.domain, $var.tf_s3_bucket

Initialize.tf contains empty variable declarations for the variables that will be populated in each env’s .tfvars file, things like

$var.cidr, $var.public_subnets, $var.env, $var.subdomain_int_name

Other than that, each environment just invokes the same set of modules the same way they did before.
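
To make that concrete, a stripped-down sketch (the values are obviously invented):

# initialize.tf, symlinked into every env directory: empty declarations only
variable "env" {}
variable "cidr" {}
variable "public_subnets" {}
variable "subdomain_int_name" {}

# staging.tfvars: the per-env values that fill in those declarations
env                = "staging"
cidr               = "10.11.0.0/16"
public_subnets     = "10.11.0.0/24,10.11.1.0/24,10.11.2.0/24,10.11.3.0/24"
subdomain_int_name = "staging-internal"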

The thing that makes all this possible?  Is this little sweetheart, terraform_remote_state:

resource "terraform_remote_state" "master_state" {
  backend = "s3"
  config {
    bucket = "${var.tf_s3_bucket}"
    region = "${var.region}"
    key = "${var.master_state_file}"
  }
}

It was not at all apparent to me from the docs that you could not only store your remote state, but also query values from it.  So I can set up my root DNS zones in the base environment, and then ask for the zone identifiers in every other module after that.

module "subdomain_dns" {
  source = "../modules/subdomain_dns"
  root_public_zone_id = "${terraform_remote_state.master_state.output.route53_public_zone}"
  root_private_zone_id = "${terraform_remote_state.master_state.output.route53_internal_zone}"
  subdomain_int = "${var.subdomain_int_name}"
  subdomain_ext = "${var.subdomain_ext_name}"
}

How. Fucking. Magic. is that.

SERIOUSLY.  This feels so much cleaner and better.  I removed hundreds of lines of code in the refactor.

(I hear Atlas doesn’t support symlinks, which is unfortunate, because I am already in love with this model.  If I couldn’t use symlinks, I would probably use a Makefile that copied the file into each env subdir and constructed the tf commands.)

Rolling it out

Switching from single statefile to multiple state files was by far the trickiest part of the refactor.  First, I started by building a new dev environment from scratch just to prove that it would work.

Second, I did a “terraform destroy -target=module.staging”, then recreated it from the env_staging directory by running “./init.sh ; terraform plan -var-file=./staging.tfvars”.  Super easy, worked on the first try.

For production and base though, I decided to try doing a live migration from the shared state file to separate state files without any production impact.  This was mostly for my own entertainment and to prove that it could be done.  And it WAS doable, and I did it, but it was preeeettty delicate work and took about as long as literally everything else combined. (~5 hours?).  Soooo much careful massaging of tfstate files.

(Incidentally, if you ever have a syntax error in a 35k line JSON file and you want to find what line it’s on, I highly recommend http://jsonlint.com.  Or perhaps just reconsider your life choices.)

Stuff I still don’t like

There’s too much copypasta between environments, even with modules.  Honestly, if I could pass around maps and interpolate

$var.env

into every resource name, I could get rid of _so_ much code.  But Paul Hinze says that’s a bad idea that would make the graph less predictable and he’s smarter than me so I believe him.

TODO

There is lots left to do, around safety and tooling and sanity checks and helping people not accidentally clobber things.  I haven’t bothered making it safe for multiple engineers  because right now it’s just me.

This is already super long so I’m gonna wrap it up.  I still have a grab bag of tf tips and tricks and “things that took me hours/days to figure out that lots of people don’t seem to know either”, so I’ll probably try and dump that out too before I’ve moved on to the next thing.

Hope this is helpful for some of you.  Love to hear your feedback, or any other creative ways that y’all have gotten around the single-statefile hazard!

P.S. Me after the refactor was done =>

 

 
