The Future of Ops is Platform Engineering

First published on 2022-09-30 at https://www.honeycomb.io/blog/future-ops-platform-engineering.

Two years ago I wrote a piece in The New Stack about the Future of Ops Careers. Towards the end, I wrote:

The reality is that jack-of-all-trades systems infrastructure jobs are slowly vanishing: the world doesn’t need thousands of people who can expertly tune postfix, SpamAssassin, and ClamAV—the world has Gmail. (…)

Building infrastructure and operational expertise used to be bundled together into a single role. But the industry is now bifurcating along an infrastructure fault line, and the overlap between infrastructure-oriented engineers and operationally-minded engineers is swiftly eroding. Engineers who love this work increasingly have a choice to make. Either you can 1) go deep on infrastructure by joining a company that does infrastructure as a service, or 2) go broad on operability by joining a company to help them do as little infrastructure as possible.

I described the second category as “operations engineering minus the infrastructure,” dedicated to evaluating and assembling a production stack of third-party platform providers, enabling software engineers to self-serve their services and own their own code in production. I said:

  • Your job will be to aggressively minimize the cycles your org devotes to infrastructure by finding effective ways to outsource or minimize infra labor. Your job is to NOT go deep if there is any workable alternative.
  • Your job will be to work cross-functionally with all the other software engineering teams, looking for ways to speed up their time to value and helping them own their own code in production.
  • Your job will be to move past the kludgey old models of “outsourcing” to sophisticated understandings of how and where to leverage abstractions that can radically accelerate development.

That second category I was describing now has a name. We call those teams “platform engineering.”

The fifty-year arc of software careers

In the beginning, there were people who wrote and ran software. At some point, we spun away ops skills from dev skills into two different professions, but that turned out to be a ginormous mistake, so along came DevOps to reunify them. Nowadays, ops as an independent profession is in the process of fading out. Companies are spinning down their ops teams left and right. Engineers who formerly identified as sysadmins or operations have turned into DevOps engineers, and soon there will just be “software people” again. This is the way of things.

Please note that this is NOT the same thing as saying “ops is dead,” or “ops skills are no longer valuable or needed[1].” Our systems are only getting more complex, more difficult to operate, and simultaneously more critical to life on earth, which means that operational excellence has never been more desperately needed (and if you don’t respect that, 🌈 you deserve to suffer 🌈).

The industry story of the past three to five years has been us trying to figure out how to help software engineers own their own code in production[2], phasing out dedicated ops teams, and aggressively outsourcing as much infrastructure as possible.

As we should. Developer cycles are the scarcest resource in your company, and you want to spend as many of those as possible on your core product: the crown jewel, the code that makes you a business. Money is cheaper than engineering cycles, and teams that are focused on their core business will always outperform teams whose focus is spread across dozens of non-revenue-generating projects. Let someone else build and run all the dependencies and adjacencies.

Before: some engineers wrote code, and some engineers ran code.

Now: all engineers write code, and all engineers run the code they write.

Platform engineering is what stands between you and darkness

When you start talking about putting software engineers on call for their own code, and generally being more involved in production, some percentage of the time you will hear back a guttural wail of despair: “You can’t expect me to know EVERYTHING about EVERYTHING!”

Quite right; we can’t. Platform engineering teams are part of the answer to this perfectly reasonable complaint. It’s not that you’re being asked to do or understand more in toto, but the distribution of labor and responsibility is shifting:

Before: some engineers wrote code, and some engineers ran code.

Now: all engineers write code, and all engineers run the code they write—but we divide the areas of responsibility by layer or function.

The emergence of a minimum viable self-serve tier

In the earliest days of a company, your first few engineers end up bootstrapping an infrastructure by reading AWS docs or blog posts, or asking a friend for recommendations to get started. They might start by setting up a managed container service, or configuring Terraform, and for a while everybody deploys and owns their own code, just as god intended.

But cognitive limits kick in pretty quickly. The maze of APIs and SDKs and components out there is simply bewildering, even for an experienced ops hand. Before long, it becomes someone’s job to make good decisions, pick a suite of compute and storage options that serve the team’s needs, and write some tooling that pulls everything into a coherent whole—which, at a minimum, lets you:

  1. Run tests and generate new artifacts
  2. Deploy artifacts, version them, and roll back
  3. Instrument, monitor, and debug
  4. Store data somewhere, manage schemas and migrations
  5. Adjust capacity as needed
  6. Define and commit all components (and their relationships) as code

Once these are built, it should be trivial for an engineer to come along and spin up a new service using templates and components from existing services. It should be much simpler and easier to use the blessed paths than anything else, and there should be friction if you go off the beaten path.

Congratulations! You’ve just been platformed 🎉. One of the key principles of any developer platform is that it should be easy to do the right things, and hard to do the wrong things.
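
To make that concrete: the golden path for a new service might be little more than a call to modules the platform team owns. Here’s a rough, hypothetical Terraform sketch (the module and every variable in it are made up for illustration):

# A hypothetical golden-path service: everything that isn't unique to this
# service lives in modules owned and maintained by the platform team.
module "checkout_service" {
  source = "git::https://git.example.com/platform/modules//web-service"

  name           = "checkout"
  image          = "registry.example.com/checkout:1.4.2"
  instance_count = 3

  # Deploys, telemetry, capacity defaults, and rollback behavior all come from
  # the module. Going off this path means writing (and owning) all of it yourself.
}

The exact shape matters less than the property that the paved road is also the shortest road.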

The differences between platform engineering and traditional ops

Platform teams are typically staffed by engineers who are comfortable writing software. Not just scripting and automation, but writing tests and doing code reviews. Platform teams also operate much more like product development teams do, with product managers (and occasionally, designers, developer advocates, or UX researchers).

This doesn’t mean that everybody on a platform team has to have originally been a software engineer; in fact, a super common failure condition for platform teams is simply thinking all they need to do is hire software engineers to build developer tools. A strong platform team has an equally deep grounding in operations experience and software development. Individuals who are experts in both areas are fairly rare, but you can pull together a strong, well-rounded team by assembling a mix of SWEs (with some ops experience) and ops or DevOps engineers (with some software experience) and having them learn and grow from each other.

Platform teams are decidedly cloud-native; they actually mostly involve platforms built atop the cloud itself—PaaS, IaaS, everything-aaS, serverless, and so forth.

Ops/DevOps teams are oriented around managing infrastructure, often several generations of infrastructure. Their turf is everything from data centers and bare metal up through virtualization, containers, and the cloud (they aren’t so much cloud-native as cloud-enabled). They measure themselves on things like SLOs and the DORA metrics. You know they’re doing a good job if the system is up/available and users are happy.

Platform teams are oriented around providing a good experience for developers to self-serve and self-manage their code. The more swiftly and easily developers can move, the better your platform team. Operational excellence, in the platform model, is actually more the responsibility of the other engineering teams (and/or an adjacent SRE team) than that of the platform team.

Platform teams typically work higher up the stack than operations, DevOps, or SRE teams do, and they involve a great deal less infrastructure. On the contrary, platform teams are bent on paying other people to run as much shit as possible, preserving their own scarce development cycles for their core product.

Here is a somewhat tongue-in-cheek table of the similarities and differences between the archetypes.

Platform engineers vs. DevOps engineers

  • % of job spent writing code. Platform engineer: more than 50%. Ops (or DevOps) engineer: less than 50%.
  • Rest of time spent. Platform engineer: gathering product requirements, doing user research, architecture discussions, optimizing internal workflows, researching new tools and developer productivity ideas, reviewing other teams’ diffs for impact, performance tuning, helping other engineers own & scale their code, fixing CI/CD pipelines. Ops (or DevOps) engineer: fixing cron jobs, automating old setup docs, converting PXE/rsync to Chef/Puppet, converting Chef/Puppet to Terraform, converting VMs to containers, deploying software, debugging broken deploys, writing monitoring checks, doing retros, building out new services, pairing with software engineers to understand and debug their code, investigating weird shit, documentation, etc.
  • Responsible for. Platform engineer: enabling internal teams to self-serve their ability to run and own their code in production; creating standard, reusable components and processes; defining golden paths. Ops (or DevOps) engineer: infrastructure capacity planning, scaling, performance tuning, upgrading; reliability and resiliency, SLOs and monitoring/alerting; delivering a quality experience to customers.
  • Builds for. Platform engineer: internal developer teams. Ops (or DevOps) engineer: customers.
  • Development style. Platform engineer: infrastructure as a product. Ops (or DevOps) engineer: infrastructure as code.
  • Works with product managers. Platform engineer: yes. Ops (or DevOps) engineer: no.
  • Works with UX researchers or designers. Platform engineer: sometimes. Ops (or DevOps) engineer: no.
  • Dashboards & graphs. Platform engineer: uses APM, observability, tracing; cares a lot about instrumentation and OpenTelemetry. Ops (or DevOps) engineer: uses metrics, logs, dashboards; monitoring, alerting, and agent/sidecar/blackbox telemetry.
  • What “coding” means to them. Platform engineer: developing new features & services, writing tests; these are (primarily) software people who do systems. Ops (or DevOps) engineer: automation, configuration, DSLs, extending and debugging existing code; these are systems people who do software.
  • Preferred language. Platform engineer: Go, Rust. Ops (or DevOps) engineer: Python, Ruby.
  • Time spent in Linux. Platform engineer: hardly any. Ops (or DevOps) engineer: a lot.
  • Succeeds when. Platform engineer: developers can easily choose good defaults, self-serve their infra, and own their own code in production. Ops (or DevOps) engineer: infrastructure is scalable, secure, cost-effective, reliable, and customers are happy.
  • Native terrain. Platform engineer: serverless, *aaS, APIs for everything (cloud-native and above). Ops (or DevOps) engineer: instances, VMs, containers, regions, multi-cloud (everything “below,” but up to and including the cloud).
  • Databases. Platform engineer: uses hosted DBs. Ops (or DevOps) engineer: runs their own, blending automation & DBA expertise.
  • SSH. Platform engineer: no. Ops (or DevOps) engineer: yes.
  • Shell. Platform engineer: REPL. Ops (or DevOps) engineer: bash/zsh.
  • Mantra. Platform engineer: “Run Less Software.” Ops (or DevOps) engineer: “Cattle, Not Pets.”

What about DevOps vs. SRE?

Countless words have been spilled on the difference between DevOps and SRE[3], which I won’t rehash.

Here’s what I’ll say: DevOps, to me, feels like a relevant concept for companies that have a lot of infrastructure to wrangle. Companies that do in fact have dev teams and ops teams, or dev teams and DevOps teams (🙄), tend to have a lot of operational shit to automate, test, and run. They use config management, virtualization, and containers, often managing several generations worth of technology, possibly even down to data centers and bare metal. DevOps is for companies that have some combination of bare metal, VMs, regions, AZs, multi-cloud, networking devices, self-managed databases, etc.

DevOps is capacious. It contains multitudes. DevOps writes code, and DevOps has a fuckload of code to manage.

It is also on its way to becoming irrelevant. We are swiftly entering a post-DevOps world.

SRE, to me, feels different. I associate SRE with very large companies, where they mostly have software engineers owning their own code in production, but maybe still struggle with it a bit. SREs are often embedded within software engineering teams or product groups, and they focus a lot on, well, reliability, as the name suggests.

This means they do less infrastructure jockeying or automating (although they still do some coding). They typically have a lot to say about instrumentation, monitoring and observability, and cross-functional coordination. They run incident response and do blameless retros, and they tend to be experts at scaling.

If a company has both a DevOps team and SRE, typically I expect to see the SRE team more on the frontlines, involved with incidents, telemetry, etc., and DevOps teams more on the backburner, slinging pipes and plumbing.

Observability engineering as a case study

In the same piece I referenced earlier, I also wrote about the role of observability teams. I said they should largely no longer be running their own monitoring and graphing software in-house. Yet there is still a place for observability teams to exist: they remain a critical link between outsourced solutions and internal developer needs.

That team should write libraries, generate examples, and drive standardization; ushering in consistency, predictability, and usability. They should partner with internal teams to evaluate use cases. They should partner with your vendors as roadmap stakeholders. They might also write glue code and helper modules to connect disparate data sources and create cohesive visualizations. Basically, that team becomes an integration point between your organization and the outsourced work.

I originally wrote this about observability, but it could just as easily be used to describe platform engineering as a whole. This is the role—being the bridge between other vendors and your own core software. It’s a very high-leverage place to sit.

Ops is dead, long live ops

I’ve spent a lot of time thinking about this because we’ve had such a hard time nailing down exactly who the Honeycomb customer is. Sometimes our buyer is an ops team buying it for their SWEs, sometimes it’s SREs in the midst of an outage, sometimes it’s a VP or director of engineering, or an architect, or a CTO, or a “full stack” engineering team, or even a product manager. It is hard to form a snappy answer out of that list.

The first couple questions every new go-to-market candidate asks us are “who is your buyer?” and “how do we help them?” To which I respond with a five-minute ramble where I list every persona above and each of their pain points. Hardly the concrete answer they would like to receive.

As it goes, sociotechnical trends come and go. A year ago, Christine and I were speculating that platform engineering might be on the verge of consolidating the necessary ingredients that make up our ideal buyer:

  1. Writing and shipping code, and needing to understand their own code
  2. Positioned to help other teams with their instrumentation patterns and tooling
  3. Firmly cloud-native+ and untethered to hardware or traditional infrastructure

To my delight, since that conversation, these trends have only accelerated—and I, for one, welcome our new platform engineering overlords to the observability table. ☺️

If you’d like to learn more about platform engineering, we’ll be running a Twitter space on ✨ October 20th ✨ at 12:00 p.m. PT. Come join us! I’ll be there along with two colleagues and we’ll be answering your questions and shedding more light on the topic.


[1] I do hear people saying that, and it used to make me fucking furious, but now I just smugly remind myself how much self-inflicted suffering they are in for. Disrespecting operational expertise is the shortest path to never again sleeping through the night.

[2] It is rather incredible how rapidly this idea has taken off. When we started talking about putting developers on call for their code in 2016, people got seriously angry with us. Before that, the only twitter mention I could find of putting devs on call was one by (of course) Adrian Cockcroft, but by 2019-2020 it had stopped being controversial and soon became common wisdom.

[3] I actually wrote one of those myself: DevOps vs SRE: Delayed Coverage of the Dumbest War. LMAO. I think Liz had the final word on this back in … 2017? 2018? … when she said something like class SRE implements DevOps. And yes, DevOps is a philosophy or a methodology and not a job title, etc.


Why I hate the phrase “breaking down silos”

We hear this phrase constantly: “I worked at breaking down silos.” “We need to break down silos.” “What did I do in my last role? I broke down silos.”

It sets my fucking teeth on edge.

What is a ‘silo’, anyway? What specifically wasn’t working well, and how did you solve it; or how was it solved, and what was your contribution to the solution? Did you just follow orders, or did you personally diagnose the problem, or did some of your suggestions pan out?

Solutions to complex problems rarely work on the first go, so … what else did you try? How did you know it wasn’t working, and how did you know when to abandon earlier ideas? It’s fiendishly hard to know whether you’ve given a solution enough time to bake, for people to adjust, so that you can even evaluate whether things are better or worse off than before.

Communication is not magic pixie dust

Breaking down silos is supposed to be about increasing communication, removing barriers and roadblocks to collaboration.

But you can’t just blindly throw “more communication” at your teams. Too much communication can be just as much of a problem and a burden as too little. It can distract, and confuse, and create little eddies of information that is incorrect or harmful.

The quantity of the communication isn’t the issue, so much as the quality. Who is talking to whom, and when, and why? How does information flow throughout your company? Who gets left out? Whose input is sought, and when, and why? How can any given individual figure out who to talk to about any given responsibility?

(Image: ecard reading “Every time you say ‘break down silos’, I want to ‘break down your face.’”)

When someone says they are “breaking down silos”, whether in an interview, a panel, or casual conversation, it tells me jack shit about what they actually did.

Clichés are a substitute for critical thinking

It’s just like when people say “it’s a culture problem”, or “fix your culture”, or “everything is about people”. These phrases tell me nothing except that the speaker has gone to a lot of conferences and wants to sound cool.

If someone says “breaking down silos”, it immediately generates a zillion questions in my mind. I’m curious, because these problems are genuinely hard and people who solve them are incredibly rare.

Unfortunately, the people who use these phrases are almost never the ones who are out there in the muck and grind, struggling to solve real problems.

When asked, people who have done the hard labor of building better organizations with healthy communication flows, less inefficiency, and alignment around a single mission — people who have gotten all the people rowing in the same direction — tend to talk about the work.

People who haven’t, say they were “breaking down silos.”


Outsource Your O11y: Now Roll It Out And Keep Them Happy (part 3/3)

This is part three of a three-part series of guest posts:

  1. How To Be A Champion, on how to choose a third-party vendor and champion them successfully to your security team.  (George Chamales)
  2. Get Aligned With Security, how to work with your security team to find the best possible outcome for all sides (Lilly Ryan)
  3. Now Roll It Out And Keep Them Happy, on how to operationalize your service by rolling out the integration and maintaining it — and the relationship with your security team — over the long run (Andy Isaacson)

All this pain will someday be worth it.  🙏❤️  charity + friends


“Now Roll It Out And Keep Them Happy”

This is the third in a series of blog posts; previously we analyzed the security challenges of using a third party service, and we worked together with the security team to build empathy to deliver the project.  You might want to read those first, since we are going to build on a lot of the ideas there to ship and maintain this integration.

Ready for launch

You’ve convinced the security team and other stakeholders, you’ve gotten the integration running, you’re getting promising results from dev-test or staging environments… now it’s time to move from proof-of-concept to full implementation.  Depending on your situation this might be a transition from staging to production, or it might mean increasing a feature flipper flag from 5% to 100%, or it might mean increasing coverage of an integration from one API endpoint to cover your entire developer footprint.

Taking into account Murphy’s Law, we expect that some things will go wrong during the rollout.  Perhaps as coverage expands, a developer realizes that the schema designed to handle the app’s event mechanism can’t represent a scenario, requiring a redesign or a hacky solution.  Or perhaps the metrics dashboard shows elevated error rates from the API frontend, and while there’s no smoking gun, the ops oncall decides to roll back the integration Just In Case it’s causing the incident.

This gives us another chance to practice empathy — while it’s easy, wearing the champion hat, to dismiss any issues found by looking for someone to blame, ultimately this poisons trust within your organization and will hamper success.  It’s more effective, in the long run (and often even in the short run), to find common ground with your peers in other disciplines and teams, and work through to solutions that satisfy everybody.

Keeping the lights on

In all likelihood, as the integration succeeds, the team will rapidly develop experts and expertise, as well as idiomatic ways to use the product.  Let the experts surprise you; folks you might not expect can step up when given a chance.  Expertise flourishes when given guidance and goals; as the team becomes comfortable with the integration, explicitly recognize a leader or point person for each vendor relationship.  Having one person explicitly responsible for a relationship lets them pay attention to those vendor emails and updates, and avoid the tragedy of the “but I thought *you* were” commons.  This Integration Lead is also a center of knowledge transfer for your organization — they won’t know everything or be able to help every user come up to speed, but they can help empower the local power users in each team to ramp up their teams on the integration.

As comfort grows you will start to consider ways to change your usage, for example growing into new kinds of data.  This is a good time to revisit that security checklist — does the change increase PII exposure to your vendor?  Would the new data lead to additional requirements such as per-field encryption?  Don’t let these security concerns block you from gaining valuable insight using the new tool, but do take the chance to talk it over with your security experts as appropriate.

Throughout this organic growth, the Integration Lead remains core to managing your changing profile of usage of the vendor they shepherd; as new categories of data are added to the integration, the Lead has responsibility to ensure that the vendor relationship and risk profile are well matched to the needs that the new usage (and presumably, business value) is placing on the relationship.

Documenting the Integration Lead role and responsibilities is critical. The team should know when to check in, and writing it down helps it happen.  When new code has a security implication, or a new use case potentially amplifies the cost of an integration, bringing the domain expert in will avoid unhappy surprises.  Knowing how to find out who to bring in, and when to bring them in, will keep your team getting the right eyes on their changes.

Security threats and other challenges change over time, too.  Collaborating with your security team so that they know what systems are in use helps your team take note of new information that is relevant to your business. A simple example is noting when your vendors publish a breach announcement, but more complex examples happen too — your vendor transitions cloud providers from AWS to Azure and the security team gets an alert about unexpected data flows from your production cluster; with transparency and trust such events become part of a routine process rather than an emergency.

It’s all operational

Monitoring and alerting is a fact of operations life, and this has to include vendor integrations (even when the vendor integration is a monitoring product).  All of your operations best practices are needed here — keep your alerts clean and actionable so that you don’t develop pager fatigue, and monitor performance of the integration so that you don’t get blindsided by a creeping latency monster in your APIs.

Authentication and authorization are changing as the threat landscape evolves and industry moves from SMS verification codes to U2F/WebAuthn.  Does your vendor support your SSO integration?  If they can’t support the same SSO that you use everywhere else and can’t add it — or worse, look confused when you mention SSO — that’s probably a sign you should consider a different vendor.

A beautiful sunset

Have a plan beforehand for what needs to be done should you stop using the service.  Got any mobile apps that depend on APIs that will go away or start returning permission errors?  Be sure to test these scenarios ahead of time.

What happens at contract termination to data stored on the service?  Do you need to explicitly delete data when ceasing use?

Do you need to remove integrations from your systems before ending the commercial relationship, or can the technical shutdown and business shutdown run in parallel?

In all likelihood these are contingency plans that will never be needed, and they don’t need to be fully fleshed out to start, but a little bit of forethought can avoid unpleasant surprises.

Year after year

Industry best practice and common sense dictate that you should revisit the security questionnaire annually (if not more frequently). Use this chance to take stock of the last year and check in — are you getting value from the service?  What has changed in your business needs and the competitive landscape? 

It’s entirely possible that a new year brings new challenges, which could make your current vendor even more valuable (time to negotiate a better contract rate!) or could mean you’d do better with a competing service.  Has the vendor gone through any major changes?  They might have new offerings that suit your needs well, or they may have pivoted away from the features you need. 

Check in with your friends on the security team as well; standards evolve, and last year’s sufficient solution might not be good enough for new requirements.

 

Andy thinks out loud about security, society, and the problems with computers on Twitter.


 

❤️ Thanks so much for reading, folks.  Please feel free to drop any complaints, comments, or additional tips to us in the comments, or direct them to me on twitter.

Have fun!  Stay (a little bit) Paranoid!!

— charity


Shipping Software Should Not Be Scary

On twitter this week, @srhtcn noted that “Many incidents happen during or right after release” and asked for advice on ways to fix this.

And he’s right!  Rolling out new software is the proximate cause for the overwhelming majority of incidents, at companies of all sizes.  Upgrading software is both a necessity and a minor insanity, considering how often it breaks things.

I’m not going to recap the history of continuous integration and delivery; suffice it to say that we now know that smaller and more frequent changes are much safer than larger and less frequent changes.

But it’s still risky.  And most issues are still caused by humans and our pesky need for “improvements”.  So what can be done?

It’s not ok for software releases to be scary and hazardous

First of all: If releasing is risky for you, you need to fix that.  Make this a priority.  Track your failures, practice post mortems, evaluate your on call practices and culture.  Know if you’re getting better or worse.  This is a project that will take weeks if not months until you can be confident in the results.

You have to fix it though, because these things are self-reinforcing.  If shipping changes is scary and fraught, people will do it less and it will get even MORE scary and treacherous.

Likewise, if you turn it into a non-cortisol inducing event and set expectations, engineers will ship their code more often in smaller diffs and therefore break the world less.

Fixing deploys isn’t about eliminating errors, it’s about making your pipeline resilient to errors.  It’s fundamentally about detecting common failures and recovering from them, without requiring human intervention.

Value your tools more

As a short-term patch, you should run deploys in the mornings or whenever everyone is around and fresh.  Then take a hard look at your deploy pipeline.

In too many organizations, deploy code is a technical backwater, an accumulation of crufty scripts and glue code, forked gems and interns’ earnest attempts to hack up Capistrano.  It usually gives off a strong whiff of “sloppily evolved from many 2 am patches with no code review”.

This is insane.  Deploy software is the most important software you have.  Treat it that way: recruit an owner, allocate real time for development and testing, bake in metrics and track them over time.

If it doesn’t have an owner, it will never improve.  And you will need to invest in frequent improvements even after you’re over this first hump.

  • Signal high organizational value by putting one of your best engineers on it.
  • Recruit help from the design side of the house as well.  The “right” thing to do must be the fastest, easiest thing to do, with friendly prompts and good docs.  No “shortcuts” for people to reach for at the worst possible time.  You need user research and design here.
  • Track how often deploys fail and why.  Managers should pay close attention to this metric, just like the one for people getting interrupted or woken up, and allocate time to fixing this early whenever it sags.  Before it gets bad.
  • Allocate real time for development, testing, and training — don’t expect the work to get shoved into people’s “spare time” or post mortem cleanup time.  Make sure other managers understand the impact of this work and are on board.  Make this one of your KPIs.

In other words, make deploy tools a first class citizen of your technical toolset.  Make the work prestigious and valued — even aspirational.  If you do performance reviews, recognize the impact there.

(Btw, “how we hardened our deploys” is total Velocity-bait (&& other practitioner conferences) as well as being great for recruiting and general visibility in blog post form.  People love these stories; there definitely aren’t enough of them.)

Turn software engineers into software owners

The canonical CI/CD advice starts with “ship early, ship often, ship smaller change sets”.  That’s great advice: you should definitely do those things.  But they are covered plenty elsewhere.  What’s software ownership?

Software ownership is the natural end state of DevOps.  Software engineers, operations engineers, platform engineers, mobile engineers — everyone who writes code should own the full lifecycle of their software.

Software owners are people who:

  1. Write code
  2. Can deploy and roll back their own code
  3. Are able to debug their own issues in prod (via instrumentation, not ssh)

If you’re lacking any one of those three ingredients, you don’t have ownership.

Why ownership?  Because software ownership makes for better engineers, better software, and a better experience for customers.  It shortens feedback loops and means the person debugging is usually the person with the most context on what has recently changed.

Some engineers might balk at this, but you’ll be doing them a favor.  We are all distributed systems engineers now, and distributed systems require a much higher level of operational literacy.  May as well start today.

Fail fast, fix fast

This is about shifting your mindset from one of brittleness and a tight grip, to one of flexibility where failures are no big deal because they happen all the time, don’t impact users, and give everyone lots of practice at detecting and recovering from them.

Here are a few of the best practices you should adopt as part of this shift.

Make operability a high-value skill set.  Never promote someone to “senior engineer” if they can’t deploy and debug their own code.

Software engineers don’t have to become operational experts.  They do need to know the bare basics of instrumentation, deploy/revert, and debugging.

Everyone who puts software in production needs to understand and feel responsible for the full lifecycle of their code, not just how it works in their IDE.

Baking: it’s not just for cookies

Shipping something to production is a process of incrementally gaining confidence, not a switch you can flip.

You can’t trust code until it’s been in prod a while, until you’ve seen it perform under a wide range of load and concurrency scenarios, in lots of partial failure modes.  Only over time can you develop confidence in it not being terrible.

Nothing is production except production.  Don’t rely on never failing; expect failure, embrace failure.  Practice failure!  Build guard rails around your production systems to help you find and fix problems quickly.

The changes you need to make your pipeline more resilient are roughly the same changes you need to safely test in production.  These are a few of your guard rails.

  • Use feature flags to switch new code paths on and off
  • Build canaries for your deploy process, so you can promote releases gracefully and automatically to larger subsets of your traffic as you gain confidence in them
  • Create cohorts.  Deploy to internal users first, then any free tier, etc in order of ascending importance.  Don’t jump from 10% to 25% to 50% and then 100% — some changes are related to saturating backend resources, and the 50%-100% jump will kill you.
  • Have robots check the health of your software as it rolls out to decide whether to promote the canary.  Over time the robot checks will mature and eventually catch a ton of problems and regressions for you.  (See the sketch after this list.)
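
To make the cohorts-and-robots idea concrete, here is roughly what such a rollout policy might look like as config your deploy tooling reads.  This is a purely hypothetical schema (every field name is made up), not the format of any real tool:

# Hypothetical canary/rollout policy, read by the deploy robot.
# Not any real tool's schema; names are illustrative only.
rollout "api" {
  cohorts   = ["internal", "free-tier", "10%", "25%", "50%", "75%", "100%"]
  bake_time = "15m"               # minimum soak time at each cohort before promoting

  promote_when {
    error_rate_below  = 0.01      # measured against the current stable release
    p99_latency_below = "500ms"
  }

  on_failure = "rollback"         # no human required to un-ship a bad canary
}

The robot’s judgment will be crude at first, and that’s fine.  Every check it learns is one less 2 am page for a human.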

The quality of code is not knowable before it hits production.  You may be able to spot some problems, but you can never guarantee a lack of them.  It takes time to bake a new release and gain incremental confidence in new code.

In summary.

  1. Get someone to own the deploy software
  2. Value the work
  3. Create a culture of software ownership
  4. LOOK at what you’ve done after you do it
  5. Be suspicious of new versions until they prove themselves


Two blog posts in one weekend!  That’s definitely never happened before.  Thanks to Baron for asking me to draft this up following the weekend’s twitter thread: https://twitter.com/mipsytipsy/status/1030340072741064704.

 

 


DevOps vs SRE: delayed coverage of the dumbest war

Last week was the West Coast Velocity conference.  I had a terrific time — I think it’s the best Velocity I’ve been to yet.  I also slipped in quite late, the evening before last, to catch Gareth’s session on DevOps vs SRE.

I had to catch it, because Gareth Rushgrove (of DevOps Weekly glory) was taunting @lusis and me about it on the Internet.

And it was worth it!  Holy crap, this was such a fun barnburner of a talk, with Gareth schizophrenically arguing both for and against the key premise of the talk, which was about “Google Infrastructure for Everyone Else (GIFEE)” and whether SRE is a) the highest, noblest goal that we should all aspire towards, or b) mostly irrelevant to anyone outside the Google confines.

Which Gareth won?  Check out the slides and judge for yourself.  🙃

 

At some point in his talk, though, Gareth tossed out something like “Charity probably already has a blog post on this drafted up somewhere.”  And I suddenly remembered “Fuck!  I DO!”  It’s been sitting in my Drafts for months, god dammit.

So this is actually a thing I dashed off back in April, after CraftConf.  Somebody asked me for my opinion on the internet — always a dangerous proposition — and I went off on a bit of a rant about the differences and similarities between DevOps and SRE, as philosophies and practices.

Time passed and I forgot about it, and then decided it was too stale.  I mean who really wants to read a rehash of someone’s tweetstorm from two months ago?

Well Gareth, apparently.

Anyway: enjoy.


SRE vs DevOps: TWO PHILOSOPHIES ENTER, BOTH ARE PHENOMENALLY SUCCESSFUL AND MUTUALLY DUBIOUS OF ONE ANOTHER


 

So in case you haven’t noticed, Google recently published a book about Site Reliability Engineering: How Google Runs Production Systems.  It contains some really terrific wisdom on how to scale both systems and orgs.  It contains chapters written by dear friends of mine.  It’s a great book, and you should buy it and read it!

It also has some really fucking obnoxious blurbs.  Things like how “ONLY GOOGLE COULD HAVE DONE THIS”, and a whiff of snobbery throughout the book as though they actually believe this (which is far worse if true).

You can’t really blame the poor blurb’ers, but you can certainly look askance at a massive systems engineering org when it seems as though they’ve never heard of DevOps, or considered how it relates to SRE practices, and may even be completely unaware of what the rest of the industry has been up to for the past 10-plus years.  It’s just a little weird.

So here, for the record, is what I said about it.

 

Google is a great company with lots of terrific engineers, but you can only say they are THE BEST at what they do if you’re defining what they do tautologically, i.e. “they are the best at making Google run.”  Etsyans are THE BEST at running Etsy, Chefs are THE BEST at building Chef, because … that’s what they do with their lives.

(Image caption: “The Google SRE Bible”)

Context is everything here.  People who are THE BEST at Googling often flail and flame out in early startups, and vice versa.  People who are THE BEST at early-stage startup engineering are rarely as happy or impactful at large, lumbering, more bureaucratic companies like Google.  People who can operate equally well and be equally happy at startups and behemoths are fairly rare.

And large companies tend to get snobby and forget this.  They stop hiring for unique strengths and start hiring for lack of weaknesses or “Excellence in Whiteboard Coding Techniques,” and congratulate themselves a lot about being The Best.  This becomes harmful when it translates into less innovation, abysmal diversity numbers, and a slow but inexorable drift into dinosaurdom.

Everybody thinks their problems are hard, but to a seasoned engineer, most startup problems are not technically all that hard.  They’re tedious, and they are infinite, but anyone can figure this shit out.  The hard stuff is the rest of it: feverish pace, the need to reevaluate and reprioritize and reorient constantly, the total responsibility, the terror and uncertainty of trying to find product/market fit and perform ten jobs at once and personally deliver to your promises to your customers.

At a large company, most of the hardest problems are bureaucratic.  You have to come to terms with being a very tiny cog in a very large wheel where the org has a huge vested interest in literally making you as replicable and replaceable as possible.  The pace is excruciatingly slow if you’re used to a startup.  The autonomy is … well, did I mention the politics?  If you want autonomy, you have to master the politics.

 

Whose job is operational excellence?  Everyone’s.  Operational excellence is everyone’s job.  Dude, if you have a candidate come in and they’re a jerk to your office manager or your cleaning person, don’t fucking hire that person, because having jerks on your team is an operational risk (not to mention, you know, like moral issues and stuff).

But the more engineering-focused your role is, the more direct your impact will be on operational outcomes.

As a software engineer, developing strong ops chops makes you powerful.  It makes you better at debugging and instrumentation, building resiliency and observability into your own systems and interdependent systems, and building systems that other people can come along and understand and maintain long after you’re gone.

As an operations engineer, those skills are already your bread and butter.  You can increase your power in other ways, like by leveling up at software engineering skills like test coverage and automation, or DBA stuff like query optimization and storage engine internals, or by helping the other teams around you level up on their skills (communication and persuasion are chronically underrecognized as core operations engineering skills).

This doesn’t mean that everyone can or should be able to do everything.  (I can’t even SAY the words “full stack engineer” without rolling my eyes.)  Generalists are awesome!  But past a certain inflection point, specialization is the only way an org can scale.

It’s the only way you make room for those engineering archetypes who only want to dive deep, or who really really love refactoring, or who will save the world then disappear for weeks.  Those engineers can be incredibly valuable as part of a team … but they are most valuable in a large org where you have enough generalists to keep the oars rowing along in the meantime.

So, back to Google.  They’ve done, ahem, rather  well for themselves.  Made shitbuckets of money, pushed the boundaries of tech, service hardly ever goes down.  They have operational demands that most of us never have seen and never will, and their engineers are definitely to be applauded for doing a lot of hard technical and cultural labor to get there.

So why did this SRE book ruffle a few feathers?

Mostly because it comes off a little tone deaf in places.  I’m not personally pissed off by the google SRE book, actually, just a little bemused at how legitimately unaware they seem to be about … anything else that the industry has been doing over the past 10 years, in terms of cultural transformation, turning sysadmins into better engineers, sharing on-call rotations, developing processes around empathy and cross-functionality, engineering best practices, etc.

(Image caption: “DevOps for the rest of us”)

If you try and just apply Google SRE principles to your own org according to their prescriptive model, you’re gonna be in for a really, really bad time.

However, it happens that Jen Davis and Katherine Daniels just published a book called Effective DevOps, which covers a lot of the same ground with a much more varied and inclusive approach.  And one of the things they return to over and over again is the power of context, and how one-size-fits-all solutions simply don’t exist, just like unisex OSFA t-shirts are a dirty fucking lie.

Google insularity is … a thing.  On the one hand it’s great that they’re opening up a bit!  On the other hand it’s a little bit like when somebody barges onto a mailing list and starts spouting without skimming any of the archives.  And don’t even get me started on what happens when you hire long, longterm ex-Googlers back into the real world.

So, so many of us have had this experience of hiring ex-Googlers who automatically assume that the way Google does a thing is CORRECT, not just contextually appropriate.   Not just right for Google, but right for everyone, always.  Which is just obviously untrue.  But the reassimilation process can be quite long and exhausting when the Kool-Aid is so strong.

Because yeah, this is a conversation and a transformation that the industry has been having for a long time now.  Compared with the SRE manifesto, the DevOps philosophy is much more crowd-sourced, more flexible, and adaptable to organizations at all stages of development, with all different requirements and key business differentiators, because it’s benefited from loud, mouthy contributors who aren’t all working in the same bubble all along.

And it’s like Google isn’t even aware this was happening, which is weird.

Orrrrrr, maybe I’m just a wee bit annoyed that I’ve been drawn into this position of having to defend “DevOps”, after many excellent years spent being grumpy about the word and the 10000010101 ways it is used and abused.

(Tell me again about your “DevOps Engineering Team”, I dare you.)

P.S. I highly encourage you to go read the epic hours-long rant by @matthiasr that kicked off the whole thing.  Some of it I definitely endorse and some of it I don’t, but I think we could go drink whiskey and yell about this for a week or two, easy breezy  <3

Anyway what the fuck do I know, I’ve never worked in the Google lair, so maybe I am just under-equipped to grasp the true glory, majesty and superiority of their achievements over us all.

Or maybe they should go read Katherine and Jen’s book and interact with the “UnGoogled” once in a while.  ☺️


 


Operational Best Practices #serverless

This post is part two of my recap of last week’s terrific Serverless conference.  If you feel like getting bitchy with me about what serverless means or #NoOps or whatever, please refer back to the prequel post, where I talked about operations engineering in the modern world.

*Then* you can get bitchy with me.  (xoxoxxooxo)

The title of my talk was:

(Screenshot of the title slide.)

The theme of my talk was basically: what should software engineers know and care about when it comes to operations in a world where we are outsourcing more and more core functionality?

If you care about running a quality service or product, or providing your customers with a reasonable level of service, you have to care about operational concerns like design, resiliency, instrumentation and debuggability.  No matter how many abstractions there are between you and the bare metal.

If you chose a provider, you do not get to just point your finger at them in the post mortem and say it’s their fault.  You chose them, it’s on you.  It’s tacky to blame the software or the service, and besides your customers don’t give a shit whose “fault” it is.

So given an infinite number of things to care about, where do you start?

What is your mission, and what are your differentiators?

The first question must always be: what is your mission?  Your mission is not writing software.  Your mission is delivering whatever it is your customers are paying you for, and you use software to get there.  (Code is kind of a liability so you should write as little of it as necessary.  hey!! sounds like a good argument for #serverless!)

Second: what are your core differentiators?  What are the things you are doing that are unique and difficult to replicate, or where you actually have to be world-class experts?

Those are the things that you will have the hardest time outsourcing, or that you should think about very carefully before outsourcing.


Facts

You can outsource labor, but you can’t outsource caring.  And nobody but you is in the position to think about your core differentiators and your product in a holistic way.

If you’re a typical early startup, you’re probably using somewhere between 5 and 20 SaaS products to get rid of some of the crap work and offload it to dedicated teams who can do it better than you can, much more cheaply, so you are freed up to work on your core value proposition.

GOOD.

But you still have to think about things like reliability, your security model, your persistent storage models, your query performance, how all these lovely services talk to each other, how you’re going to debug them, how you’re going to repro when things go wrong, etc.  You still own these things, even if you don’t run them.

For example, take AWS Lambda.  It’s a pretty great service on many dimensions.  It’s an early version of the future.  It is also INCREDIBLY irritating and challenging to debug in a practically infinite number of insanity-inducing ways.

** Important side note — I’m talking about actual production systems.  Parse, Heroku, Lambda, etc are GREAT for prototyping and can take you a long, long way.  Early stage startups SHOULD optimize for agility and rapid developer iteration, not reliability.  Thx to @joeemison for reminding me that i left that out of the recap.


Focus on the critical path

Your users don’t care if your internal jenkins builds are broken.  They don’t care about a whole lot of things that you have to care about … eventually.  They do care a lot if your product isn’t actually functional.  Which means you have to think through the behavioral and failure characteristics of the providers you’re relying on in any user visible fashion.

Ask lots of questions if you can.  (AWS often won’t tell you much, but smaller providers will.)  Find out as much as you can about their cotenancy model (shared hardware or isolation?), their typical performance variance (run your own tests, don’t trust their claims), and the underlying storage systems.

Think about how you can bake in resiliency from the user’s perspective in ways that don’t rely on provider guarantees.  If you’re on mobile, can you provide a reasonable offline experience?  Parse, for example, did a lot of magic here in the APIs, where it would back off and retry saves if there were any errors.

Can you fail over to another provider if one is down?  Is it even worth it at your company’s stage of maturity and engineering resources to invest in this?

How willing are you to be locked into a vendor or provider, and what is the story if you find yourself forced to switch?  Or if that service goes away, as so many, many, many of them have done and will do.  (RIP, parse.com.)


Tradeoffs

Listen, outsourcing is awesome.  I do it as much as I can.  I’m literally helping build a service that provides outsourced metrics, I believe in this version of the future!  It’s basically the latest iteration of capitalism in a nutshell: increased complexity –> increased specialization –> you pay other people to do the job better than you –> everybody wins.

But there are tradeoffs, so let’s be real.

The service, if it is smart, will put strong constraints on how you are able to use it, so they are more likely to deliver on their reliability goals.  When users have flexibility and options it creates chaos and unreliability.  If the platform has to choose between your happiness vs thousands of other customers’ happiness, they will choose the many over the one every time — as they should.

Limits may mysteriously change or be invented as they are discovered, esp with fledgling services.  You may be desperate for a particular feature, but you can’t build it.  (This is why I went for Kafka over Kinesis.)

You need to think way more carefully and more deeply about visibility and introspection up front than you would if you were running your own services, because you have no ability to log in and use strace or gdb or tail a logfile or run any system profiling commands when things go dark.

In the best case, you’re giving up some control and quality in exchange for experts doing the work better than you could for cheaper (e.g. i’m never running a fucking physical data center again, jesus.  EC24lyfe).  In a common worse case, it’s less reliable than what you would build AND it’s also opaque AND you can’t tell if it’s down for you or for everyone because frankly it’s just massively harder to build a service that works for thousands/millions of use cases than for any one of them individually.


Stateful services

Ohhhh and let’s just briefly talk about state.

The serverless utopia mostly ignores the problems of stateful services.  If pressed they will usually say DynamoDB, or Firebase, or RDS or Aurora or something.

This is a big, huge, deep, wide lake of crap to wade into, so all I’m going to say is that there is no such thing as having the luxury of not having to understand how your storage systems work.  Queries will get slow, and you’ll need to be able to figure out why and fix them.  You’ll hit scaling cliffs where suddenly a perfectly-usable app just starts timing everything out because of that extra second of latency coming from …

¯\_(ツ)_/¯

The hardware underlying your instance will degrade (there’s a server somewhere under all those abstractions, don’t forget).  The provider will have mysterious failures.  They will be better than you, probably, but less inclined to give you satisfactory progress updates because there are hundreds or thousands or millions of you all clamoring.

The more you understand about your storage system (and the more you stay in the lane of how it was intended to be used), the happier you’ll be.


In conclusion

These trends are both inevitable and, for the most part, very good news for everyone.

Operations engineering is becoming a more fascinating and specialized skill set.  The best engineers are flocking to solve category problems — instead of building the same system at company after company, they are building SaaS solutions to solve it for the internet at large.  Just look at the massive explosion in operational software offerings over the past 5-6 years.

This means that the era of the in-house dedicated ops team, which serves as an absorbent buffer for all the pain of software development, is mostly on its way out the door.  (And good riddance.)

People are waking up to the fact that software quality improves when feedback loops are tighter for software engineers, which means being on call and owning services end to end.  The center of gravity is shifting towards engineering teams owning the services they built.

This is awesome!  You get to rent engineers from Google, AWS, Pagerduty, Pingdom, Heroku, etc for much cheaper than if you hired them in-house — if you could even get them, which you probably can’t because talent is scarce.


But the flip side of this is that application engineers need to get better at thinking in traditionally operations-oriented ways about reliability, architecture, instrumentation, visibility, security, and storage.  Figure out what your core differentiators are, and own the shit out of those.

Nobody but you can care about your mission as much as you can.  Own it, do it.  Have fun.

 


Two weeks with Terraform

I’ve been using terraform regularly for 2-3 weeks now.  I have terraformed in rage, I have terraformed in delight.  I thought it might be helpful to share some of my notes and lessons learned.

Why Terraform?

Because I am fucking sick and tired of not having versioned infrastructure.  Jesus christ, the ways my teams have bent over backwards to fake infra versioning after the fact (nagios checks running ec2 diffs, anyone?).

Because I am starting from scratch on a green field project, so I have the luxury of experimenting without screwing over existing customers.  Because I generally respect Hashicorp and think they’re on the right path more often than not.

If you want versioned infra, you basically get to choose between 1) AWS CloudFormation and its wrappers (sparkleformation, troposphere), 2) chef-provisioner, and 3) Terraform.

The orchestration space is very green, but I think Terraform is the standout option.  (More about why later.)  There is precious little evidence that TF was developed by or for anyone with experience running production systems at scale, but it’s … definitely not as actively hostile as CloudFormation, so it’s got that going for it.

First impressions

Stage one: my terraform experiment started out great.  I read a bunch of stuff and quickly spun up a VPC with public/private subnets, NAT, routes, IAM roles etc in < 2 days.  This would be nontrivial to do in two days *without* learning a new tool, so TOTAL JOY.

Stage two: spinning up services.  This is where I started being like … “huh.  Has anyone ever actually used this thing?  For a real thing?  In production?”  Many of the patterns that seemed obvious and correct to me about how to build robust AWS services were completely absent, like any concept of a subnet tier spanning availability zones.  I did some inexcusably horrible things with variables to get the behavior I wanted.
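To give a flavor of the contortion I mean: spanning a subnet tier across availability zones came down to the count/element dance over comma-joined strings.  Here is a contrived sketch (made-up names and CIDRs, not my actual config, and it assumes a vpc_id variable defined elsewhere):

variable "vpc_id"        {}
variable "azs"           { default = "us-east-1a,us-east-1b,us-east-1c" }    # "lists" are comma-joined strings
variable "private_cidrs" { default = "10.0.0.0/24,10.0.1.0/24,10.0.2.0/24" }

resource "aws_subnet" "private" {
  count             = 3    # one subnet per AZ
  vpc_id            = "${var.vpc_id}"
  availability_zone = "${element(split(",", var.azs), count.index)}"
  cidr_block        = "${element(split(",", var.private_cidrs), count.index)}"
}

Workable, but every variable that is secretly a list has to be split apart in exactly the right places, which is where the horribleness creeps in.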

Stage three: … modules.  Yo, all I wanted to do was refactor a perfectly good working config into modules for VPC, security groups, IAM roles/policies/users/groups/profiles, S3 buckets/configs/policies, autoscaling groups, policies, etc, and my entire fucking world just took a dump for a week.  SURE, I was a TF noob making noob mistakes, but I could not believe how hard it was to debug literally anything.

This is when I started tweeting sad things.

The best (only) way of debugging terraform was just reading really, really carefully, copy-pasting back and forth between multiple files for hours to get all the variables/outputs/interpolation correct.  Many of the error messages lack any context or line numbers to help you track down the problem.  Take this prime specimen:

Error downloading modules: module aws_vpc: Error loading
.terraform/modules/77a846c64ead69ab51558f8c5be2cc44/main.tf:
Error reading config for aws_route_table[private]: parse error: syntax error

Any guesses?  Turned out to be a stray ‘}’ on line 105 in a different file, which HCL vim syntax highlighting thought was A-OK.  That one took me a couple hours to track down.

Or this:

aws_security_group.zookeeper_sg: cannot parse '' as int: 
strconv.ParseInt: parsing "": invalid syntax

Which *obviously* means you didn’t explicitly define some inherited port as an int, so there’s a string somewhere there lurking in your tf tree.  (*Obviously* in retrospect, I mean, after quite a long time poking haplessly about.)
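For the record, here is a contrived reconstruction of the shape of the problem (hypothetical names and ports, not my actual config): a port variable that never gets a real value, so an empty string eventually lands in the int parser.

variable "vpc_id"      {}
variable "client_port" { default = "" }   # meant to be "2181", left blank somewhere up the tree

resource "aws_security_group" "zookeeper_sg" {
  name   = "zookeeper"
  vpc_id = "${var.vpc_id}"

  ingress {
    from_port   = "${var.client_port}"    # "" is what gets handed to strconv.ParseInt
    to_port     = "${var.client_port}"
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/8"]
  }
}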

Later on I developed more sophisticated patterns for debugging terraform.  Like, uhhh, bisecting my diffs by commenting out half of the lines I had just added, then gradually re-adding or re-commenting out more lines until the error went away.

Security groups are the worst for this.  SO MANY TIMES I had security group diffs run cleanly with “tf apply”, but then claim to be modifying themselves over and over.  Sometimes I would track this down to having passed in a variable for a port number or range, e.g. cidr_blocks = ["${var.ip_range}"].  Hard-coding it to cidr_blocks = ["10.0.0.0/8"] or setting the type explicitly would resolve the problem.  Or sometimes the culprit was a CIDR range that AWS didn’t like, such as 10.0.20.0/20 instead of 10.0.16.0/20.  The change would apply and usually it would even work; terraform just didn’t think it had worked, or something.  Since TF wasn’t aware there was a problem with the run, it would just keep “successfully” reapplying the diff every time it ran.
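In other words, the difference between the rule that flapped forever and the rule that settled down was roughly this (just the relevant line, with a hypothetical var.ip_range):

# flappy: the diff "applies" cleanly, then shows up again on the next run
cidr_blocks = ["${var.ip_range}"]

# settled down: hard-coded, or the variable pinned to a value AWS actually accepts
cidr_blocks = ["10.0.0.0/8"]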

Some advice for TF noobs

  • As @phinze told me, “modules are basically like functions — a variable is an argument, output is a return value.”  This was helpful, because that was completely unintuitive to me when I started refactoring.  It took a few days of wrestling with profoundly inscrutable error messages before modules really clicked for me.  (There’s a sketch of this right after the list.)
  • Strings.  Lists.  You can only pass variables around as strings.  split() and join() are your friends.  Oh my god I would sell so many innocent children for the ability to pass maps back and forth between modules.
  • No interpolation for resource names makes me so sad.  Basically you can either use local variable maps, or multiple lists, and just … run those index counters like a boss, I guess.
  • Use AWS termination protection for stateful services or anything risky once you’re in production.  Use create_before_destroy on resources like ASG launch configs.  Use “don’t destroy” where you must — but as sparingly as possible, because that basically breaks the entire TF model.
  • If you change the launch config for an ASG, like replacing the AMI for example, you might expect TF to kick off an instance recycle.  It will not.  You must manually terminate the instances to pick up the new config.
  • If you’re collaborating with a team — ok, even if you’re not — find a remote place to store the tfstate files.  Try S3 or GitHub, or shell out for Atlas.  Local state on laptops is for losers.
  • TF_LOG=DEBUG has never once been helpful to me.  I can only assume it was written for the Hashicorp developers, not for those of us using the product.
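To spell out the modules-are-functions analogy from the first bullet (and the string-smuggling from the second), here is a minimal, made-up sketch with hypothetical names:

# modules/vpc/main.tf (the "function")
variable "cidr" {}               # argument
variable "azs"  {}               # a "list", i.e. a comma-joined string

resource "aws_vpc" "main" {
  cidr_block = "${var.cidr}"
}

output "vpc_id" { value = "${aws_vpc.main.id}" }   # return value
output "azs"    { value = "${var.azs}" }           # hand the joined string back out

# main.tf (the caller)
module "vpc" {
  source = "./modules/vpc"
  cidr   = "10.0.0.0/16"
  azs    = "us-east-1a,us-east-1b,us-east-1c"
}

# elsewhere, split it back apart:
#   "${element(split(",", module.vpc.azs), count.index)}"

Once that mental model clicked, the inscrutable errors got a lot less inscrutable.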

Errors returned by AWS are completely opaque.  Like “You were not allowed to apply this update”.  Huh?  Ok well if it fails on “tf plan”, it’s probably a bad terraform config.  If it successfully plans but fails on “tf apply”, your AWS logic is probably at fault.

Terraform does not do a great job of surfacing AWS errors.

For example, here is some terraform output:

tf output: "* aws_route_table.private: InvalidNatGatewayID.NotFound:
The natGateway ID 'nat-0e5f4ea507113b423' does not exist"

Oh!  Okay, I go to the AWS console and track down that NAT gateway object and find this:

"Elastic IP address [eipalloc-8583b7e1] is already associated"

Hey, that seems useful!  Seems like TF just timed out bringing up one of the route tables, so it tried assigning the same EIP twice.  It would be nice to surface more of this detail into the terraform output; I hate having to resort to a web console.

Last but not least: one time I changed the comment string on a security group, and “tf plan” went into an infinite dependency loop.  I had to roll back the change, run terraform destroy against all resources in a bash for loop, and create a new security group with all new instances/ASGs just to change the comment string.  You cannot change comment strings or descriptions for resources without the resources being destroyed.  This seems PROFOUNDLY weird to me.
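As far as I can tell, the root cause is that AWS won’t let you edit a security group’s description after creation, so terraform treats any change to it as destroy-and-recreate, and everything referencing the group gets dragged along.  A contrived sketch of the innocent-looking change (hypothetical names, assumes a vpc_id variable):

resource "aws_security_group" "zookeeper_sg" {
  name        = "zookeeper"
  description = "zk client + quorum ports"   # editing only this string forces a brand new resource
  vpc_id      = "${var.vpc_id}"
}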

Wrapper scripts

Lots of people seem to eventually end up wrapping terraform with a script.  Why?

  • There is no concept of a $TF_ROOT.  If you run tf from the wrong directory, it will do some seriously confusing and screwed up shit (like duping your config, but only some of it).
  • If you’re running in production, you prob do not want people to be able to accidentally “terraform destroy” the world with the wrong environment
  • You want to enforce test/staging environments, and promotion of changes to production after they are proven good
  • You want to automatically re-run “tf plan” after “tf apply” and make sure your resources have converged cleanly.
  • So you can add slack hooks, or hipchat hooks, or github hooks.
  • Ummm, have I mentioned that TF can feel somewhat undebuggable?  Several people have told me they create rake tasks or YML templates that they then generate .tf files from so they can debug those when things break.  (Erf …)

Okay, so …

God, it feels like I’ve barely gotten started, but I should probably wrap it up.[*]  Like I said, I think terraform is best in class for infra orchestration.  And orchestration is a thing that I desperately want.  Orchestration and composability are the future of infrastructure.

But also terraform is green as fuck and I would not recommend it to anyone who needs a 4-nines platform.

Simply put, there is a lot of shit I don’t want terraform touching.  I want terraform doing as little as possible.  I have already put a bunch of things into terraform that I plan on taking right back out again.  Like, you should never be running a userdata.sh script after TF has bootstrapped a node.  Yuck.  That is a job for your cfg management, or possibly a job for packer or a custom firstboot script, but never your orchestration tool!  I have already stuffed a bunch of Route53 DNS into TF and I will be ripping that right back out soon.  Terraform should not be managing any kind of dynamic data.  Or service registry, or configs, or …

Terraform is fantastic for defining the bones of your infrastructure.  Your networking, your NAT, autoscaling groups, the bits that are robust and rarely change.  Or spinning up replicas of production on every changeset via Travis-CI or Jenkins — yay!  Do that!

But I would not feel safe making TF changes to production every day.  And you should delegate any kind of reactive scaling to ASGs or containers+scheduler or whatever.  I would never want terraform to interfere with those decisions on some arbitrary future run.

Which is why it is important to note that terraform does not play nicely with others.  It wants to own the whole thing.  Monkeypatching TF onto an existing infra is kind of horrendous.  It would be nice if you could tag certain resources or products as “this is managed by some other system, thx”.

So: why terraform?

Well, it is fairly opinionated.  It’s actively developed by some really smart people.  It’s moving fast and has most of the momentum in the space.  It’s composable and interacts well with other players iff you make some good life choices.  (Packer, for example, is amazing, by far the most unixy utility of the Hashicorp library.)

Just look at the rate of bug fixes and releases for Terraform vs CloudFormation.  Set aside cross-platform compatibility etc., and just look at the energy of the respective communities.  Not even a fair fight.

Want more?  Ok, well I would rather adopt one opinionated philosophy for my infrastructure, supplementing where necessary, than duct tape together fifty different half baked philosophies about how software and infrastructure should work and spend all my time mediating their conflicts.  (This is one of my beefs with CloudFormation: AWS has no opinions, only slobbering, squidlike, directionless-flopping optionalities.  And while we’re on the topic it also has nothing like “tf plan” for previewing changes, so THAT’S PRETTY STUPID TOO.)

I do have some concerns about Hashicorp spreading themselves too thin on too many products.  Some of those products probably shouldn’t exist.  Meh.

Terraform has a ways to go before it feels predictable and debuggable, but I think it’s heading in the right direction.  It’s been a fun couple weeks and I’m excited to start contributing to the ecosystem and integrating with other components, like chef-solo & consul.

 

[*] OMGGGGGGG, I never even got to the glorious horrors of the terraforming gem and how you are most definitely going to end up manually editing your *.tfstate files.  Ahahahahaa.

[**] Major thanks to @phinze, @solarce, @ascendantlogic, @lusis, @progrium and others who helped me limp through my first few weeks.
