“Why are my tests so slow?” A list of likely suspects, anti-patterns, and unresolved personal trauma.

Over the past couple of weeks I’ve been tweeting a LOT about lead time to deploy: the interval from when the code is written to when it has been deployed to production. Also described as “how long it takes you to run CI/CD.”

How important is this?

Fucking central.

Here is a quickie thread from this week, or just go read “Accelerate” like everybody already should have. 🙃

It’s nigh impossible to have a high-performing team with a long lead time, and it becomes drastically easier with a dramatically shorter one.

🌷 Shorter is always better.
🌻 One mergeset per deploy.
🌹 Deploy should be automatic.

And it should clock in under 15 minutes, all the way from “merging!” to “deployed!”.

Now some people will nod and agree here, and others freak the fuck out. “FIFTEEN MINUTES?” they squall, and begin accusing me of making things up or working for only very small companies. Nope, and nope. There are no magic tricks here, just high standards and good engineering, and the commitment to maintaining your goals quarter by quarter.

If you get CI/CD right, a lot of other critical functions, behaviors, and intuitions are aligned to be comfortably successful and correct with minimal effort. If you get it wrong, you will spend countless cycles chasing pathologies. It’s like choosing to eat your vegetables every day vs choosing a diet of cake and soda for fifty years, then playing whackamole with all the symptoms manifesting on your poor, mouldering body.

Is this ideal achievable for every team, on every stack, product, customer and regulatory environment in the world? No, I’m not being stupid or willfully blind. But I suggest pouring your time and creative energy into figuring out how closely you can approximate the ideal given what you have, instead of compiling all the reasons why you can’t achieve it.

Most of the people who tell me they can’t do this are quite wrong, turns out. And even if you can’t get down to 15 minutes, ANY reduction in lead time will pay out massive, compounding benefits to your team and adjacent teams forever and ever.

So — what was it you said you were working on right now, exactly? That was so important? 🤔

“Cutting my build time by 90%!” — you

Huzzah. 🤠

So let’s get you started! Here, courtesy of my twitterfriends, is a compiled list of Likely Suspects and CI/CD Offenders — anti-patterns, plus some unresolved personal pain & suffering — to hunt down and question when your build gets slow.

✨15 minutes or bust, baby!✨

Where it all started: what keeps you from getting under 15 minute CI/CD runs?

Generally good advice.

  • Instrument your build pipeline with spans and traces so you can see where all your time is going. ALWAYS. Instrument.
  • Order tests by time to execute and likelihood of failure.
  • Don’t run all tests, only tests affected by your change
  • Similarly, reduce build scope; if you only change front-end code, only build/test/deploy the front end, and for heaven’s sake don’t fuss with all the static asset generation
  • Don’t hop regions or zones any more than you absolutely must.
  • Prune and expire tests regularly, don’t wait for it to get Really Bad
  • Combine functionality of tests where possible — tests need regular massages and refactors too
  • Pipeline, pipeline, pipeline tests … with care and intention
  • You do not need multiple non-production environments in your CI/CD process. Push your artifacts to S3 and pull them down from production. Fight me on this
  • Pull is preferable to push. (see below)
  • Set a time elapsed target for your team, and give it some maintenance any time it slips by 25%
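That first bullet — instrument your pipeline, ALWAYS — can be sketched in a few lines of plain Python. This is a minimal stdlib-only sketch; a real pipeline would ship these spans to a tracing backend (Honeycomb, OpenTelemetry, etc.) rather than a list, and the `span` helper here is hypothetical:

```python
import time
from contextlib import contextmanager

# Collected span records: (name, duration_seconds). A real setup would
# send these to a tracing backend instead of keeping them in memory.
SPANS = []

@contextmanager
def span(name):
    """Time one pipeline stage and record it as a span."""
    start = time.monotonic()
    try:
        yield
    finally:
        SPANS.append((name, time.monotonic() - start))

# Wrap each stage of the build so you can see where the time goes.
with span("build"):
    with span("compile"):
        time.sleep(0.01)   # stand-in for the actual compile step
    with span("test"):
        time.sleep(0.02)   # stand-in for the test run

# Print the slowest stages first — the ones worth attacking.
for name, dur in sorted(SPANS, key=lambda s: -s[1]):
    print(f"{name}: {dur:.3f}s")
```

Once every stage is wrapped like this, “where did my 15 minutes go?” stops being a guessing game.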

The usual suspects

  • tests that take several seconds to init
  • setup/teardown of databases (HINT try ramdisks)
  • importing test data, seeding databases, sometimes multiple times
  • rsyncing sequentially
  • rsyncing in parallel, all pulling from a single underprovisioned source
  • long git pulls (eg cloning whole repo each time)
  • CI rot (eg large historical build logs)
  • poor teardown (eg prior stuck builds still running, chewing CPU, or artifacts bloating over time)
  • integration tests that spin up entire services (eg elasticsearch)
  • npm install taking 2-3 minutes
  • bundle install taking 5 minutes
  • resource starvation of CI/CD system
  • not using containerized build pipeline
  • …(etc)
Continuous deployment to industrial robots in prod?? Props, man.
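On the database setup/teardown point: the ramdisk hint works because test databases don’t need durable disk at all. A toy illustration using sqlite (names and schema made up for the example) — the same suite runs identically against a throwaway on-disk file and a purely in-memory database, and the in-memory path skips all the disk I/O and teardown:

```python
import os
import sqlite3
import tempfile

def run_suite(db_path):
    """Stand-in for a test suite: create schema, seed data, query it."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("ada",), ("lin",)])
    (count,) = conn.execute("SELECT COUNT(*) FROM users").fetchone()
    conn.close()
    return count

# Slow path: a real file on disk, torn down after every run.
with tempfile.TemporaryDirectory() as d:
    on_disk = run_suite(os.path.join(d, "test.db"))

# Fast path: the same suite against an in-memory database — no disk I/O,
# nothing to tear down. For databases that can't run purely in memory,
# pointing their data directory at a ramdisk (tmpfs) gives the same effect.
in_memory = run_suite(":memory:")

assert on_disk == in_memory == 2
```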

Not properly separating the streams of “Our Software” (changes constantly) vs “infrastructure” (changes rarely)

  • running cloudformation to set up new load balancers, dbs, etc for an entire acceptance environment
  • docker pulls, image builds, docker pushes, container spin-up for tests

“Does this really go here?”

  • packaging large build artifacts into different format for distribution
  • slow static source code analysis tools
  • trying to clone production data back to staging, or reset dbs between runs
  • launching temp infra of sibling services for end-to-end tests, running canaries
  • selenium and other UX tests, transpiling and bundling assets

“Have a seat and think about your life choices.”

  • excessive number of dependencies
  • extreme legacy dependencies (things from the 90s)
  • tests with “sleep” in them
  • entirely too large frontends that should be broken up into modules

“We regret to remind you that most AWS calls operate at the pace of ‘Infrastructure’, not ‘Software'”

  • AWS CodeBuild has several minutes of provisioning time before you’re even executing your own code — even a few distinct jobs in a pipeline and you might suffer 15 min of waiting for CodeBuild to do actual work
  • building a new AMI
  • using EBS
  • spinning up EC2 nodes .. sequentially 😱
  • cool it with the AWS calls basically

A few responses were oozing with some unresolved trauma, lol.

Natural Born Opponents: “Just cache it” and “From the top!”

  • builds install correct version of toolchain from scratch each time
  • rebuilding entire project from source every build
  • failure to cache dependencies across runs (eg npm cache not set properly)
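The core trick behind every dependency cache is the same: key the cache on a hash of the lockfile, and only reinstall when that hash changes. A minimal sketch (the `install_if_changed` helper and stamp-file scheme are my own invention for illustration, not any particular CI system’s API):

```python
import hashlib
import json
import pathlib
import tempfile

def install_if_changed(lockfile: pathlib.Path, stamp: pathlib.Path, install):
    """Run `install` only when the lockfile's hash differs from the cached stamp."""
    digest = hashlib.sha256(lockfile.read_bytes()).hexdigest()
    if stamp.exists() and stamp.read_text() == digest:
        return "cache hit"          # dependencies unchanged — skip the install
    install()                       # e.g. npm ci / bundle install
    stamp.write_text(digest)        # remember what we just installed
    return "installed"

with tempfile.TemporaryDirectory() as d:
    d = pathlib.Path(d)
    lock, stamp = d / "package-lock.json", d / ".deps-stamp"
    lock.write_text(json.dumps({"left-pad": "1.3.0"}))

    calls = []
    first = install_if_changed(lock, stamp, lambda: calls.append(1))
    second = install_if_changed(lock, stamp, lambda: calls.append(1))

assert (first, second) == ("installed", "cache hit")
assert len(calls) == 1              # the second run never hit the network
```

Real CI caches (npm cache, bundler path caching, GitHub Actions `cache` steps) are just hardened versions of this idea.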

“Parallelization: the cause of, and solution to, all CI problems”

  • shared test state, which prevents parallel testing due to flakiness and non-deterministic test results
  • not parallelizing tests
I have so many questions….
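Why does shared state block parallelism? Because two tests mutating the same rows (or globals) race each other, and the results stop being deterministic. Give every test its own state and you can fan the suite out across workers freely. A tiny sketch with stdlib threads (a real suite would use something like pytest-xdist, but the isolation principle is the same):

```python
from concurrent.futures import ThreadPoolExecutor

def make_test():
    """Build a test that owns its OWN state, so parallel runs are safe."""
    def test():
        counter = {"value": 0}        # per-test state, never shared
        for _ in range(1000):
            counter["value"] += 1
        return counter["value"]
    return test

tests = [make_test() for _ in range(8)]

# With no shared state, the tests run on 4 workers at once and the
# results are deterministic — no flakes, no ordering dependencies.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda t: t(), tests))

assert results == [1000] * 8
```

If the counter were a module-level global shared by all eight tests, the same run would produce different totals on different days — which is exactly the flakiness that forces teams to serialize their suites.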

Thanks to @wrd83, @sorenvind, @olitomli, @barney_parker, @dastbe, @myajpitz, @gfodor, @mrz, @rwilcox, @tomaslin, @pwyliu, @runewake2, @pdehlkefor, and many more for their contributions!

P.S. what did I say about instrumenting your build pipeline? For more on honeycomb + instrumentation, see this thread. Our free tier is incredibly generous, btw ☺️

Stay tuned for more long form blog posts on this topic. Coming soon. 🌈


P.S. this blog post is the best thing I’ve ever read about reducing your build time. EVER.


Questionable Advice #2: How Do I Get My Team Into Observability?

Welcome to the second installment of my advice column! Last time we talked about the emotional impact of going back to engineering after a stint in management. If you have a question you’d like to ask, please email me or DM it to me on twitter.

Hi Charity! I hope it’s ok to just ask you this…

I’m trying to get our company more aware of observability and I’m finding it difficult to convince people to look more into it. We currently don’t have the kind of systems that would require it much – but we will in future and I want us to be ahead of the game.

If you have any tips about how to explain this to developers (who are aware that quality is important but don’t always advocate for it / do it as much as I’d prefer), or have concrete examples of “here’s a situation that we needed observability to solve – and here’s how we solved it”, I’d be super grateful.

If this is too much to ask, let me know too 🙂

I’ve been talking to Abby Bangser a lot recently – and I’m “classifying” observability as “exploring in production” in my mental map – if you have philosophical thoughts on that, I’d also love to hear them 🙂



Dear Alex,

Everyone’s systems are broken. Not just yours!

Yay, what a GREAT note! I feel like I get asked some subset or variation of these questions several times a week, and I am delighted for the opportunity to both write up a response for you and post it for others to read. I bet there are orders of magnitude more people out there with the same questions who *don’t* ask, so I really appreciate those who do. <3

I want to talk about the nuts and bolts of pitching to engineering teams and shepherding technical decisions like this, and I promise I will offer you some links to examples and other materials. But first I want to examine some of the assumptions in your note, because they elegantly illuminate a couple of common myths and misconceptions.

Myth #1: you don’t need observability til you have problems of scale

First of all, there’s this misconception that observability is something you only need when you have really super duper hard problems, or that it’s only justified when you have microservices and large distributed systems or crazy scaling problems. No, no no nononono.

There may come a point where you are ABSOLUTELY FUCKED if you don’t have observability, but it is ALWAYS better to develop with it. It is never not better to be able to see what the fuck you are doing! The image in my head is of a hiker with one of those little headlamps on that lets them see where they’re putting their feet down. Most teams are out there shipping opaque, poorly understood code blindly — shipping it out to systems which are themselves crap snowballs of opaque, poorly understood code. This is costly, dangerous, and extremely wasteful of engineering time.

Ever seen an engineering team of 200, and struggled to understand how the product could possibly need more than one or two teams of engineers? They’re all fighting with the crap snowball.

Developing software with observability is better at ANY scale. It’s better for monoliths, it’s better for tiny one-person teams, it’s better for pre-production services, it’s better for literally everyone always. The sooner and earlier you adopt it, the more compounding value you will reap over time, and the more of your engineers’ time will be devoted to forward progress and creating value.

Myth #2: observability is harder and more technically advanced than monitoring

Actually, it’s the opposite — it’s much easier. If you sat a new grad down and asked them to instrument their code and debug a small problem, it would be fairly straightforward with observability. Observability speaks the native language of variables, functions and API endpoints, the mental model maps cleanly to the request path, and you can straightforwardly ask any question you can come up with. (A key tenet of observability is that it gives an engineer the ability to ask any question, without having had to anticipate it in advance.)
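To make that concrete, here is a toy sketch of the “wide event” style of instrumentation that observability tooling is built on. Every name here is made up for illustration, and a real emitter would send these events to a backend like Honeycomb rather than appending to a list — but the point survives: capture every field you know about the request, and you can ask questions you never anticipated.

```python
import json

EVENTS = []  # in a real system these go to your observability backend

def handle_request(user_id, endpoint, cart_size):
    """Toy request handler that emits ONE wide event with everything we know."""
    event = {
        "endpoint": endpoint,
        "user_id": user_id,
        "cart_size": cart_size,
        "duration_ms": 12,   # stand-in for a measured duration
    }
    # Serialize/deserialize like a real emitter would ship it over the wire.
    EVENTS.append(json.loads(json.dumps(event)))

handle_request("u1", "/checkout", 3)
handle_request("u2", "/checkout", 41)
handle_request("u1", "/home", 0)

# Because every field was captured, you can ask a question nobody planned
# for — e.g. "which users hit /checkout with big carts?"
big_carts = [e["user_id"] for e in EVENTS
             if e["endpoint"] == "/checkout" and e["cart_size"] > 10]
assert big_carts == ["u2"]
```

Contrast that with a pre-aggregated metric like `checkout.count`: the count is cheap, but the “which users?” question is simply unanswerable after the fact.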

With metrics and logging libraries, on the other hand, it’s far more complicated. You have to make a bunch of awkward decisions about where to emit various types of statistics, and it is terrifyingly easy to make poor choices (with terminal performance implications for your code and/or the remote data source). When asking questions, you are locked in to asking only the questions that you chose to ask a long time ago. You spend a lot of time translating the relationships between code and low-level systems resources, and since you can’t break down by users/apps you are blocked from asking the most straightforward and useful questions entirely!

Doing it the old way Is. Fucking. Hard. Doing it the newer way is actually much easier, save for the fact that it is, well, newer — and thus harder to google examples for copy-pasta. But if you’re saturated in decades of old school ops tooling, you may have some unlearning to do before observability seems obvious to you.

Myth #3: observability is a purely technical solution

To be clear, you can just add an observability tool to your stack and go on about your business — same old things, same old way, but now with high cardinality!

You can, but you shouldn’t.

These are sociotechnical systems and they are best improved with sociotechnical solutions. Tools are an absolutely necessary and inextricable part of it. But so are on call rotations and the fundamental virtuous feedback loop of you build it, you run it. So are code reviews, monitoring checks, alerts, escalations, and a blameless culture. So are managers who allocate enough time away from the product roadmap to truly fix deep technical rifts and explosions, even when it’s inconvenient, so the engineers aren’t in constant monkeypatch mode.

I believe that observability is a prerequisite for any major effort to have saner systems, simply because it’s so powerful being able to see the impact of what you’ve done. In the hands of a creative, dedicated team, simply wearing a headlamp can be transformational.

Observability is your five senses for production.

You’re right on the money when you ask if it’s about exploring production, but you could also use words that are even more basic, like “understanding” or “inspecting”. Observability is to software systems as a debugger is to software code. It shines a light on the black box. It allows you to move much faster, with more confidence, and catch bugs much sooner in the lifecycle — before users have even noticed. It rewards you for writing code that is easy to illuminate and understand in production.

So why isn’t everyone already doing it? Well, making the leap isn’t frictionless. There’s a minimal amount of instrumentation to learn (easier than people expect, but it’s nonzero) and then you need to learn to see your code through the lens of your own instrumentation. You might need to refactor your use of older tools, such as metrics libraries, monitoring checks and log lines. You’ll need to learn another query interface and how it behaves on your systems. You might find yourself amending your code review and deploy processes a bit.

Nothing too terrible, but it’s all new. We hate changing our tool kits until absolutely fucking necessary. Back at Parse/Facebook, I actually clung to my sed/awk/shell wizardry until I was professionally shamed into learning new ways when others began debugging shit faster than I could. (I was used to being the debugger of last resort, so this really pissed me off.) So I super get it! So let’s talk about how to get your team aligned and hungry for change.

Okay okay okay already, how do I get my team on board?

If we were on the phone right now, I would be peppering you with a bunch of questions about your organization.  Who owns production?  Who is on call?  Who runs the software that devs write?  What is your deploy process, and how often does it get updated, and by who?  Does it have an owner?  What are the personalities of your senior folks, who made the decisions to invest in the current tools (and what are they), what motivates them, who are your most persuasive internal voices?  Etc.  Every team is different.  <3

There’s a virtuous feedback loop you need to hook up and kickstart and tweak here, where the people with the original intent in their heads (software engineers) are also informed and motivated, i.e. empowered to make the changes and personally impacted when things are broken. I recommend starting by putting your software engineers on call for production (if you haven’t). This has a way of convincing even the toughest cases that they have a strong personal interest in quality and understandability.

Pay attention to your feedback loop and the alignment of incentives, and make sure your teams are given enough time to actually fix the broken things, and motivation usually isn’t a problem. (If it is, then perhaps another feedback loop is lacking: your engineers feeling sufficiently aligned with your users and their pain. But that’s another post.)

Technical ownership over technical outcomes

I appreciate that you want your team to own the technical decisions. I believe very strongly that this is the right way to go. But it doesn’t mean you can’t have influence or impact, particularly in times like this.

It is literally your job to have your head up, scanning the horizon for opportunities and relevant threats. It’s their job to be heads down, focusing on creating and delivering excellent work. So it is absolutely appropriate for you to flag something like observability as both an opportunity and a potential threat, if ignored.

If I were in your situation and wanted my team to check out some technical concept, I might send around a great talk or two and ask folks to watch it, and then maybe schedule a lunchtime discussion.  Or I might invite a tech luminary in to talk with the team, give a presentation and answer their questions.  Or schedule a hack week to apply the concept to a current top problem, or something else of that nature.

But if I really wanted them to take it fucking seriously, I would put my thumb on the scale. I would find myself a champion, load them up with context, and give them ample time and space to skill up, prototype, and eventually present to the team a set of recommendations. (And I would stay in close contact with them throughout that period, to make sure they didn’t veer too far off course or lose sight of my goals.)

  1. Get a champion.

    Ideally you want to turn the person who is most invested in the old way of doing things — the person who owns the ELK cluster, say, or who was responsible for selecting the previous monitoring toolkit, or the go-to person for ops questions — from your greatest obstacle into your proxy warrior. This only works if you know that person is open-minded and secure enough to give it a fair shot & publicly change course, has sufficiently good technical judgment to evaluate and project into the future, and has the necessary clout with their peers. If they don’t, or if they’re too afraid to buck consensus: pick someone else.

  2. Give them context.  

    Take them for a long walk. Pour your heart and soul out to them. Tell them what you’ve learned, what you’ve heard, what you hope it can do for you, what you fear will happen if you don’t. It’s okay to get personal and to admit your uncertainties. The more context they have, the better the chance they will come out with an outcome you are happy with. Get them worried about the same things that worry you, get them excited about the same possibilities that excite you. Give them a sense of the stakes.

    And don’t forget to tell them why you are picking them — because they are listened to by their peers, because they are already expert in the problem area, because you trust their technical judgment and their ability to evaluate new things — all the reasons for picking them will translate well into the best kind of flattery — the true kind.

  3. Give them a deadline.

    A week or two should be plenty. Most likely, the decision is not going to be unilaterally theirs (this also gives you a bit of wiggle room should they come back going “ah no ELK is great forever and ever”), but their recommendations should carry serious weight with the team and technical leadership. Make it clear what sort of outcome you would be very pleased with (e.g. a trial period for a new service) and what reasons you would find compelling for declining to pursue the project (e.g. the tech is unsupported, cost-prohibitive, etc). Ideally they should use this time to get real production data into the services they are testing out, so they can actually experience and weigh the benefits, not just read the marketing copy.

As a rule of thumb, I always assume that managers can’t convince engineers to do things: only other engineers can. But what you can do instead is set up an engineer to be your champion. And then just sit quietly in the corner, nodding, with an interested look on your face.

The nuclear option

if you <3 prod,
prod will <3 you back

You have one final option. If there is no appropriate champion to be found, or insufficient time, or if you have sufficient trust with the team that you judge it the right thing to do: you can simply order them to do something your way. This can feel squicky. It’s not a good habit to get into. It usually results in things being done a bit slower, more reluctantly, more half-assedly. And you sacrifice some of your power every time you lean on your authority to get your team to do something.

But it’s just as bad for a leader to take it off the table entirely.

Sometimes you will see things they can’t. If you cannot wield your power when circumstances call for it, then you don’t fucking have real power — you have unilaterally disarmed yourself, to the detriment of your org. You can get away with this maybe twice a year, tops.

But here’s the thing: if you order something to be done, and it turns out in the end that you were right? You earn back all the power you expended on it plus interest. If you were right, unquestionably right in the eyes of the team, they will respect you more for having laid down the law and made sure they did the right thing.



Some useful resources:



An Engineer’s Bill of Rights (and Responsibilities)

Power has a way of flowing towards people managers over time, no matter how many times you repeat “management is not a promotion, it’s a career change.”

It’s natural, like water flowing downhill. Managers are privy to performance reviews and other personal information that they need to do their jobs, and they tend to be more practiced communicators. Managers facilitate a lot of decision-making and routing of people and data and things, and it’s very easy to slip into making all the decisions rather than empowering people to make them. Sometimes you want to just hand out assignments and order everyone to do as told. (er, just me??)

But if you let all the power drift over to the engineering managers, pretty soon it doesn’t look so great to be an engineer. Now you have people becoming managers for all the wrong reasons, or everyone saying they want to be a manager, or engineers just tuning out and turning in their homework (or quitting). We all want autonomy and impact, we all crave a seat at the table. You need to work harder to save those seats for non-managers.

So, in the spirit of the enumerated rights and responsibilities of our musty Constitution, here are some of the commitments we make to our engineers at Honeycomb — and some of the expectations we have for managerial and engineering roles. Some of them mirror each other, and others are very different.

(Incidentally, I find it helpful to practice visualizing the org chart hierarchies upside down — placing managers below their teams as support structure rather than perched atop.)



Engineer’s Bill of Rights

  1. You should be free to go heads down and focus, and trust that your manager will tap you when you are needed (or would want to be included).
  2. We will invest in you as a leader, just like we invest in managers.  Everybody will have opportunities to develop their leadership and interpersonal skills.
  3. Technical decisions must remain the province of engineers, not managers.
  4. You deserve to know how well you are performing, and to hear it early and often if you aren’t meeting expectations.
  5. On call should not substantially impact your life, sleep, or health (other than carrying your devices around).  If it does, we will fix it.
  6. Your code reviews should be turned around in 24 hours or less, under ordinary circumstances.
  7. You should have a career path that challenges you and contributes to your personal life goals, with the coaching and support you need to get there.
  8. You should substantially choose your own work, in consultation with your manager and based on our business goals.  This is not a democracy, but you will have a voice in our planning process.
  9. You should be able to do your work whether in or out of the office. When you’re working remotely, your team will loop you in and have your back.

Engineer’s responsibilities

  • Make forward progress on your projects every week. Be transparent.
  • Make forward progress on your career every quarter. Push your limits.
  • Build a relationship of trust and mutual vulnerability with your manager and team, and invest in those relationships.
  • Know where you stand: how well are you performing, how quickly are you growing?
  • Develop your technical judgment and leadership skills. Own and be accountable for engineering outcomes. Ask for help when you need it, give help when asked.
  • Give feedback early and often, receive feedback gracefully. Practice both saying no and hearing no. Let people retract and try again if it doesn’t come out quite right.
  • Own your time and actively manage your calendar. Spend your attention tokens mindfully.

Manager’s responsibilities

  • Recruit and hire and train your team. Foster a sense of solidarity and “teaminess” as well as real emotional safety.
  • Care for every engineer on your team. Support them in their career trajectory, personal goals, work/life balance, and inter- and intra-team dynamics.
  • Give feedback early and often. Receive feedback gracefully. Always say the hard things, but say them with love.
  • Move us relentlessly forward, watching out for overengineering and work that doesn’t contribute to our goals. Ensure redundancy/coverage of critical areas.
  • Own the quarterly planning process for your team, be accountable for the goals you set. Allocate resources by communicating priorities and recruiting eng leads. Add focus or urgency where needed.
  • Own your time and attention. Be accessible. Actively manage your calendar. Try not to make your emotions everyone else’s problems (but do lean on your own manager and your peers for support).
  • Make your own personal growth and self-care a priority. Model the values and traits we want our engineers to pattern themselves after.
  • Stay vulnerable.

I’d love to hear from anyone else who has a list like this.







DevOps vs SRE: delayed coverage of the dumbest war

Last week was the West Coast Velocity conference. I had a terrific time — I think it’s the best Velocity I’ve been to yet. I also slipped in quite late, the evening before last, to catch Gareth’s session on DevOps vs SRE.

I had to catch it, because Gareth Rushgrove (of DevOps Weekly glory) was taunting @lusis and me about it on the Internet.

And it was worth it! Holy crap, this was such a fun barnburner of a talk, with Gareth schizophrenically arguing both for and against the key premise of the talk, which was about “Google Infrastructure for Everyone Else (GIFEE)” and whether SRE is a) the highest, noblest goal that we should all aspire towards, or b) mostly irrelevant to anyone outside the Google confines.

Which Gareth won? Check out the slides and judge for yourself. 🙃


At some point in his talk, though, Gareth tossed out something like “Charity probably already has a blog post on this drafted up somewhere.” And I suddenly remembered: “Fuck! I DO!” It’s been sitting in my Drafts for months, god dammit.

So this is actually a thing I dashed off back in April, after CraftConf. Somebody asked me for my opinion on the internet — always a dangerous proposition — and I went off on a bit of a rant about the differences and similarities between DevOps and SRE, as philosophies and practices.

Time passed and I forgot about it, and then decided it was too stale. I mean who really wants to read a rehash of someone’s tweetstorm from two months ago?

Well Gareth, apparently.

Anyway: enjoy.



So in case you haven’t noticed, Google recently published a book about Site Reliability Engineering: How Google Runs Production Systems. It contains some really terrific wisdom on how to scale both systems and orgs. It contains chapters written by dear friends of mine. It’s a great book, and you should buy it and read it!

It also has some really fucking obnoxious blurbs. Things about how “ONLY GOOGLE COULD HAVE DONE THIS”, and a whiff of snobbery throughout the book as though they actually believe this (which is far worse if true).

You can’t really blame the poor blurb’ers, but you can certainly look askance at a massive systems engineering org when it seems as though they’ve never heard of DevOps, or considered how it relates to SRE practices, and may even be completely unaware of what the rest of the industry has been up to for the past 10-plus years. It’s just a little weird.

So here, for the record, is what I said about it.


Google is a great company with lots of terrific engineers, but you can only say they are THE BEST at what they do if you’re defining what they do tautologically, i.e. “they are the best at making Google run.” Etsyans are THE BEST at running Etsy, Chefs are THE BEST at building Chef, because … that’s what they do with their lives.

The Google SRE Bible

Context is everything here.  People who are THE BEST at Googling often flail and flame out in early startups, and vice versa.  People who are THE BEST at early-stage startup engineering are rarely as happy or impactful at large, lumbering, more bureaucratic companies like Google.  People who can operate equally well and be equally happy at startups and behemoths are fairly rare.

And large companies tend to get snobby and forget this. They stop hiring for unique strengths and start hiring for lack of weaknesses or “Excellence in Whiteboard Coding Techniques,” and congratulate themselves a lot about being The Best. This becomes harmful when it translates into less innovation, abysmal diversity numbers, and a slow but inexorable drift into dinosaurdom.

Everybody thinks their problems are hard, but to a seasoned engineer, most startup problems are not technically all that hard. They’re tedious, and they are infinite, but anyone can figure this shit out. The hard stuff is the rest of it: feverish pace, the need to reevaluate and reprioritize and reorient constantly, the total responsibility, the terror and uncertainty of trying to find product/market fit and perform ten jobs at once and personally deliver on your promises to your customers.

At a large company, most of the hardest problems are bureaucratic. You have to come to terms with being a very tiny cog in a very large wheel where the org has a huge vested interest in literally making you as replicable and replaceable as possible. The pace is excruciatingly slow if you’re used to a startup. The autonomy is … well, did I mention the politics? If you want autonomy, you have to master the politics.


Everyone. Operational excellence is everyone’s job. Dude, if you have a candidate come in and they’re a jerk to your office manager or your cleaning person, don’t fucking hire that person, because having jerks on your team is an operational risk (not to mention, you know, like moral issues and stuff).

But the more engineering-focused your role is, the more direct your impact will be on operational outcomes.

As a software engineer, developing strong ops chops makes you powerful. It makes you better at debugging and instrumentation, building resiliency and observability into your own systems and interdependent systems, and building systems that other people can come along and understand and maintain long after you’re gone.

As an operations engineer, those skills are already your bread and butter.  You can increase your power in other ways, like by leveling up at software engineering skills like test coverage and automation, or DBA stuff like query optimization and storage engine internals, or by helping the other teams around you level up on their skills (communication and persuasion are chronically underrecognized as core operations engineering skills).

This doesn’t mean that everyone can or should be able to do everything.  (I can’t even SAY the words “full stack engineer” without rolling my eyes.)  Generalists are awesome!  But past a certain inflection point, specialization is the only way an org can scale.

It’s the only way you make room for those engineering archetypes who only want to dive deep, or who really really love refactoring, or who will save the world then disappear for weeks.  Those engineers can be incredibly valuable as part of a team … but they are most valuable in a large org where you have enough generalists to keep the oars rowing along in the meantime.

So, back to Google.  They’ve done, ahem, rather well for themselves.  Made shitbuckets of money, pushed the boundaries of tech, and the service hardly ever goes down.  They have operational demands that most of us have never seen and never will, and their engineers are definitely to be applauded for doing a lot of hard technical and cultural labor to get there.

So why did this SRE book ruffle a few feathers?

Mostly because it comes off a little tone deaf in places.  I’m not personally pissed off by the Google SRE book, actually, just a little bemused at how legitimately unaware they seem to be about … anything else the industry has been doing over the past ten years, in terms of cultural transformation, turning sysadmins into better engineers, sharing on-call rotations, developing processes around empathy and cross-functionality, engineering best practices, etc.

DevOps for the rest of us

If you try and just apply Google SRE principles to your own org according to their prescriptive model, you’re gonna be in for a really, really bad time.

However, it happens that Jen Davis and Katherine Daniels just published a book called Effective DevOps, which covers a lot of the same ground with a much more varied and inclusive approach.  And one of the things they return to over and over again is the power of context, and how one-size-fits-all solutions simply don’t exist, just like unisex OSFA t-shirts are a dirty fucking lie.

Google insularity is … a thing.  On the one hand it’s great that they’re opening up a bit!  On the other hand it’s a little bit like when somebody barges onto a mailing list and starts spouting without skimming any of the archives.  And don’t even get me started on what happens when you hire longterm ex-Googlers back into the real world.

So, so many of us have had this experience of hiring ex-Googlers who automatically assume that the way Google does a thing is CORRECT, not just contextually appropriate.   Not just right for Google, but right for everyone, always.  Which is just obviously untrue.  But the reassimilation process can be quite long and exhausting when the Kool-Aid is so strong.

Because yeah, this is a conversation and a transformation that the industry has been having for a long time now.  Compared with the SRE manifesto, the DevOps philosophy is much more crowd-sourced, more flexible, and more adaptable to organizations at all stages of development, with all different requirements and key business differentiators, because it has benefited from loud, mouthy contributors who aren’t all working in the same bubble.

And it’s like Google isn’t even aware this was happening, which is weird.

Orrrrrr, maybe I’m just a wee bit annoyed that I’ve been drawn into this position of having to defend “DevOps”, after many excellent years spent being grumpy about the word and the 10000010101 ways it is used and abused.

(Tell me again about your “DevOps Engineering Team”, I dare you.)

P.S. I highly encourage you to go read the epic hours-long rant by @matthiasr that kicked off the whole thing.  Some of it I definitely endorse and some of it I don’t, but I think we could go drink whiskey and yell about this for a week or two, easy breezy.  <3

Anyway, what the fuck do I know?  I’ve never worked in the Google lair, so maybe I am just under-equipped to grasp the true glory, majesty and superiority of their achievements over us all.

Or maybe they should go read Katherine and Jen’s book and interact with the “UnGoogled” once in a while.  ☺️



DevOps vs SRE: delayed coverage of the dumbest war

Operational Best Practices #serverless

This post is part two of my recap of last week’s terrific Serverless conference.  If you feel like getting bitchy with me about what serverless means or #NoOps or whatever, please refer back to the prequel post, where I talked about operations engineering in the modern world.

*Then* you can get bitchy with me.  (xoxoxxooxo)

The title of my talk was:

[title slide]

The theme of my talk was basically: what should software engineers know and care about when it comes to operations in a world where we are outsourcing more and more core functionality?

If you care about running a quality service or product, or providing your customers with a reasonable level of service, you have to care about operational concerns like design, resiliency, instrumentation and debuggability.  No matter how many abstractions there are between you and the bare metal.

If you choose a provider, you do not get to just point your finger at them in the post mortem and say it’s their fault.  You chose them; it’s on you.  It’s tacky to blame the software or the service, and besides, your customers don’t give a shit whose “fault” it is.

So given an infinite number of things to care about, where do you start?

What is your mission, and what are your differentiators?

The first question must always be: what is your mission?  Your mission is not writing software.  Your mission is delivering whatever it is your customers are paying you for, and you use software to get there.  (Code is kind of a liability so you should write as little of it as necessary.  hey!! sounds like a good argument for #serverless!)

Second: what are your core differentiators?  What are the things you do that are unique and difficult to replicate, or where you actually have to be world-class experts?

Those are the things that you will have the hardest time outsourcing, or that you should think about very carefully before outsourcing.



You can outsource labor, but you can’t outsource caring.  And nobody but you is in the position to think about your core differentiators and your product in a holistic way.

If you’re a typical early startup, you’re probably using somewhere between 5 and 20 SaaS products to get rid of some of the crap work and offload it to dedicated teams who can do it better than you can, much more cheaply, so you are freed up to work on your core value proposition.


But you still have to think about things like reliability, your security model, your persistent storage models, your query performance, how all these lovely services talk to each other, how you’re going to debug them, how you’re going to repro when things go wrong, etc.  You still own these things, even if you don’t run them.

For example, take AWS Lambda.  It’s a pretty great service on many dimensions.  It’s an early version of the future.  It is also INCREDIBLY irritating and challenging to debug in a practically infinite number of insanity-inducing ways.

** Important side note — I’m talking about actual production systems.  Parse, Heroku, Lambda, etc. are GREAT for prototyping and can take you a long, long way.  Early stage startups SHOULD optimize for agility and rapid developer iteration, not reliability.  Thx to @joeemison for reminding me that I left that out of the recap.


Focus on the critical path

Your users don’t care if your internal Jenkins builds are broken.  They don’t care about a whole lot of things that you have to care about … eventually.  They do care a lot if your product isn’t actually functional.  Which means you have to think through the behavioral and failure characteristics of any providers you’re relying on in any user-visible fashion.

Ask lots of questions if you can.  (AWS often won’t tell you much, but smaller providers will.)  Find out as much as you can about their cotenancy model (shared hardware or isolation?), their typical performance variance (run your own tests, don’t trust their claims), and the underlying storage systems.

Think about how you can bake in resiliency from the user’s perspective that doesn’t rely on provider guarantees.  If you’re on mobile, can you provide a reasonable offline experience?  Parse, for example, did a lot of magic here in the APIs, backing off and retrying saves if there were any errors.

Can you fail over to another provider if one is down?  Is it even worth it, at your company’s stage of maturity and engineering resources, to invest in this?

How willing are you to be locked into a vendor or provider, and what is the story if you find yourself forced to switch?  Or if that service goes away, as so many, many, many of them have done and will do.  (RIP, parse.com.)



Listen, outsourcing is awesome.  I do it as much as I can.  I’m literally helping build a service that provides outsourced metrics; I believe in this version of the future!  It’s basically the latest iteration of capitalism in a nutshell: increased complexity → increased specialization → you pay other people to do the job better than you → everybody wins.

But there are tradeoffs, so let’s be real.

The service, if it is smart, will put strong constraints on how you are able to use it, so they are more likely to deliver on their reliability goals.  When users have flexibility and options it creates chaos and unreliability.  If the platform has to choose between your happiness vs thousands of other customers’ happiness, they will choose the many over the one every time — as they should.

Limits may mysteriously change or be invented as they are discovered, especially with fledgling services.  You may be desperate for a particular feature, but you can’t build it.  (This is why I went for Kafka over Kinesis.)

You need to think way more carefully and more deeply about visibility and introspection up front than you would if you were running your own services, because you have no ability to log in and use strace or gdb or tail a logfile or run any system profiling commands when things go dark.

In the best case, you’re giving up some control and quality in exchange for experts doing the work better than you could, for cheaper (e.g. I’m never running a fucking physical data center again, jesus.  EC24lyfe).  In a common worse case, it’s less reliable than what you would build AND it’s also opaque AND you can’t tell if it’s down for you or for everyone, because frankly it’s just massively harder to build a service that works for thousands/millions of use cases than for any one of them individually.


Stateful services

Ohhhh, and let’s just briefly talk about state.

The serverless utopia mostly ignores the problems of stateful services.  If pressed they will usually say DynamoDB, or Firebase, or RDS or Aurora or something.

This is a big, huge, deep, wide lake of crap to wade into, so all I’m going to say is that there is no such thing as having the luxury of not having to understand how your storage systems work.  Queries will get slow, and you’ll need to be able to figure out why and fix them.  You’ll hit scaling cliffs where suddenly a perfectly usable app just starts timing everything out because of that extra second of latency coming from …


The hardware underlying your instance will degrade (there’s a server somewhere under all those abstractions, don’t forget).  The provider will have mysterious failures.  They will be better than you, probably, but less inclined to give you satisfactory progress updates, because there are hundreds or thousands or millions of you all clamoring.

The more you understand about your storage system (and the more you stay in the lane of how it was intended to be used), the happier you’ll be.


In conclusion

These trends are both inevitable and, for the most part, very good news for everyone.

Operations engineering is becoming a more fascinating and specialized skill set.  The best engineers are flocking to solve category problems — instead of building the same system at company after company, they are building SaaS solutions to solve it for the internet at large.  Just look at the massive explosion in operational software offerings over the past 5-6 years.

This means that the era of the in-house dedicated ops team, which serves as an absorbent buffer for all the pain of software development, is mostly on its way out the door.  (And good riddance.)

People are waking up to the fact that software quality improves when feedback loops are tighter for software engineers, which means being on call and owning services end to end.  The center of gravity is shifting towards engineering teams owning the services they built.

This is awesome!  You get to rent engineers from Google, AWS, PagerDuty, Pingdom, Heroku, etc. for much cheaper than if you hired them in-house — if you could even get them, which you probably can’t, because talent is scarce.


But the flip side of this is that application engineers need to get better at thinking in traditionally operations-oriented ways about reliability, architecture, instrumentation, visibility, security, and storage.  Figure out what your core differentiators are, and own the shit out of those.

Nobody but you can care about your mission as much as you can.  Own it, do it.  Have fun.



Two weeks with Terraform

I’ve been using terraform regularly for 2-3 weeks now.  I have terraformed in rage, I have terraformed in delight.  I thought it might be helpful to share some of my notes and lessons learned.

Why Terraform?

Because I am fucking sick and tired of not having versioned infrastructure.  Jesus christ, the ways my teams have bent over backwards to fake infra versioning after the fact (nagios checks running ec2 diffs, anyone?).

Because I am starting from scratch on a greenfield project, so I have the luxury of experimenting without screwing over existing customers.  Because I generally respect Hashicorp and think they’re on the right path more often than not.

If you want versioned infra, you basically get to choose between 1) AWS CloudFormation and its wrappers (sparkleformation, troposphere), 2) chef-provisioner, and 3) Terraform.

The orchestration space is very green, but I think Terraform is the standout option.  (More about why later.)  There is precious little evidence that TF was developed by or for anyone with experience running production systems at scale, but it’s … definitely not as actively hostile as CloudFormation, so it’s got that going for it.

First impressions

Stage one: my terraform experiment started out great.  I read a bunch of stuff and quickly spun up a VPC with public/private subnets, NAT, routes, IAM roles etc in < 2 days.  This would be nontrivial to do in two days *without* learning a new tool, so TOTAL JOY.

Stage two: spinning up services.  This is where I started being like … “Huh.  Has anyone ever actually used this thing?  For a real thing?  In production?”  Many of the patterns that seemed obvious and correct to me about how to build robust AWS services were completely absent, like any concept of a subnet tier spanning availability zones.  I did some inexcusably horrible things with variables to get the behavior I wanted.

Stage three: … modules.  Yo, all I wanted to do was refactor a perfectly good working config into modules for VPC, security groups, IAM roles/policies/users/groups/profiles, S3 buckets/configs/policies, autoscaling groups, policies, etc., and my entire fucking world just took a dump for a week.  SURE, I was a TF noob making noob mistakes, but I could not believe how hard it was to debug literally anything.

This is when I started tweeting sad things.

The best (only) way of debugging terraform was just reading really, really carefully, copy-pasting back and forth between multiple files for hours to get all the variables/outputs/interpolation correct.  Many of the error messages lack any context or line numbers to help you track down the problem.  Take this prime specimen:

Error downloading modules: module aws_vpc: Error loading .terraform
/modules/77a846c64ead69ab51558f8c5be2cc44/main.tf: Error reading 
config for aws_route_table[private]: parse error: syntax error

Any guesses?  Turned out to be a stray ‘}’ on line 105 in a different file, which HCL vim syntax highlighting thought was A-OK.  That one took me a couple hours to track down.

Or this:

aws_security_group.zookeeper_sg: cannot parse '' as int: 
strconv.ParseInt: parsing "": invalid syntax

Which *obviously* means you didn’t explicitly define some inherited port as an int, so there’s a string lurking somewhere in your tf tree.  (*Obviously* in retrospect, I mean, after quite a long time poking haplessly about.)
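A minimal sketch of the fix, with hypothetical names throughout: giving the variable a numeric default keeps the value an int all the way down the module tree, instead of arriving as an empty string in a nested module.

```hcl
# Hypothetical example: an untyped variable with no value is what
# shows up as "" downstream and triggers the int-parse error.
# A numeric default keeps it an int end to end.
variable "zookeeper_port" {
  default = 2181
}

resource "aws_security_group" "zookeeper_sg" {
  name   = "zookeeper"       # hypothetical names
  vpc_id = "${var.vpc_id}"

  ingress {
    protocol    = "tcp"
    from_port   = "${var.zookeeper_port}"
    to_port     = "${var.zookeeper_port}"
    cidr_blocks = ["10.0.0.0/16"]  # hypothetical range
  }
}
```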

Later on I developed more sophisticated patterns for debugging terraform.  Like, uhhh, bisecting my diffs by commenting out half of the lines I had just added, then gradually re-adding or re-commenting out more lines until the error went away.

Security groups are the worst for this.  SO MANY TIMES I had security group diffs run cleanly with “tf apply”, but then claim to be modifying themselves over and over.  Sometimes I would track this down to having passed in a variable for a port number or range, e.g. cidr_blocks = [“${var.ip_range}”].  Hard-coding the value or setting the type explicitly would resolve the problem.  Or I would accidentally enter a CIDR range that AWS didn’t like: the change would apply and usually it would work, it just didn’t think it had worked, or something.  TF wasn’t aware there was a problem with the run, so it would just keep “successfully” reapplying the diff every time it ran.

Some advice for TF noobs

  • As @phinze told me, “modules are basically like functions — a variable is an argument, output is a return value”.  This was helpful, because it was completely unintuitive to me when I started refactoring.  It took a few days of wrestling with profoundly inscrutable error messages before modules really clicked for me.
  • Strings.  Lists.  You can only pass variables around as strings.  Split() and join() are your friends.  Oh my god, I would sell so many innocent children for the ability to pass maps back and forth between modules.
  • No interpolation for resource names makes me so sad.  Basically you can either use local variable maps, or multiple lists and just … run those index counters like a boss, I guess.
  • Use AWS termination protection for stateful services or anything risky once you’re in production.  Use create_before_destroy on resources like ASG launch configs.  Use “don’t destroy” where you must — but as sparingly as possible, because it basically breaks the entire TF model.
  • If you change the launch config for an ASG, like replacing the AMI for example, you might expect TF to kick off an instance recycle.  It will not.  You must manually terminate the instances to pick up the new config.
  • If you’re collaborating with a team — ok, even if you’re not — find a remote place to store the tfstate files.  Try S3 or github, or shell out for Atlas.  Local state on laptops is for losers.
  • TF_LOG=DEBUG has never once been helpful to me.  I can only assume it was written for the Hashicorp developers, not for those of us using the product.
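For the strings-and-lists bullet above, a sketch of the join/split workaround (module and variable names are hypothetical): flatten a list into one comma-separated string at the module boundary, then split it back apart on the other side.

```hcl
# In the VPC module: flatten the subnet IDs into a single string
# output, since module outputs can only be strings.
output "private_subnet_ids" {
  value = "${join(",", aws_subnet.private.*.id)}"
}

# In the consuming config: split the string back into a list and
# index into it per-resource with element().
resource "aws_instance" "worker" {
  count         = 3
  ami           = "${var.ami_id}"  # hypothetical variable
  instance_type = "m3.medium"
  subnet_id     = "${element(split(",", module.vpc.private_subnet_ids), count.index)}"
}
```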

Errors returned by AWS are completely opaque.  Like “You were not allowed to apply this update”.  Huh?  Ok, well: if it fails on “tf plan”, it’s probably a bad terraform config.  If it plans successfully but fails on “tf apply”, your AWS logic is probably at fault.

Terraform does not do a great job of surfacing AWS errors.

For example, here is some terraform output:

tf output: "* aws_route_table.private: InvalidNatGatewayID.NotFound
: The natGateway ID 'nat-0e5f4ea507113b423' does not exist"

Oh!  Okay, I go to the AWS console and track down that NAT gateway object, and find this:

"Elastic IP address [eipalloc-8583b7e1] is already associated"

Hey, that seems useful!  Seems like TF just timed out bringing up one of the route tables, so it tried assigning the same EIP twice.  It would be nice to surface more of this detail into the terraform output, I hate having to resort to a web console.

Last but not least: one time I changed the comment string on a security group, and “tf plan” went into an infinite dependency loop.  I had to roll back the change, run terraform destroy against all the resources in a bash for loop, and create a new security group with all new instances/ASGs just to change a comment string.  You cannot change comment strings or descriptions for resources without the resources being destroyed.  This seems PROFOUNDLY weird to me.

Wrapper scripts

Lots of people seem to eventually end up wrapping terraform with a script.  Why?

  • There is no concept of a $TF_ROOT.  If you run tf from the wrong directory, it will do some seriously confusing and screwed-up shit (like duping your config, but only some of it).
  • If you’re running in production, you probably do not want people to be able to accidentally “terraform destroy” the world with the wrong environment.
  • You want to enforce test/staging environments, and promotion of changes to production after they are proven good.
  • You want to automatically re-run “tf plan” after “tf apply” and make sure your resources have converged cleanly.
  • So you can add slack hooks, or hipchat hooks, or github hooks.
  • Ummm, have I mentioned that TF can feel somewhat undebuggable?  Several people have told me they create rake tasks or YML templates that they then generate .tf files from, so they can debug those when things break.  (Erf …)

Okay, so …..

God, it feels like I’ve barely gotten started, but I should probably wrap it up.[*]  Like I said, I think terraform is best in class for infra orchestration.  And orchestration is a thing that I desperately want.  Orchestration and composability are the future of infrastructure.

But also terraform is green as fuck and I would not recommend it to anyone who needs a 4-nines platform.

Simply put, there is a lot of shit I don’t want terraform touching.  I want terraform doing as little as possible.  I have already put a bunch of things into terraform that I plan on taking right back out again.  Like, you should never be running a userdata.sh script after TF has bootstrapped a node.  Yuck.  That is a job for your config management, or possibly for packer or a custom firstboot script, but never your orchestration tool!  I have already stuffed a bunch of Route53 DNS into TF and I will be ripping that right back out soon.  Terraform should not be managing any kind of dynamic data.  Or service registries, or configs, or ….

Terraform is fantastic for defining the bones of your infrastructure: your networking, your NAT, your autoscaling groups, the bits that are robust and rarely change.  Or spinning up replicas of production on every changeset via Travis-CI or Jenkins — yay!  Do that!

But I would not feel safe making TF changes to production every day.  And you should delegate any kind of reactive scaling to ASGs or containers+scheduler or whatever.  I would never want terraform to interfere with those decisions on some arbitrary future run.

Which is why it is important to note that terraform does not play nicely with others.  It wants to own the whole thing.  Monkeypatching TF onto an existing infra is kind of horrendous.  It would be nice if you could tag certain resources or products as “this is managed by some other system, thx”.

So: why terraform?

Well, it is fairly opinionated.  It’s actively developed by some really smart people.  It’s moving fast and has most of the momentum in the space.  It’s composable and interacts well with other players iff you make some good life choices.  (Packer, for example, is amazing, by far the most unixy utility of the Hashicorp library.)

Just look at the rate of bug fixes and releases for Terraform vs CloudFormation.  Set aside cross-platform compatibility etc., and just look at the energy of the respective communities.  Not even a fair fight.

Want more?  Ok, well, I would rather adopt one opinionated philosophy for my infrastructure, supplementing where necessary, than duct-tape together fifty different half-baked philosophies about how software and infrastructure should work and spend all my time mediating their conflicts.  (This is one of my beefs with CloudFormation: AWS has no opinions, only slobbering, squidlike, directionless-flopping optionalities.  And while we’re on the topic, it also has nothing like “tf plan” for previewing changes, so THAT’S PRETTY STUPID TOO.)

I do have some concerns about Hashicorp spreading themselves too thin on too many products.  Some of those products probably shouldn’t exist.  Meh.

Terraform has a ways to go before it feels predictable and debuggable, but I think it’s heading in the right direction.  It’s been a fun couple of weeks, and I’m excited to start contributing to the ecosystem and integrating with other components, like chef-solo & consul.


[*] OMGGGGGGG, I never even got to the glorious horrors of the terraforming gem and how you are most definitely going to end up manually editing your *.tfstate files.  Ahahahahaa.

[**] Major thanks to @phinze, @solarce, @ascendantlogic, @lusis, @progrium and others who helped me limp through my first few weeks.
