“Why are my tests so slow?” A list of likely suspects, anti-patterns, and unresolved personal trauma.

Over the past couple of weeks I’ve been tweeting a LOT about lead time to deploy: the interval from when the code gets written to when it has been deployed to production. Also described as “how long it takes you to run CI/CD.”

How important is this?

Fucking central.

Here is a quickie thread from this week, or just go read “Accelerate” like everybody already should have. 🙃

It’s nigh impossible to have a high-performing team with a long lead time, and it gets dramatically easier with a shorter one.

🌷 Shorter is always better.
🌻 One mergeset per deploy.
🌹 Deploy should be automatic.

And it should clock in under 15 minutes, all the way from “merging!” to “deployed!”.

Now some people will nod and agree here, and others freak the fuck out. “FIFTEEN MINUTES?” they squall, and begin accusing me of making things up or working for only very small companies. Nope, and nope. There are no magic tricks here, just high standards and good engineering, and the commitment to maintaining your goals quarter by quarter.

If you get CI/CD right, a lot of other critical functions, behaviors, and intuitions are aligned to be comfortably successful and correct with minimal effort. If you get it wrong, you will spend countless cycles chasing pathologies. It’s like choosing to eat your vegetables every day vs choosing a diet of cake and soda for fifty years, then playing whackamole with all the symptoms manifesting on your poor, mouldering body.

Is this ideal achievable for every team, on every stack, product, customer and regulatory environment in the world? No, I’m not being stupid or willfully blind. But I suggest pouring your time and creative energy into figuring out how closely you can approximate the ideal given what you have, instead of compiling all the reasons why you can’t achieve it.

Most of the people who tell me they can’t do this are quite wrong, turns out. And even if you can’t get down to 15 minutes, ANY reduction in lead time will pay out massive, compounding benefits to your team and adjacent teams forever and ever.

So — what was it you said you were working on right now, exactly? that was so important? 🤔

“Cutting my build time by 90%!” — you

Huzzah. 🤠

So let’s get you started! Here, courtesy of my twitterfriends, is a long compiled list of Likely Suspects and CI/CD Offenders: anti-patterns and unresolved personal pain & suffering to hunt down and question when your build gets slow.

✨15 minutes or bust, baby!✨

 
Where it all started: what keeps you from getting under 15 minute CI/CD runs?

Generally good advice.

  • Instrument your build pipeline with spans and traces so you can see where all your time is going. ALWAYS. Instrument. (See the sketch just after this list.)
  • Order tests by time to execute and likelihood of failure.
  • Don’t run all tests, only tests affected by your change
  • Similarly, reduce build scope; if you only change front-end code, only build/test/deploy the front end, and for heaven’s sake don’t fuss with all the static asset generation
  • Don’t hop regions or zones any more than you absolutely must.
  • Prune and expire tests regularly, don’t wait for it to get Really Bad
  • Combine functionality of tests where possible — tests need regular massages and refactors too
  • Pipeline, pipeline, pipeline tests … with care and intention
  • You do not need multiple non-production environments in your CI/CD process. Push your artifacts to S3 and pull them down from production. Fight me on this
  • Pull is preferable to push. (see below)
  • Set a time elapsed target for your team, and give it some maintenance any time it slips by 25%
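
To make that first bullet concrete, here’s a bare-bones sketch of timing each pipeline step and emitting one structured event per step. It’s plain Python with made-up step names and commands; in real life you’d ship these events to your tracing backend (Honeycomb, an OTLP endpoint, whatever) instead of printing them.

```python
import json
import subprocess
import time
import uuid

TRACE_ID = uuid.uuid4().hex  # one trace per pipeline run

def run_step(name, cmd):
    """Run one build step and emit a span-like event with its duration."""
    start = time.monotonic()
    result = subprocess.run(cmd, capture_output=True, text=True)
    event = {
        "trace_id": TRACE_ID,
        "name": name,
        "duration_ms": round((time.monotonic() - start) * 1000),
        "exit_code": result.returncode,
        "cmd": " ".join(cmd),
    }
    print(json.dumps(event))  # swap for a real exporter in your pipeline
    result.check_returncode()

# hypothetical pipeline steps, purely for illustration
run_step("npm_install", ["npm", "ci"])
run_step("unit_tests", ["npm", "test"])
run_step("build", ["npm", "run", "build"])
```

Once every step emits an event like this, “where did my 15 minutes go?” stops being a guessing game.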

The usual suspects

  • tests that take several seconds to init
  • setup/teardown of databases (HINT try ramdisks)
  • importing test data, seeding databases, sometimes multiple times
  • rsyncing sequentially
  • rsyncing in parallel, all pulling from a single underprovisioned source
  • long git pulls (eg cloning whole repo each time)
  • CI rot (eg large historical build logs)
  • poor teardown (eg prior stuck builds still running, chewing CPU, or artifacts bloating over time)
  • integration tests that spin up entire services (eg elasticsearch)
  • npm install taking 2-3 minutes
  • bundle install taking 5 minutes
  • resource starvation of CI/CD system
  • not using containerized build pipeline
  • …(etc)
 
Continuous deployment to industrial robots in prod?? Props, man.

Not properly separating the streams of “Our Software” (changes constantly) vs “infrastructure” (changes rarely)

  • running cloudformation to setup new load balancers, dbs, etc for an entire acceptance environment
  • docker pulls, image builds, docker pushes, container spin up for tests

“Does this really go here?”

  • packaging large build artifacts into different format for distribution
  • slow static source code analysis tools
  • trying to clone production data back to staging, or reset dbs between runs
  • launching temp infra of sibling services for end-to-end tests, running canaries
  • selenium and other UX tests, transpiling and bundling assets

“Have a seat and think about your life choices.”

  • excessive number of dependencies
  • extreme legacy dependencies (things from the 90s)
  • tests with “sleep” in them
  • entirely too large frontends that should be broken up into modules

“We regret to remind you that most AWS calls operate at the pace of ‘Infrastructure’, not ‘Software’”

  • AWS CodeBuild has several minutes of provisioning time before you’re even executing your own code — even a few distinct jobs in a pipeline and you might suffer 15 min of waiting for CodeBuild to do actual work
  • building a new AMI
  • using EBS
  • spinning up EC2 nodes .. sequentially 😱
  • cool it with the AWS calls basically



A few responses were oozing with some unresolved trauma, lol.

Natural Born Opponents: “Just cache it” and “From the top!”

  • builds install correct version of toolchain from scratch each time
  • rebuilding entire project from source every build
  • failure to cache dependencies across runs (eg npm cache not set properly)

“Parallelization: the cause of, and solution to, all CI problems”

  • shared test state, which prevents parallel testing due to flakiness and non-deterministic test results (see the sketch after this list)
  • not parallelizing tests
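
Here’s one hedged sketch of untangling shared test state, assuming pytest and SQLite purely for illustration: give every test its own throwaway database so tests stop fighting over shared rows and can safely run in parallel (e.g. under pytest-xdist).

```python
import sqlite3
import pytest

@pytest.fixture
def db(tmp_path):
    """Every test gets its own private SQLite file, created fresh and thrown away."""
    conn = sqlite3.connect(str(tmp_path / "test.db"))
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    yield conn
    conn.close()

def test_insert(db):
    db.execute("INSERT INTO users (name) VALUES ('alice')")
    assert db.execute("SELECT count(*) FROM users").fetchone()[0] == 1

def test_starts_empty(db):
    # not affected by test_insert, no matter which worker ran it or in what order
    assert db.execute("SELECT count(*) FROM users").fetchone()[0] == 0
```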
 
 
I have so many questions….

Thanks to @wrd83, @sorenvind, @olitomli, @barney_parker, @dastbe, @myajpitz, @gfodor, @mrz, @rwilcox, @tomaslin, @pwyliu, @runewake2, @pdehlkefor, and many more for their contributions!

P.S. what did I say about instrumenting your build pipeline? For more on honeycomb + instrumentation, see this thread. Our free tier is incredibly generous, btw ☺️

Stay tuned for more long form blog posts on this topic. Coming soon. 🌈

charity

P.S. this blog post is the best thing i’ve ever read about reducing your build time. EVER.


On Call Shouldn’t Suck: A Guide For Managers

There are few engineering topics that provoke as much heated commentary as oncall. Everybody has a strong opinion. So let me say straight up that there are few if any absolutes when it comes to doing this well; context is everything. What’s appropriate for a startup may not suit a larger team. Rules are made to be broken.

That said, I do have some feelings on the matter. Especially when it comes to the compact between engineering and management. Which is simply this:

It is engineering’s responsibility to be on call and own their code. It is management’s responsibility to make sure that on call does not suck. This is a handshake, it goes both ways, and if you do not hold up your end they should quit and leave you.

As for engineers who write code for 24×7 highly available services, it is a core part of their job to support those services in production. (There are plenty of software jobs that do not involve building highly available services, for those who are offended by this.) Tossing it off to ops after tests pass is nothing but a thinly veiled form of engineering classism, and you can’t build high-performing systems by breaking up your feedback loops this way.

Someone needs to be responsible for your services in the off-hours. This cannot be an afterthought; it should play a prominent role in your hiring, team structure, and compensation decisions from the very start. These are decisions that define who you are and what you value as a team.

Some advice on how to organize your on call efforts, in no particular order.

  • It is easier to keep yourself from falling into an operational pit of doom than it is to claw your way out of one. Make good operational hygiene a priority from the start. Value good, clean, high-level abstractions that allow you to delegate large swaths of your infrastructure and operational burden to third parties who can do it better than you — serverless, AWS, *aaS, etc. Don’t fall into the trap of disrespecting operations engineering labor, it’s the only thing that can save you.
  • Invest in good release and deploy tooling. Make this part of your engineering roadmap, not something you find in the couch cushions. Get code into production within minutes after merging, and watch how many of your nightmares melt away or never happen.
  • Invest in good instrumentation and observability. Impress upon your engineers that their job is not done when tests pass; it is not done until they have watched users using their code in production. Promote an ownership mentality over the full software life cycle. This is how dev.to did it.
  • Construct your feedback loops thoughtfully. Try to alert the person who made the broken change directly. Never send an alert to someone who isn’t fully equipped and empowered to fix it.
  • When an engineer is on call, they are not responsible for normal project work — period. That time is sacred and devoted to fixing things, building tooling, and creating guard-rails to protect people from themselves. If nothing is on fire, the engineer can take the opportunity to fix whatever has been annoying them. Allow for plenty of agency and following one’s curiosity, wherever it may lead, and it will be a special treat.
  • Closely track how often your team gets alerted. Take ANY out-of-hours alert seriously, and prioritize the work to fix it. Night time pages are heart attacks, not diabetes. (A rough sketch of tracking this follows the list.)
  • Consider joining the on call rotation yourself! If nothing else, generously pinch hit and be an eager and enthusiastic backup on the regular.
  • Reliability work and technical debt are not secondary to product work. Budget them into your roadmap, right alongside your features and fixes. Don’t plan so tightly that you have no flex for the unexpected. Don’t be afraid to push back on product and don’t neglect to sell it to your own bosses. People’s lives are in your hands; this is what you get paid to do.
  • Consider making after-hours on call fully-elective. Why not? What is keeping you from it? Fix those things. This is how Intercom did it.
  • Depending on your stage and available resources, consider compensating for it. This doesn’t have to be cash, it could be a Friday off the week after every on call rotation. The more established and funded a company you are, the more likely you should do this in order to surface the right incentives up the org chart.
  • Once you’ve dug yourself out of firefighting mode, invest in SLOs (Service Level Objectives). SLOs and observability are the mature way to get out of reactive mode and plan your engineering work based on tradeoffs and user impact.
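
As a rough illustration of tracking alert frequency: the paging-tool export and CSV shape below are made up, but a few lines of Python will tell you how often your team actually gets woken up, which is the number to take seriously.

```python
import csv
from datetime import datetime

def out_of_hours(ts):
    # crude definition: weekends, or before 9am / after 6pm local time
    return ts.weekday() >= 5 or ts.hour < 9 or ts.hour >= 18

# hypothetical export from your paging tool: one row per page, ISO 8601 timestamps
with open("pages.csv") as f:
    pages = [datetime.fromisoformat(row["timestamp"]) for row in csv.DictReader(f)]

night_pages = [ts for ts in pages if out_of_hours(ts)]
print(f"{len(night_pages)} out-of-hours pages out of {len(pages)} total this period")
```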

I believe it is thoroughly possible to construct an on call rotation that is 100% opt-in, a badge of pride and accomplishment, something that brings meaning and mastery to people’s engineering roles and ties them emotionally to their users. I believe that being on call is something that you can genuinely look forward to.

But every single company is a unique complex sociotechnical snowflake. Flipping the script on whether on call is a burden or a blessing will require a unique solution, crafted to meet your specific needs and drawing on your specific history. It will require tinkering. It will take maintenance.

Above all: ✨RAISE YOUR STANDARDS✨ for what you expect from yourselves. Your greatest enemy is how easily you accept the status quo, and then make up excuses for why it is necessarily this way. You can do better. I know you can.

There is lots and lots of prior art out there when it comes to making on call work for you, and you should research it deeply. Watch some talks, read some pieces, talk to some people. But then you’ll have to strike out on your own and try something. Cargo-culting someone else’s solution is always the wrong answer.

Any asshole can write some code; owning and tending complex systems for the long run is the hard part. How you choose to shoulder this burden will be a deep reflection of your values and who you are as a team.

And if your on call experience is mandatory and severely life-impacting, and if you don’t take this dead seriously and fix it ASAP? I hope your team will leave you, and go find a place that truly values their time and sleep.

 


Questionable Advice #2: How Do I Get My Team Into Observability?

Welcome to the second installment of my advice column! Last time we talked about the emotional impact of going back to engineering after a stint in management. If you have a question you’d like to ask, please email me or DM it to me on twitter.

Hi Charity! I hope it’s ok to just ask you this… 

I’m trying to get our company more aware of observability and I’m finding it difficult to convince people to look more into it. We currently don’t have the kind of systems that would require it much – but we will in future and I want us to be ahead of the game. 

If you have any tips about how to explain this to developers (who are aware that quality is important but don’t always advocate for it / do it as much as I’d prefer), or have concrete examples of “here’s a situation that we needed observability to solve – and here’s how we solved it”, I’d be super grateful. 

If this is too much to ask, let me know too 🙂 

I’ve been talking to Abby Bangser a lot recently – and I’m “classifying” observability as “exploring in production” in my mental map – if you have philosophical thoughts on that, I’d also love to hear them 🙂

alex_schl

 

Dear Alex,

Everyone’s systems are broken. Not just yours!

Yay, what a GREAT note!  I feel like I get asked some subset or variation of these questions several times a week, and I am delighted for the opportunity to both write up a response for you and post it for others to read.  I bet there are orders of magnitude more people out there with the same questions who *don’t* ask, so I really appreciate those who do. <3

I want to talk about the nuts and bolts of pitching to engineering teams and shepherding technical decisions like this, and I promise I will offer you some links to examples and other materials. But first I want to examine some of the assumptions in your note, because they elegantly illuminate a couple of common myths and misconceptions.

Myth #1: you don’t need observability til you have problems of scale

First of all, there’s this misconception that observability is something you only need when you have really super duper hard problems, or that it’s only justified when you have microservices and large distributed systems or crazy scaling problems.  No, no no nononono. 

There may come a point where you are ABSOLUTELY FUCKED if you don’t have observability, but it is ALWAYS better to develop with it.  It is never not better to be able to see what the fuck you are doing!  The image in my head is of a hiker with one of those little headlamps on that lets them see where they’re putting their feet down.  Most teams are out there shipping opaque, poorly understood code blindly — shipping it out to systems which are themselves crap snowballs of opaque, poorly understood code. This is costly, dangerous, and extremely wasteful of engineering time.


Ever seen an engineering team of 200, and struggled to understand how the product could possibly need more than one or two teams of engineers? They’re all fighting with the crap snowball.

Developing software with observability is better at ANY scale.  It’s better for monoliths, it’s better for tiny one-person teams, it’s better for pre-production services, it’s better for literally everyone always.  The sooner and earlier you adopt it, the more compounding value you will reap over time, and the more of your engineers’ time will be devoted to forward progress and creating value.

Myth #2: observability is harder and more technically advanced than monitoring

Actually, it’s the opposite — it’s much easier.  If you sat a new grad down and asked them to instrument their code and debug a small problem, it would be fairly straightforward with observability. Observability speaks the native language of variables, functions and API endpoints, the mental model maps cleanly to the request path, and you can straightforwardly ask any question you can come up with. (A key tenet of observability is that it gives an engineer the ability to ask any question, without having had to anticipate it in advance.)

With metrics and logging libraries, on the other hand, it’s far more complicated. You have to make a bunch of awkward decisions about where to emit various types of statistics, and it is terrifyingly easy to make poor choices (with terminal performance implications for your code and/or the remote data source).  When asking questions, you are locked in to asking only the questions you chose to ask a long time ago. You spend a lot of time translating the relationships between code and low-level system resources, and since you can’t break down by users/apps you are blocked from asking the most straightforward and useful questions entirely!  

Doing it the old way Is. Fucking. Hard.  Doing it the newer way is actually much easier, save for the fact that it is, well, newer — and thus harder to google examples for copy-pasta. But if you’re saturated in decades of old school ops tooling, you may have some unlearning to do before observability seems obvious to you.
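
Here’s a toy illustration of that difference (all names here are made up): with a pre-aggregated counter you decided long ago that “how many errors?” was the only question, and you can never ask “whose errors, on which endpoint?”; with a wide event per request you keep the raw fields and decide later.

```python
import json
import random

def do_work(user_id, endpoint):
    # stand-in for real application logic
    if random.random() < 0.1:
        raise RuntimeError("upstream timeout")

def emit(event):
    print(json.dumps(event))  # stand-in for sending to your event store

# Old way: a pre-aggregated counter. User, endpoint, and error type are gone forever;
# the only question left to ask is "how many errors?"
error_count = 0

# Newer way: one wide event per request keeps the high-cardinality fields, so
# "which users hit which errors on which endpoints?" stays answerable later.
def handle_request(user_id, endpoint):
    global error_count
    event = {"user_id": user_id, "endpoint": endpoint, "error": False}
    try:
        do_work(user_id, endpoint)
    except Exception as exc:
        error_count += 1
        event.update(error=True, error_type=type(exc).__name__)
    emit(event)

for i in range(20):
    handle_request(user_id=f"user-{i % 5}", endpoint="/checkout")
```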

Myth #3: observability is a purely technical solution

To be clear, you can just add an observability tool to your stack and go on about your business — same old things, same old way, but now with high cardinality!

You can, but you shouldn’t.  

These are sociotechnical systems and they are best improved with sociotechnical solutions.  Tools are an absolutely necessary and inextricable part of it.  But so are on call rotations and the fundamental virtuous feedback loop of you build it, you run it.  So are code reviews, monitoring checks, alerts, escalations, and a blameless culture.  So are managers who allocate enough time away from the product roadmap to truly fix deep technical rifts and explosions, even when it’s inconvenient, so the engineers aren’t in constant monkeypatch mode.

I believe that observability is a prerequisite for any major effort to have saner systems, simply because it’s so powerful being able to see the impact of what you’ve done.  In the hands of a creative, dedicated team, simply wearing a headlamp can be transformational.

Observability is your five senses for production.

You’re right on the money when you ask if it’s about exploring production, but you could also use words that are even more basic, like “understanding” or “inspecting”.  Observability is to software systems as a debugger is to software code.  It shines a light on the black box.  It allows you to move much faster, with more confidence, and catch bugs much sooner in the lifecycle — before users have even noticed.  It rewards you for writing code that is easy to illuminate and understand in production.

So why isn’t everyone already doing it?  Well, making the leap isn’t frictionless.  There’s a minimal amount of instrumentation to learn (easier than people expect, but it’s nonzero) and then you need to learn to see your code through the lens of your own instrumentation.  You might need to refactor your use of older tools, such as metrics libraries, monitoring checks and log lines.  You’ll need to learn another query interface and how it behaves on your systems.  You might find yourself amending your code review and deploy processes a bit.  

Nothing too terrible, but it’s all new.  We hate changing our tool kits until absolutely fucking necessary.  Back at Parse/Facebook, I actually clung to my sed/awk/shell wizardry until I was professionally shamed into learning new ways when others began debugging shit faster than I could.  (I was used to being the debugger of last resort, so this really pissed me off.)  So I super get it!  So let’s talk about how to get your team aligned and hungry for change.

Okay okay okay already, how do I get my team on board?

If we were on the phone right now, I would be peppering you with a bunch of questions about your organization.  Who owns production?  Who is on call?  Who runs the software that devs write?  What is your deploy process, and how often does it get updated, and by who?  Does it have an owner?  What are the personalities of your senior folks, who made the decisions to invest in the current tools (and what are they), what motivates them, who are your most persuasive internal voices?  Etc.  Every team is different.  <3

There’s a virtuous feedback loop you need to hook up and kickstart and tweak here, where the people with the original intent in their heads (software engineers) are also informed and motivated, i.e. empowered to make the changes and personally impacted when things are broken. I recommend starting by putting your software engineers on call for production (if you haven’t).  This has a way of convincing even the toughest cases that they have a strong personal interest in quality and understandability. 

Pay attention to your feedback loop and the alignment of incentives, and make sure your teams are given enough time to actually fix the broken things, and motivation usually isn’t a problem.  (If it is, then perhaps another feedback loop is lacking: your engineers feeling sufficiently aligned with your users and their pain.  But that’s another post.)

Technical ownership over technical outcomes

I appreciate that you want your team to own the technical decisions.  I believe very strongly that this is the right way to go.  But it doesn’t mean you can’t have influence or impact, particularly in times like this. 

It is literally your job to have your head up, scanning the horizon for opportunities and relevant threats.  It’s their job to be heads down, focusing on creating and delivering excellent work.  So it is absolutely appropriate for you to flag something like observability as both an opportunity and a potential threat, if ignored.

If I were in your situation and wanted my team to check out some technical concept, I might send around a great talk or two and ask folks to watch it, and then maybe schedule a lunchtime discussion.  Or I might invite a tech luminary in to talk with the team, give a presentation and answer their questions.  Or schedule a hack week to apply the concept to a current top problem, or something else of that nature.

But if I really wanted them to take it fucking seriously, I would put my thumb on the scale.  I would find myself a champion, load them up with context, and give them ample time and space to skill up, prototype, and eventually present to the team a set of recommendations.  (And I would stay in close contact with them throughout that period, to make sure they didn’t veer too far off course or lose sight of my goals.)

  1. Get a champion.

    Ideally you want to turn the person who is most invested in the old way of doing things — the person who owns the ELK cluster, say, or who was responsible for selecting the previous monitoring toolkit, or the goto person for ops questions — from your greatest obstacle into your proxy warrior.  This only works if you know that person is open-minded and secure enough to give it a fair shot & publicly change course, has sufficiently good technical judgment to evaluate and project into the future, and has the necessary clout with their peers.  If they don’t, or if they’re too afraid to buck consensus: pick someone else.

  2. Give them context.  

    Take them for a long walk.  Pour your heart and soul out to them.  Tell them what you’ve learned, what you’ve heard, what you hope it can do for you, what you fear will happen if you don’t.  It’s okay to get personal and to admit your uncertainties.  The more context they have, the better the chance they will come out with an outcome you are happy with.  Get them worried about the same things that worry you, get them excited about the same possibilities that excite you.  Give them a sense of the stakes. 

    And don’t forget to tell them why you are picking them — because they are listened to by their peers, because they are already expert in the problem area, because you trust their technical judgment and their ability to evaluate new things — all the reasons for picking them will translate well into the best kind of flattery — the true kind.  

  3. Give them a deadline.

    A week or two should be plenty.  Most likely, the decision is not going to be unilaterally theirs (this also gives you a bit of wiggle room should they come back going “ah no ELK is great forever and ever”), but their recommendations should carry serious weight with the team and technical leadership.  Make it clear what sort of outcome you would be very pleased with (e.g. a trial period for a new service) and what reasons you would find compelling for declining to pursue the project (i.e. your tech is unsupported, cost prohibitive, etc).  Ideally they should use this time to get real production data into the services they are testing out, so they can actually experience and weigh the benefits, not just read the marketing copy.

As a rule of thumb, I always assume that managers can’t convince engineers to do things: only other engineers can.  But what you can do instead is set up an engineer to be your champion.  And then just sit quietly in the corner, nodding, with an interested look on your face.

The nuclear option


if you <3 prod,
prod will <3 you back

You have one final option.  If there is no appropriate champion to be found, or insufficient time, or if you have sufficient trust with the team that you judge it the right thing to do: you can simply order them to do something your way.  This can feel squicky. It’s not a good habit to get into.  It usually results in things being done a bit slower, more reluctantly, more half-assedly. And you sacrifice some of your power every time you lean on your authority to get your team to do something.

But it’s just as bad for a leader to take it off the table entirely.

Sometimes you will see things they can’t.  If you cannot wield your power when circumstances call for it, then you don’t fucking have real power — you have unilaterally disarmed yourself, to the detriment of your org.  You can get away with this maybe twice a year, tops. 

But here’s the thing: if you order something to be done, and it turns out in the end that you were right?  You earn back all the power you expended on it plus interest.  If you were right, unquestionably right in the eyes of the team, they will respect you more for having laid down the law and made sure they did the right thing.

xo

charity

Some useful resources:

 


Love (and Alerting) in the Time of Cholera (and Observability)

I made a vow this year to post one blog post a month, then I didn’t post anything at all from May to September.  I have some catching up to do.  😑   I’ve also been meaning to transcribe some of the twitter rants that I end up linking back to into blog posts, so if there’s anything you especially want me to write about, tell me now while I’m in repentance mode.

This is one request I happened to make a note of because I can’t believe I haven’t already written it up!  I’ve been saying the same thing over and over in talks and on twitter for years, but apparently never a blog post.

The question is: what is the proper role of alerting in the modern era of distributed systems?  Has it changed?  What are the updated best practices for alerting?

It’s a great question.  I want to wax philosophically about some stuff, but first let me briefly outline the way to modernize your alerting best practices:

  1. implement observability
  2. implement SLOs and/or end-to-end checks that traverse key code paths and correlate to user-impacting events
  3. create a secondary channel (tasks, ticketing system, whatever) for “things that on call should look at soon, but are not impacting users yet” which does not page anyone, but which on call is expected to look at (at least) first thing in the morning, last thing in the evening, and midday
  4. move as many paging alerts as possible to the secondary channel, by engineering your services to auto-remediate or run in degraded mode until they can be patched up
  5. wake people up only for SLOs and health checks that correlate to user-impacting events

Or, in an even shorter formulation: delete all your paging alerts, then page only on e2e alerts that mean users are in pain. Rely on your debugging tools for debugging, and reserve paging for user pain.
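
Here’s a rough sketch of what that routing decision can look like, using the standard burn-rate idea (budget burn = observed error rate divided by the error budget). The SLO target and thresholds are illustrative, not gospel.

```python
SLO_TARGET = 0.999             # 99.9% of requests should succeed
ERROR_BUDGET = 1 - SLO_TARGET  # so 0.1% of requests are allowed to fail

def burn_rate(bad_requests, total_requests):
    """How fast the error budget is being eaten; 1.0 means exactly on budget."""
    if total_requests == 0:
        return 0.0
    return (bad_requests / total_requests) / ERROR_BUDGET

def route_alert(bad_requests, total_requests):
    """Page a human only for user pain; everything else goes to the secondary channel."""
    rate = burn_rate(bad_requests, total_requests)
    if rate >= 10:   # burning the budget ~10x too fast over this window: wake someone
        return "page"
    if rate >= 1:    # over budget but not on fire: ticket it, look at it soon
        return "ticket"
    return "ignore"

print(route_alert(bad_requests=30, total_requests=10_000))   # 'ticket' (burn rate ~3x)
print(route_alert(bad_requests=500, total_requests=10_000))  # 'page' (burn rate ~50x)
```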

To understand why I advocate deleting all your paging alerts, and when it’s safe to delete them, first we need to understand why we have accumulated so many crappy paging alerts over the years.

Monoliths, LAMP stacks, and death by pagebomb

Here, let’s crib a couple of slides from one of my talks on observability.  Here are the characteristics of older monolithic LAMP-stack style systems, and best practices for running them:

 

The sad truth is, that when all you have is time series aggregates and traditional monitoring dashboards, you aren’t really debugging with science so much as you are relying on your gut and a handful of dashboards, using intuition and scraps of data to try and reconstruct an impossibly complex system state.

This works ok, as long as you have a relatively limited set of failure scenarios that happen over and over again.  You can just pattern match from past failures to current data, and most of the time your intuition can bridge the gap correctly.  Every time there’s an outage, you post mortem the incident, figure out what happened, build a dashboard “to help us find the problem immediately next time”, create a detailed runbook for how to respond to it, and (often) configure a paging alert to detect that scenario.

Over time you build up a rich library of these responses.  So most of the time when you get paged you get a cluster of pages that actually serves to help you debug what’s happening.  For example, at Parse, if the error graph had a particular shape I immediately knew it was a redis outage.  Or, if I got paged about a high % of app servers all timing out in a short period of time, I could be almost certain the problem was due to mysql connections.  And so forth.

Things fall apart; the pagebomb cannot stand

However, this model falls apart fast with distributed systems.  There are just too many failures.  Failure is constant, continuous, eternal.  Failure stops being interesting.  It has to stop being interesting, or you will die.

 

 

 

Instead of a limited set of recurring error conditions, you have an infinitely long list of things that almost never happen …. except that one time they do.  If you invest your time into runbooks and monitoring checks, it’s wasted time if that edge case never happens again.

Frankly, any time you get paged about a distributed system, it should be a genuinely new failure that requires your full creative attention.  You shouldn’t just be checking your phone, going “oh THAT again”, and flipping through a runbook.  Every time you get paged it should be genuinely new and interesting.

And thus you should actually have drastically fewer paging alerts than you used to.

A better way: observability and SLOs.

Instead of paging alerts for every specific failure scenario, the technically correct answer is to define your SLOs (service level objectives) and page only on those, i.e. when you are going to run out of budget ahead of schedule.  But most people aren’t yet operating at this level of sophistication.  (SLOs sound easy, but are unbelievably challenging to do well; many great teams have tried and failed.  This is why we have built an SLO feature into Honeycomb that does the heavy lifting for you.  Currently alpha testing with users.)

If you haven’t yet caught the SLO religion, the alternate answer is that “you should only page on high level end-to-end alerts, the ones which traverse the code paths that make you money and correspond to user pain”.  Alert on the three golden signals: request rate, latency, and errors, and make sure to traverse every shard and/or storage type in your critical path.

That’s it.  Don’t alert on the state of individual storage instances, or replication, or anything that isn’t user-visible.

(To be clear: by “alert” I mean “paging humans at any time of day or night”.  You might reasonably choose to page people during normal work hours, but during sleepy hours most errors should be routed to a non-paging address.  Only wake people up for actual user-visible problems.)
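
And for the non-SLO version, a minimal end-to-end check might look like the sketch below. The URL, timeout, and latency budget are placeholders; the point is that it traverses a real user-facing path and reports golden signals (latency and errors), not the state of some replica.

```python
import time
import urllib.error
import urllib.request

CHECKOUT_URL = "https://example.com/api/checkout/health"  # placeholder endpoint
LATENCY_BUDGET_MS = 2000                                  # placeholder threshold

def end_to_end_check():
    """Hit the user-facing path; report latency and errors, the signals worth paging on."""
    start = time.monotonic()
    ok = True
    try:
        with urllib.request.urlopen(CHECKOUT_URL, timeout=10) as resp:
            ok = resp.status < 400
    except (urllib.error.URLError, OSError):
        ok = False
    return {"ok": ok, "latency_ms": round((time.monotonic() - start) * 1000)}

result = end_to_end_check()
if not result["ok"] or result["latency_ms"] > LATENCY_BUDGET_MS:
    print("user-impacting: page the on call", result)
else:
    print("healthy", result)
```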

Here’s the thing.  The reason we had all those paging alerts was because we depended on them to understand our systems.

Once you make the shift to observability, once you have rich instrumentation and the ability to swiftly zoom in from high level “there might be a problem” to identifying specifically what the errors have in common, or the source of the problem — you no longer need to lean on that scattershot bunch of pagebombs to understand your systems.  You should be able to confidently ask any question of your systems, understand any system state — even if you have never encountered it before.

With observability, you debug by systematically following the trail of crumbs back to their source, whatever that is.  Those paging alerts were a crutch, and now you don’t need them anymore.

Everyone is on call && on call doesn’t suck.

I often talk about how modern systems require software ownership.  The person who is writing the software, who has the original intent in their head, needs to shepherd that code out into production and watch real users use it.  You can’t chop that up into multiple roles, dev and ops.  You just can’t.  Software engineers working on highly available systems need to be on call for their code.

But the flip side of this responsibility belongs to management.  If you’re asking everyone to be on call, it is your sworn duty to make sure that on call does not suck.  People shouldn’t have to plan their lives around being on call.  People shouldn’t have to expect to be woken up on a regular basis.  Every paging alert out of hours should be as serious as a heart attack, and this means allocating real engineering resources to keeping tech debt down and noise levels low.

And the way you get there is first invest in observability, then delete all your paging alerts and start over from scratch.

It works.  It really does. 🌈

 

 


Outsource Your O11y: Now Roll It Out And Keep Them Happy (part 3/3)

This is part three of a three-part series of guest posts:

  1. How To Be A Champion, on how to choose a third-party vendor and champion them successfully to your security team.  (George Chamales)
  2. Get Aligned With Security, how to work with your security team to find the best possible outcome for all sides (Lilly Ryan)
  3. Now Roll It Out And Keep Them Happy, on how to operationalize your service by rolling out the integration and maintaining it — and the relationship with your security team — over the long run (Andy Isaacson)

All this pain will someday be worth it.  🙏❤️  charity + friends


“Now Roll It Out And Keep Them Happy”

This is the third in a series of blog posts; previously we analyzed the security challenges of using a third party service, and we worked together with the security team to build empathy to deliver the project.  You might want to read those first, since we are going to build on a lot of the ideas there to ship and maintain this integration.

Ready for launch

You’ve convinced the security team and other stakeholders, you’ve gotten the integration running, you’re getting promising results from dev-test or staging environments… now it’s time to move from proof-of-concept to full implementation.  Depending on your situation this might be a transition from staging to production, or it might mean increasing a feature flipper flag from 5% to 100%, or it might mean increasing coverage of an integration from one API endpoint to cover your entire developer footprint.

Taking into account Murphy’s Law, we expect that some things will go wrong during the rollout.  Perhaps while expanding coverage, a developer realizes that the schema designed to handle the app’s event mechanism can’t represent a scenario, requiring a redesign or a hacky solution.  Or perhaps the metrics dashboard shows elevated error rates from the API frontend, and while there’s no smoking gun, the ops oncall decides to roll back the integration Just In Case it’s causing the incident.

This gives us another chance to practice empathy — while it’s easy, wearing the champion hat, to dismiss any issues found by looking for someone to blame, ultimately this poisons trust within your organization and will hamper success.  It’s more effective, in the long run (and often even in the short run), to find common ground with your peers in other disciplines and teams, and work through to solutions that satisfy everybody.

Keeping the lights on

In all likelihood as integration succeeds, the team will rapidly develop experts and expertise, as well as idiomatic ways to use the product.  Let the experts surprise you; folks you might not expect can step up when given a chance.  Expertise flourishes when given guidance and goals; as the team becomes comfortable with the integration, explicitly recognize a leader or point person for each vendor relationship.  Having one person explicitly responsible for a relationship lets them pay attention to those vendor emails, updates, and avoid the tragedy of the “but I thought *you* were” commons.  This Integration Lead is also a center of knowledge transfer for your organization — they won’t know everything or help every user come up to speed, but they can help empower the local power users in each team to ramp up their teams on the integration.

As comfort grows you will start to consider ways to change your usage, for example growing into new kinds of data.  This is a good time to revisit that security checklist — does the change increase PII exposure to your vendor?  Would the new data lead to additional requirements such as per-field encryption?  Don’t let these security concerns block you from gaining valuable insight using the new tool, but do take the chance to talk it over with your security experts as appropriate.

Throughout this organic growth, the Integration Lead remains core to managing your changing profile of usage of the vendor they shepherd; as new categories of data are added to the integration, the Lead has responsibility to ensure that the vendor relationship and risk profile are well matched to the needs that the new usage (and presumably, business value) is placing on the relationship.

Documenting the Integration Lead role and responsibilities is critical. The team should know when to check in, and writing it down helps it happen.  When new code has a security implication, or a new use case potentially amplifies the cost of an integration, bringing the domain expert in will avoid unhappy surprises.  Knowing how to find out who to bring in, and when to bring them in, will keep your team getting the right eyes on their changes.

Security threats and other challenges change over time, too.  Collaborating with your security team so that they know what systems are in use helps your team take note of new information that is relevant to your business. A simple example is noting when your vendors publish a breach announcement, but more complex examples happen too — your vendor transitions cloud providers from AWS to Azure and the security team gets an alert about unexpected data flows from your production cluster; with transparency and trust such events become part of a routine process rather than an emergency.

It’s all operational

Monitoring and alerting is a fact of operations life, and this has to include vendor integrations (even when the vendor integration is a monitoring product.)  All of your operations best practices are needed here — keep your alerts clean and actionable so that you don’t develop pager fatigue, and monitor performance of the integration so that you don’t get blindsided by a creeping latency monster in your APIs.

Authentication and authorization are changing as the threat landscape evolves and industry moves from SMS verification codes to U2F/WebAuthn.  Does your vendor support your SSO integration?  If they can’t support the same SSO that you use everywhere else and can’t add it — or worse, look confused when you mention SSO — that’s probably a sign you should consider a different vendor.

A beautiful sunset

Have a plan beforehand for what needs to be done should you stop using the service.  Got any mobile apps that depend on APIs that will go away or start returning permission errors?  Be sure to test these scenarios ahead of time.

What happens at contract termination to data stored on the service?  Do you need to explicitly delete data when ceasing use?

Do you need to remove integrations from your systems before ending the commercial relationship, or can the technical shutdown and business shutdown run in parallel?

In all likelihood these are contingency plans that will never be needed, and they don’t need to be fully fleshed out to start, but a little bit of forethought can avoid unpleasant surprises.

Year after year

Industry best practice and common sense dictate that you should revisit the security questionnaire annually (if not more frequently). Use this chance to take stock of the last year and check in — are you getting value from the service?  What has changed in your business needs and the competitive landscape? 

It’s entirely possible that a new year brings new challenges, which could make your current vendor even more valuable (time to negotiate a better contract rate!) or could mean you’d do better with a competing service.  Has the vendor gone through any major changes?  They might have new offerings that suit your needs well, or they may have pivoted away from the features you need. 

Check in with your friends on the security team as well; standards evolve, and last year’s sufficient solution might not be good enough for new requirements.

 

Andy thinks out loud about security, society, and the problems with computers on Twitter.


 

❤️ Thanks so much reading, folks.  Please feel free to drop any complaints, comments, or additional tips to us in the comments, or direct them to me on twitter.

Have fun!  Stay (a little bit) Paranoid!!

— charity


Outsource Your O11y: Get Aligned With Security (part 2/3)

This is part two of a three-part series of guest posts:

  1. How To Be A Champion, on how to choose a third-party vendor and champion them successfully to your security team.  (George Chamales)
  2. Get Aligned With Security, how to work with your security team to find the best possible outcome for all sides (Lilly Ryan)
  3. Now Roll It Out And Keep Them Happy, on how to operationalize your service by rolling out the integration and maintaining it — and the relationship with your security team — over the long run (Andy Isaacson)

All this pain will someday be worth it.  🙏❤️  charity + friends


“Get Aligned With Security”

by Lilly Ryan

If your team has decided on a third-party service to help you gather data and debug product issues, how do you convince an often overeager internal security team to help you adopt it?

When this service is something that provides a pathway for developers to access production data, as analytics tools often do, making the case for access to that data can screech to a halt at the mention of the word “production”. Progressing past that point will take time, empathy, and consideration.

I have been on both sides of the “adopting a new service” fence: as a developer hoping to introduce something new and useful to our stack, and now as a security professional who spends her days trying to bust holes in other people’s setups. I understand both sides of the sometimes-conflicting needs to both ship software and to keep systems safe.  

This guide has advice to help you solve the immediate problem of choosing and deploying a third-party service with the approval of your security team.  But it also has advice for how to strengthen the working relationship between your security and development teams over the longer term. No two companies are the same, so please adapt these ideas to fit your circumstances.

Understanding the security mindset

The biggest problems in technology are never really about technology, but about people. Seeing your security team as people and understanding where they are coming from will help you to establish empathy with them so that both of you want to help each other get what you want, not block each other.

First, understand where your security team is coming from. Development teams need to build features, improve the product, understand and ship good code. Security teams need to make sure you don’t end up on the cover of the NYT for data breaches, that your business isn’t halted by ransomware, and that you’re not building your product on a vulnerable stack.

This can be an unfamiliar frame of mind for developers.  Software development tends to attract positive-minded people who love creating things and are excited about the possibilities of new technology. Software security tends to attract negative thinkers who are skilled at finding all the flaws in a system.  These are very different mentalities, and the people who occupy them tend to have very different assumptions, vocabularies, and worldviews.   

But if you and your security team can’t share the same worldview, it will be hard to trust each other and come to agreement.  This is where practicing empathy can be helpful.

Before approaching your security team with your request to approve a new vendor, you may want to run some practice exercises for putting yourselves in their shoes and forcing yourselves to deliberately cultivate a negative thinking mindset to experience how they may react — not just in terms of the objective risk to the business, or the compliance headaches it might cause, but also what arguments might resonate with them and what emotional reactions they might have.

My favourite exercise for getting teams to think negatively is what I call the Land Astronaut approach.

The “Land Astronaut” Game

Imagine you are an astronaut on the International Space Station. Literally everything you do in space has death as a highly possible outcome. So astronauts spend a lot of time analysing, re-enacting, and optimizing their reactions to events, until it becomes muscle memory. By expecting and training for failure, astronauts use negative thinking to anticipate and mitigate flaws before they happen. It makes their chances of survival greater and their people ready for any crisis.

Your project may not be as high-stakes as a space mission, and your feet will most likely remain on the ground for the duration of your work, but you can bet your security team is regularly indulging in worst-case astronaut-type thinking. You and your team should try it, too.

The Game:

Pick a service for you and your team to game out.  Schedule an hour, book a room with a whiteboard, put on your Land Astronaut helmets.  Then tell your team to spend half an hour brainstorming about all the terrible things that can happen to that service, or to the rest of your stack when that service is introduced.  Negative thoughts only!

Start brainstorming together, as outlandishly as possible at first (what happens if their data centre is suddenly overrun by a stampede of elephants?). Eventually you will tire of the extreme worst case scenarios and come to consider more realistic outcomes — some of which you may not have thought of outside the structure of the activity.

After half an hour, or whenever you feel like you’re all done brainstorming, take off your Land Astronaut helmets, sift out the most plausible of the worst case scenarios, and try to come up with answers or strategies that will help you counteract them.  Which risks are plausible enough that you should mitigate them?  Which are you prepared to gamble on never happening?  How will this risk calculus change as your company grows and takes on more exposure?

Doing this with your team will allow you all to practice the negative thinking mindset together and get a feel for how your colleagues in the security team might approach this request. (While this may seem similar to threat modelling exercises you might have done in the past, the focus here is on learning to adopt a security mindset and gaining empathy for this thought process, rather than running through a technical checklist of common areas of concern.)

While you still have your helmets within reach, use your negative thinking mindset to fill out the spreadsheet from the first piece in this series.  This will help you anticipate most of the reasonable objections security might raise, and may help you include useful detail the security team might not have known to ask for.

Once you have prepared your list of answers to George’s worksheet and held a team Land Astronaut session together, you will have come most of the way to getting on board with the way your security team thinks.

Preparing for compromise

You’ve considered your options carefully, you’ve learned how to harness negative thinking to your advantage, and you’re ready to talk to your colleagues in security – but sometimes, even with all of these tools at your disposal, you may not walk away with all of the things you are hoping for.

Being willing to compromise and anticipating some of those compromises before you approach the security team will help you negotiate more successfully.

While your Land Astronaut helmets are still within reach, consider using your negative thinking mindset game to identify areas where you may be asked to compromise. If you’re asking for production access to this new service for observability and debugging purposes, think about what kinds of objections may be raised about this and how you might counter them or accommodate them. Consider continuing the activity with half of the team remaining in the Land Astronaut role while the other half advocates from a positive thinking standpoint. This dynamic will get you having conversations about compromise early on, so that when the security team inevitably raises eyebrows, you are ready with answers.

Be prepared to consider compromises you had not anticipated, and enter into discussions with the security team with as open a mind as possible. Remember the team is balancing priorities of not only your team, but other business and development teams as well.  If you and your security colleagues are doing the hard work to meet each other halfway then you are more likely to arrive at a solution that satisfies both parties.

Working together for the long term

While the previous strategies we’ve covered focus on short-term outcomes, in this continuous-deployment, shift-left world we now live in, the best way to convince your security team of the benefits of a third-party service – or any other decision – is to have them along from day one, as part of the team.

Roles and teams are increasingly fluid and boundary-crossing, yet security remains one of the roles least likely to be considered for inclusion on a software development team. Even in 2019, the task of ensuring that your product and stack are secure and well-defended is often left until the end of the development cycle.  This contributes a great deal to the combative atmosphere that is common.

Bringing security people into the development process much earlier builds rapport and prevents these adversarial, territorial dynamics. Consider working together to build Disaster Recovery plans and coordinating for shared production ownership.

If your organisation isn’t ready for that kind of structural shift, there are other ways to work together more closely with your security colleagues.

Try having members of your team spend a week or two embedded with the security team. You may even consider a rolling exchange – a developer for a security team member – so that developers build the security mindset, and the security team is able to understand the problems your team is facing (and why you are looking at introducing this new service).

At the very least, you should make regular time to meet with the security team, get to know them as people, and avoid springing things on them late in the project when change is hardest.

Riding off together into the sunset…?

If you’ve taken the time to get to know your security team and how they think, you’ll hopefully be able to get what you want from them – or perhaps you’ll understand why their objections were valid, and come up with a better solution that works well for both of you.

Investing in a strong relationship between your development and security teams will rarely lead to the apocalypse. Instead, you’ll end up with a better product, probably some new work friends, and maybe an exciting idea for a boundary-crossing new career in tech.

But this story isn’t over! Once you get the green light from security, you’ll need to think about how to roll your new service out safely, maintain it, and consider its full lifespan within your company.  Which leads us to part three of this series, on rolling it out and maintaining it … both your integration and your relationship with the security team.

 

Lilly Ryan is a pen tester, Python wrangler, and recovering historian from Melbourne. She writes and speaks internationally about ethical software, social identities after death, teamwork, and the telegraph. More recently she has researched the domestic use of arsenic in Victorian England, attempted urban camouflage, reverse engineered APIs, wielded the Oxford comma, and baked a really good lemon shortbread.


Logs vs Structured Events

I got an interesting tweet the other day from @evntdrvn in response to this thread of mine. Paraphrasing,

“So I’ve almost got our group at work up to Step 1 in your observability maturity model, but some of the devs that I work with want to turn OFF our lovely structured logging in prod for informational-level msgs due to their legacy philosophy (‘we only log errors in prod’). The reasons given are mostly philosophical (“I’m a dev and only interested when things error out, I don’t want any other noise in prod logs”, “I don’t want to slow my app down in prod”). Help?!?”

As I was reading this, I was itching to fly out and dive into battle with Eric. I know exactly where his opinionated devs are coming from. I used to say the same things! I even wrote a whole blog post about it.

These developers have internalized a set of rules and best practices for dealing with output data, in the context of “monolith application development in the early 2000s”.

Monolithic systems assumptions

Those systems had many common constraints and assumptions, such as:

  • We have a monolith service, or a very small number of services. We can model the system in our heads.
  • Logging is done to local disk, which can impact performance
  • Disks are expensive
  • Log lines are spat out inline with execution.  A poorly placed printf can take the whole system down.
  • Investigation is rare, and usually means a human reading error logs.
  • Logging is of poor utility for understanding internal states or execution paths; you should just read the code or use a debugger.  (There are few or no network hops between functions.)
  • Logging is mostly useful for detecting certain terminal crash states or connection errors.

Monolithic logging best practices

Therefore:

  • We should be very stingy in what we log
  • Debuggers should be used for understanding internal states of the code
  • Logs are a last resort and record of crash dumps.  We do not expect to use log data in the course of our daily work.  We assume log-related manual investigation will be infrequent and of limited utility.

These were exactly the right lessons to learn in the era of expensive hardware and monolithic repos/artifacts. Many people still work in environments like this, and follow logging best practices like these. God bless, more power to em.

Distributed systems assumptions

But more and more of us face systems that are very different.

  • We have many services, possibly many MANY services. A representative request will have “many” hops across “many” services and routers and proxies and meshes and storage systems.
  • We cannot model the system in our heads; it would be a mistake to try. We rely on tooling as the source of truth for those systems.
  • You may or may not have access to those services, or the systems your code runs on. There may or may not be a logging facility, or a centralized log aggregator. Your only view of the system is through the instrumentation of your code.
  • Disks and system resources are cheap, ephemeral, all but disposable.
  • Data services are similarly cheap.  We can almost entirely silo application performance off from the cost of writing perf data out.
  • Investigation is prohibitively slow and expensive for a human to do by hand. Many of the nodes or processes we need to inspect may no longer even exist, but their past states may still be relevant to us in understanding patterns to the present time.
  • Investigation should usually be done distributedly, across all instantiations of your code, however many there might be — and in real time
  • Investigation requires computation — not just string search. We need to ask questions on the fly involving math and percentiles and breakdowns and group by’s.  And we need access to the raw requests in order to run accurate computations — no pre-aggregates.
  • The hardest part isn’t usually debugging the code, it’s figuring out where is the code you need to debug. Or what the errors or outliers have in common from the perspective of the code.  Fixing the code itself is often comparatively trivial, once found.
  • What even is ‘logging’?
  • What even is ‘local disk’?

This isn’t optional: at some point of complexity or scale or distributedness, it becomes necessary if you want to work with these systems.

Logs can’t help you here.

And you aren’t going to get that kind of explorable data out of loglevel:ERROR, or by chopping up your telemetry into disconnected metrics devoid of context.

You are only going to get this kind of explorable, ad hoc, computation-friendly data if you take a radically new approach to how you output and aggregate telemetry.  You’re going to need to replace your log lines and log levels with a different sort of beast: arbitrarily wide structured events that describe the request and its context, one event per request per service.

If it helps, don’t think of them as log files any more. Think of them as events. Yes, you can stash this stream in a file, but why would you?  on what disk?  will that work for your serverless functions too?  Just stream them over the network to wherever you want to put them.

Log levels are another confusing and unnecessary artifact of yesteryear that you no longer really need. The more you think of structured events as logs, the more tempted you may be to apply the old set of best practices. So just don’t think of them as logs at all.

How to gather and structure your data

Instead of dribbling little pebbles of log effluvia throughout your code, do this (a rough sketch in code follows the list).  (If you’re a honeycomb user, our beelines do it all automatically for you *and* pre-populate the blobs with everything we know of your context.)

  1. Initialize an empty blob at the beginning, when the request first enters the service.
  2. Stuff any and all interesting detail about the request into that blob throughout the lifetime of the request.
    • Any unique id, any high-cardinality variable, any headers passed in, every full query, normalized query, and query execution time; every http call out to a remote service, every http execution time; any shopping cart id, first and last name, execution time — literally anything interesting, append to blob.
  3. Then, when the request is about to exit or error, write the blob off to honeycomb or another service or disk somewhere.
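
Here’s roughly what that pattern looks like in Python. This is a hand-rolled sketch, not any particular vendor’s SDK, and the field names are just examples.

```python
import json
import time
import uuid
from contextlib import contextmanager

def emit(event):
    print(json.dumps(event))  # stand-in for shipping the event over the network

@contextmanager
def request_event(service, route):
    """One wide event per request per service: open the blob, let the handler
    append to it, emit it exactly once on the way out."""
    event = {"service": service, "route": route, "request_id": uuid.uuid4().hex}
    start = time.monotonic()
    try:
        yield event
        event["error"] = None
    except Exception as exc:
        event["error"] = type(exc).__name__
        raise
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000)
        emit(event)

# usage: append anything interesting to the blob as the request progresses
with request_event("checkout", "/cart/checkout") as evt:
    evt["user_id"] = "12345"
    evt["cart_items"] = 3
    evt["db_query_ms"] = 42  # e.g. timed around your actual query
```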

You can see immediately how this method has radically different performance implications and risks than the earlier shotgun spray approach. No more “oops i accidentally put a print line INSIDE a for loop”. The write amplification profile is compressed. Most importantly, the incremental cost of capturing more detail about the request per service is nearly zero.

And now you have the kind of structured data that you can feed into something like a columnar store, or honeycomb, and run ad hoc queries to your heart’s delight.

Distributed systems logging events best practices:

Let’s sum up.  (I’m including links to other past rants on this topic):

Just think.

No more doing multi-line regexps trying to look for the same request ID or user ID doing five suspicious things in a row.

No more regexps at all, for fuck’s sake.

No more bullshit percentiles that were computed at write time by averaging over a bunch of other averages

No more having to jump around from dashboards to logs trying to vainly eyeball correlate one spike with another. No more wondering why no two tools can agree if anything even exists or not

Just gather the detail you need to ask the questions when you need them, and store it in a single source of truth.  It’s that simple.
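
To make “ask the questions when you need them” concrete, here’s a toy query over raw events shaped like the ones above. In practice you’d run this in a columnar store or a tool like Honeycomb rather than a for loop, but the idea is the same: keep the raw fields, compute at read time.

```python
import math

# a handful of fake wide events, one per request (same shape as the sketch above)
events = [
    {"route": "/cart/checkout", "user_id": "12345", "duration_ms": 80,  "error": None},
    {"route": "/cart/checkout", "user_id": "12345", "duration_ms": 950, "error": "Timeout"},
    {"route": "/cart/checkout", "user_id": "99999", "duration_ms": 70,  "error": None},
    {"route": "/login",         "user_id": "12345", "duration_ms": 30,  "error": None},
]

# an ad hoc question nobody anticipated: how slow is checkout for this one user?
checkout = [e for e in events
            if e["route"] == "/cart/checkout" and e["user_id"] == "12345"]
durations = sorted(e["duration_ms"] for e in checkout)
p95 = durations[math.ceil(0.95 * len(durations)) - 1]  # nearest-rank p95
print("p95 checkout latency for user 12345:", p95, "ms")

# ...and what do their failed requests have in common?
print("error types:", {e["error"] for e in checkout if e["error"]})
```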

No need to shame people for having learned best practices that worked perfectly well for a long time.  You can either let them learn the hard way that this transformation is non-optional, or you can help them learn the easy way that it’s simply much better and easier to invest in this telemetry up front.  You seem like a nice enough chap, which is probably why you chose door 2.  (If you wanted to get tougher about it, have a few reformed folks in to tell their horror stories.  Try some ex-twitter engineers.)

The hardest part seems to be getting people to unlearn all the best practices they once learned for dealing with logs.  So just don’t call it logs anymore, if that helps. Call it “structured events”.

– charity.

