Notes on the Perfidy of Dashboards

The other day I said this on twitter —

… which stirred up some Feelings for many people. 🙃  So I would like to explain my opinions in more detail.

Static vs dynamic dashboards

First, let’s define the term. When I say “dashboard”, I mean STATIC dashboards, i.e. collections of metrics-based graphs that you cannot click on to dive deeper or break down or pivot. If your dashboard supports this sort of responsive querying and exploration, where you can click on any graph to drill down and slice and dice the data arbitrarily, then breathe easy — that’s not what I’m talking about. Those are great. (I don’t really consider them dashboards, but I have heard a few people refer to them as “dynamic dashboards”.)

Actually, I’m not even “against” static dashboards. Every company has them, including Honeycomb. They’re great for getting a high level sense of system functioning, and tracking important stats over long intervals. They are a good starting point for investigations. Every company should have a small, tractable number of these which are easily accessible and shared by everyone.

Debugging with dashboards: it’s a trap

What dashboards are NOT good at is debugging, or understanding or describing novel system states.

I can hear some of you now: “But I’ve debugged countless super-hard unknown problems using only static dashboards!” Yes, I’m sure you have. If all you have is a hammer, you CAN use it to drive screws into the wall, but that doesn’t mean it’s the best tool. And It takes an extraordinary amount of knowledge and experience to be able to piece together a narrative that translates low-level system statistics into bugs in your software and back. Most software engineers don’t have that kind of systems experience or intuition…and they shouldn’t have to.

Why are dashboards bad for debugging? Think of it this way: every dashboard is an answer to a question someone asked at some point. Your monitoring system is probably littered with dashboards, thousands and thousands of them, most of whose questions have been long forgotten and many of whose source data streams have long since gone silent.

So you come along trying to investigate something, and what do you do? You start skimming through dashboards, eyes scanning furiously, looking for visual patterns — e.g. any spikes that happened around the same time as your incident. That’s not debugging, that’s pattern-matching. That’s … eyeball racing.

if we did math like we do dashboards

Imagine you’re in a math competition, and you get handed a problem to solve. But instead of pulling out your pencil and solving the equation, step by step, you start hollering out guesses.


That’s what flipping through dashboards feels like to me. You’re riffling through a bunch of graphs that were relevant to some long-ago situation, without context or history, without showing their work. Sometimes you’ll spot the exact scenario, and — huzzah! — the number you shout is correct! But when it comes to unknown scenarios, the odds are not in your favor.

Debugging looks and feels very different from flipping through answers. You ask a question, examine the answer, and ask another question based on the result. (“Which endpoints were erroring? Are all of the requests erroring, or only some? What did they have in common?”, etc.)

You methodically put one foot in front of the other, following the trail of bread crumbs, until the data itself leads you to the answer.

The limitations of metrics and dashboards

Unfortunately, you cannot do that with metrics-based dashboards, because you stripped away the connective tissue of the event back when you wrote the metrics out to disk.

If you happened to notice while skimming through dashboards that your 404 errors spiked at 14:03, and your /payment and /import endpoints started erroring at 14.03, and your database started returning a bunch of mysql errors shortly after 14:00, you’ll probably assume that they’re all related and leap to find more evidence that confirms it.

But you cannot actually confirm that those events are the same ones, not with your metrics dashboards. You cannot drill down from errors to endpoints to error strings; for that, you’d need a wide structured data blob per request. Those might in fact be two or three separate outages or anomalies happening at the same time, or just the tip of the iceberg of a much larger event, and your hasty assumptions might extend the outage for much longer than was necessary.

With metrics, you tend to find what you’re looking for. You have no way to correlate attributes between requests or ask “what are all of the dimensions these requests have in common?”, or to flip back and forth and look at the request as a trace. Dashboards can be fairly effective at surfacing the causes of problems you’ve seen before (raise your hand if you’ve ever been in an incident review where one of the follow up tasks was, “create a dashboard that will help us find this next time”), but they’re all but useless for novel problems, your unknown-unknowns.

Other complaints about dashboards:

They tend to have percentiles like 95th, 99th, 99.9th, 99.99th, etc. Which can cover over a multitude of sins. You really want a tool that allows you to see MAX and MIN, and heatmap distributions.

A lot of dashboards end up getting created that are overly specific to the incident you just had — naming specific hosts, etc — which just creates clutter and toil. This is how your dashboards become that graveyard of past outages.

The most useful approach to dashboards is to maintain a small set of them; cull regularly, and think of them as a list of starter queries for your investigations.

Fred Hebert has this analogy, which I like:

“I like to compare the dashboards to the big display in a hospital room: heartbeat, pressure, oxygenation, etc. Those can tell you when a thing is wrong, but the context around the patient chart (and the patient themselves) is what allows interpretation to be effective. If all we have is the display but none of the rest, we’re not getting anywhere close to an accurate picture. The risk with the dashboard is having the metrics but not seeing or knowing about the rest changing.”

In conclusion

Dashboards aren’t universally awful. The overuse of them just encourages sloppy thinking, and static ones make it impossible for you to follow the plot of an outage, or validate your hypotheses. 🤒  There’s too many of them, and not enough shared consensus. (It would help if, like, new dashboards expired within a month if nobody looked at them again.)

If what you have is “nothing”, even shitty dashboards are far better than no dashboards. But shitty dashboards have been the only game in town for far too long. We need more vendors to think about building for queryability, explorability, and the ability to follow a trail of breadcrumbs. Modern systems are going to demand more and more of this approach.

Nothing < Dashboards < a Queryable, Exploratory Interface

If everyone out there who slaps “observability” on their web page also felt the responsibility to add an observability-enabling interface to their tool, one that would let users explore and identify unknown-unknowns, we would all be in a far better place. 🙂






Notes on the Perfidy of Dashboards

On Call Shouldn’t Suck: A Guide For Managers

There are few engineering topics that provoke as much heated commentary as oncall. Everybody has a strong opinion. So let me say straight up that there are few if any absolutes when it comes to doing this well; context is everything. What’s appropriate for a startup may not suit a larger team. Rules are made to be broken.

That said, I do have some feelings on the matter. Especially when it comes to the compact between engineering and management. Which is simply this:

It is engineering’s responsibility to be on call and own their code. It is management’s responsibility to make sure that on call does not suck. This is a handshake, it goes both ways, and if you do not hold up your end they should quit and leave you.

As for engineers who write code for 24×7 highly available services, it is a core part of their job is to support those services in production. (There are plenty of software jobs that do not involve building highly available services, for those who are offended by this.) Tossing it off to ops after tests pass is nothing but a thinly veiled form of engineering classism, and you can’t build high-performing systems by breaking up your feedback loops this way.

Someone needs to be responsible for your services in the off-hours. This cannot be an afterthought; it should play a prominent role in your hiring, team structure, and compensation decisions from the very start. These are decisions that define who you are and what you value as a team.

Some advice on how to organize your on call efforts, in no particular order.

  • It is easier to keep yourself from falling into an operational pit of doom than it is to claw your way out of one. Make good operational hygiene a priority from the start. Value good, clean, high-level abstractions that allow you to delegate large swaths of your infrastructure and operational burden to third parties who can do it better than you — serverless, AWS, *aaS, etc. Don’t fall into the trap of disrespecting operations engineering labor, it’s the only thing that can save you.
  • Invest in good release and deploy tooling. Make this part of your engineering roadmap, not something you find in the couch cushions. Get code into production within minutes after merging, and watch how many of your nightmares melt away or never happen.
  • Invest in good instrumentation and observability. Impress upon your engineers that their job is not done when tests pass; it is not done until they have watched users using their code in production. Promote an ownership mentality over the full software life cycle. This is how did it.
  • Construct your feedback loops thoughtfully. Try to alert the person who made the broken change directly. Never send an alert to someone who isn’t fully equipped and empowered to fix it.
  • When an engineer is on call, they are not responsible for normal project work — period. That time is sacred and devoted to fixing things, building tooling, and creating guard-rails to protect people from themselves. If nothing is on fire, the engineer can take the opportunity to fix whatever has been annoying them. Allow for plenty of agency and following one’s curiosity, wherever it may lead, and it will be a special treat.
  • Closely track how often your team gets alerted. Take ANY out-of-hours-alert seriously, and prioritize the work to fix it. Night time pages are heart attacks, not diabetes.
  • Consider joining the on call rotation yourself! If nothing else, generously pinch hit and be an eager and enthusiastic backup on the regular.
  • Reliability work and technical debt are not secondary to product work. Budget them into your roadmap, right alongside your features and fixes. Don’t plan so tightly that you have no flex for the unexpected. Don’t be afraid to push back on product and don’t neglect to sell it to your own bosses. People’s lives are in your hands; this is what you get paid to do.
  • Consider making after-hours on call fully-elective. Why not? What is keeping you from it? Fix those things. This is how Intercom did it.
  • Depending on your stage and available resources, consider compensating for it.This doesn’t have to be cash, it could be a Friday off the week after every on call rotation. The more established and funded a company you are, the more likely you should do this in order to surface the right incentives up the org chart.
  • Once you’ve dug yourself out of firefighting mode, invest in SLOs (Service Level Objectives). SLOs and observability are the mature way to get out of reactive mode and plan your engineering work based on tradeoffs and user impact.

I believe it is thoroughly possible to construct an on call rotation that is 100% opt-in, a badge of pride and accomplishment, something that brings meaning and mastery to people’s engineering roles and ties them emotionally to their users. I believe that being on call is something that you can genuinely look forward to.

But every single company is a unique complex sociotechnical snowflake. Flipping the script on whether on call is a burden or a blessing will require a unique solution, crafted to meet your specific needs and drawing on your specific history. It will require tinkering. It will take maintenance.

Above all: ✨RAISE YOUR STANDARDS✨ for what you expect from yourselves. Your greatest enemy is how easily you accept the status quo, and then make up excuses for why it is necessarily this way. You can do better. I know you can.

There is lots and lots of prior art out there when it comes to making on call work for you, and you should research it deeply. Watch some talks, read some pieces, talk to some people. But then you’ll have to strike out on your own and try something. Cargo-culting someone else’s solution is always the wrong answer.

Any asshole can write some code; owning and tending complex systems for the long run is the hard part. How you choose to shoulder this burden will be a deep reflection of your values and who you are as a team.

And if your on call experience is mandatory and severely life-impacting, and if you don’t take this dead seriously and fix it ASAP? I hope your team will leave you, and go find a place that truly values their time and sleep.


On Call Shouldn’t Suck: A Guide For Managers

Questionable Advice: War Rooms? Really?!?

My company has recently begun pushing for us to build and staff out what I can only describe as “command centers”. They’re picturing graphs, dashboards…people sitting around watching their monitors all day just to find out which apps or teams are having issues. With your experience in monitoring and observability, and your opinions on teams supporting their own applications…do you think this sounds like a bad idea? What are things to watch out for, or some ways this might all go sideways?

— Anonymous

Jesus motherfucking Christ on a stick. Is it 1995 where you work? That’s the only way I can try and read this plan like it makes sense.

It’s a giant waste of money and no, it won’t work. This path leads into a death spiral where alarms are going off constantly (yet somehow never actually catching the real problems), people getting burned out, and anyone competent will either a) leave or b) refuse to be on call. Sideways enough for you yet?

Snark aside, there are two foundational flaws with this plan.

1) watching graphs is pointless. You can automate that shit, remember?  ✨Computers!✨ Furthermore, this whole monitoring-based approach will only ever help you find the known unknowns, the problems you already know to look for. But most of your actual problems will be unknown unknowns, the ones you don’t know about yet.

2) those people watching the graphs… When something goes wrong, what exactly can they do about it? The answer, unfortunately, is “not much”. The only people who can swiftly diagnose and fix complex systems issues are the people who build and maintain those systems, and those people are busy building and maintaining, not watching graphs.

That extra human layer is worse than useless; it is actively harmful. By insulating developers from the consequences of their actions, you are concealing from them the information they need to understand the consequences of their actions. You are interfering with the most basic of feedback loops and causing it to malfunction.

The best time to find a bug is as soon as possible after writing it, while it’s all fresh in your head. If you let it fester for days, weeks, or months, it will be exponentially more challenging to find and solve. And the best people to find those bugs are the people who wrote them

Helpful? Hope so. Good luck. And if they implement this anyway — leave. You deserve to work for a company that won’t waste your fucking time.

with love, charity.

selfie - 4

Questionable Advice: War Rooms? Really?!?

Questionable Advice: “What’s the critical path?”

Dan Golant asked a great question today: “Any advice/reading on how to establish a team’s critical path?”

I repeated back: “establish a critical path?” and he clarified:

Yea, like, you talk about buttoning up your “critical path”, making sure it’s well-monitored etc. I think that the right first step to really improving Observability is establishing what business processes *must* happen, what our “critical paths” are. I’m trying to figure out whether there are particularly good questions to ask that can help us document what these paths are for my team/group in Eng.

“Critical path” is one of those phrases that I think I probably use a lot. Possibly because the very first real job I ever had was when I took a break from college and worked at (“we handle the world’s email”) — and by “work” I mean, “lived in SF for a year when I was 18 and went to a lot of raves and did a lot of drugs with people way cooler than me”. Then I went back to college, the dotcom boom crashed, and the CP CFO and CEO actually went to jail for cooking the books, becoming the only tech execs I am aware of who actually went to jail.

Where was I.

Right, critical path. What I said to Dan is this: “What makes you money?”

Like, if you could only deploy three end-to-end checks that would perform entire operations on your site and ensure they work at all times, what would they be? what would they do? “Submit a payment” is a super common one; another is new user signups.

The idea here is to draw up a list of the things that are absolutely worth waking someone up to fix immediately, night or day, rain or shine. That list should be as compact and well-defined as possible. This allows you to be explicit about the fact that anything else can wait til morning, or some other less-demanding service level agreement.

And typically the right place to start on this list is by asking yourselves: “what makes us money?” as a proxy for the real questions, which are: “what actions allow us to survive as a business? What do our customers care the absolute most about? What makes us us?” That’s your critical path.

Someone will usually seize this opportunity to argue that absolutely any deterioration in service is worth paging someone immediately to fix it, day or night. They are wrong, but it’s good to flush these assumptions out and have this argument kindly out in the open.

(Also, this is really a question about service level objectives. So if you’re asking yourself about the critical path, you should probably consider buying Alex Hidalgo’s book on SLOs, and you may want to look into the Honeycomb SLO product, the only one in the industry that actually implements SLOs as the Google SRE book defines them (thanks Liz!) and lets you jump straight from “what are our customers experiencing?” to “WHY are they experiencing it”, without bouncing awkwardly from aggregate metrics to logs and back and just … hoping … the spikes line up according to your visual approximations.)

Questionable Advice: “What’s the critical path?”

Love (and Alerting) in the Time of Cholera (and Observability)

I made a vow this year to post one blog post a month, then I didn’t post anything at all from May to September.  I have some catching up to do.  😑   I’ve also been meaning to transcribe some of the twitter rants that I end up linking back to into blog posts, so if Graph Everything, Kittensthere’s anything you especially want me to write about, tell me now while I’m in repentance mode.

This is one request I happened to make a note of because I can’t believe I haven’t already written it up!  I’ve been saying the same thing over and over in talks and on twitter for years, but apparently never a blog post.

The question is: what is the proper role of alerting in the modern era of distributed systems?  Has it changed?  What are the updated best practices for alerting?

It’s a great question.  I want to wax philosophically about some stuff, but first let me briefly outline the way to modernize your alerting best practices:

  1. implement observability
  2. implement SLOs and/or end-to-end checks that traverse key code paths and correlate to user-impacting events
  3. create a secondary channel (tasks, ticketing system, whatever) for “things that on call should look at soon, but are not impacting users yet” which does not page anyone, but which on call is expected to look at (at least) first thing in the morning, last thing in the evening, and midday
  4. move as many paging alerts as possible to the secondary channel, by engineering your services to auto-remediate or run in degraded mode until they can be patched up
  5. wake people up only for SLOs and health checks that correlate to user-impacting events

Or, in an even shorter formulation: delete all your paging alerts, then page only on e2e alerts that mean users are in pain.  Rely on debugging tools for debugging, and paging only when users are in pain.

To understand why I advocate deleting all your paging alerts, and when it’s safe to delete them, first we need to understand why have we accumulated so many crappy paging alerts over the years.

Monoliths, LAMP stacks, and death by pagebomb

Here, let’s crib a couple of slides from one of my talks on observability.  Here are the characteristics of older monolithic LAMP-stack style systems, and best practices for running them:


The sad truth is, that when all you have is time series aggregates and traditional monitoring dashboards, you aren’t really debugging with science so much as you are relying on your gut and a handful of dashboards, using intuition and scraps of data to try and reconstruct an impossibly complex system state.

This works ok, as long as you have a relatively limited set of failure scenarios that happen over and over again.  You can just pattern match from past failures to current data, and most of the time your intuition can bridge the gap correctly.  Every time there’s Graph Everything Unicorn 2x2an outage, you post mortem the incident, figure out what happened, build a dashboard “to help us find the problem immediately next time”, create a detailed runbook for how to respond to it, and (often) configure a paging alert to detect that scenario.

Over time you build up a rich library of these responses.  So most of the time when you get paged you get a cluster of pages that actually serves to help you debug what’s happening.  For example, at Parse, if the error graph had a particular shape I immediately knew it was a redis outage.  Or, if I got paged about a high % of app servers all timing out in a short period of time, I could be almost certain the problem was due to mysql connections.  And so forth.

Things fall apart; the pagebomb cannot stand

However, this model falls apart fast with distributed systems.  There are just too many failures.  Failure is constant, continuous, eternal.  Failure stops being interesting.  It has to stop being interesting, or you will die.




Instead of a limited set of recurring error conditions, you have an infinitely long list of things that almost never happen …. except that one time they do.  If you invest your time into runbooks and monitoring checks, it’s wasted time if that edge case never happens again.

Frankly, any time you get paged about a distributed system, it should be a genuinely new failure that requires your full creative attention.  You shouldn’t just be checking your phone, going “oh THAT again”, and flipping through a runbook.  Every time you get paged it should be genuinely new and interesting.

And thus you should actually have drastically fewer paging alerts than you used to.

A better way: observability and SLOs.

Instead of paging alerts for every specific failure scenario, the technically correct answer is to define your SLOs (service level objectives) and page only on those, i.e. when you are going to run out of budget ahead of schedule.  But most people aren’t yet operating at this level of sophistication.  (SLOs sound easy, but are unbelievably challenging to do well; many great teams have tried and failed.  This is why we have built an SLO feature into Honeycomb that does the heavy lifting for you.  Currently alpha testing with users.)

If you haven’t yet caught the SLO religion, the alternate answer is that “you should only page on high level end-to-end alerts, the ones which traverse the code paths that make you money and correspond to user pain”.  Alert on the three golden signals: request rate, latency, and errors, and make sure to traverse every shard and/or storage type in your critical path.

That’s it.  Don’t alert on the state of individual storage instances, or replication, or anything that isn’t user-visible.

(To be clear: by “alert” I mean “paging humans at any time of day or night”.  You might reasonably choose to page people during normal work hours, but during sleepy hours most errors should be routed to a non-paging address.  Only wake people up for actual user-visible problems.)

Here’s the thing.  The reason we had all those paging alerts was because we depended on them to understand our systems.

Once you make the shift to observability, once you have rich instrumentation and the ability to swiftly zoom in from high level “there might be a problem” to identifying specifically what the errors have in common, or the source of the problem — you no longer need to lean on that scattershot bunch of pagebombs to understand your systems.  You should be able to confidently ask any question of your systems, understand any system state — even if you have never encountered it before.

With observability, you debug by systematically following the trail of crumbs back to their source, whatever that is.  Those paging alerts were a crutch, and now you don’t need them anymore.

Everyone is on call && on call doesn’t suck.

I often talk about how modern systems require software ownership.  The person who is writing the software, who has the original intent in their head, needs to shepherd that code out into production and watch real users use it.  You can’t chop that up into multiple roles, dev and ops.  You just can’t.  Software engineers working on highly available systems need to be on call for their code.Graph Unicorn 4_x4_

But the flip side of this responsibility belongs to management.  If you’re asking everyone to be on call, it is your sworn duty to make sure that on call does not suck.  People shouldn’t have to plan their lives around being on call.  People shouldn’t have to expect to be woken up on a regular basis.  Every paging alert out of hours should be as serious as a heart attack, and this means allocating real engineering resources to keeping tech debt down and noise levels low.

And the way you get there is first invest in observability, then delete all your paging alerts and start over from scratch.

It works.  It really does. 🌈



Love (and Alerting) in the Time of Cholera (and Observability)

Outsource Your O11y: Now Roll It Out And Keep Them Happy (part 3/3)

This is part three of a three-part series of guest posts:

  1. How To Be A Champion, on how to choose a third-party vendor and champion them successfully to your security team.  (George Chamales)
  2. Get Aligned With Security, how to work with your security team to find the best possible outcome for all sides (Lilly Ryan)
  3. Now Roll It Out And Keep Them Happy, on how to operationalize your service by rolling out the integration and maintaining it — and the relationship with your security team — over the long run (Andy Isaacson)

All this pain will someday be worth it.  🙏❤️  charity + friends

“Now Roll It Out And Keep Them Happy”

This is the third in a series of blog posts; previously we analyzed the security challenges of using a third party service, and we worked together with the security team to build empathy to deliver the project.  You might want to read those first, since we are going to build on a lot of the ideas there to ship and maintain this integration.

Ready for launch

You’ve convinced the security team and other stakeholders, you’ve gotten the integration running, you’re getting promising results from dev-test or staging environments… now it’s time to move from proof-of-concept to full implementation.  Depending on your situation this might be a transition from staging to production, or it might mean increasing a feature flipper flag from 5% to 100%, or it might mean increasing coverage of an integration from one API endpoint to cover your entire developer footprint.

Taking into account Murphy’s Law, we expect that some things will go wrong during the rollout.  Perhaps during coverage, a developer realizes that the schema designed to handle the app’s event mechanism can’t represent a scenario, requiring a redesign or a hacky solution.  Or perhaps the metrics dashboard shows elevated error rates from the API frontend, and while there’s no smoking gun, the ops oncall decides to rollback the integration Just In Case it’s causing the incident.

This gives us another chance to practice empathy — while it’s easy, wearing the champion hat, to dismiss any issues found by looking for someone to blame, ultimately this poisons trust within your organization and will hamper success.  It’s more effective, in the long run (and often even in the short run), to find common ground with your peers in other disciplines and teams, and work through to solutions that satisfy everybody.

Keeping the lights on

In all likelihood as integration succeeds, the team will rapidly develop experts and expertise, as well as idiomatic ways to use the product.  Let the experts surprise you; folks you might not expect can step up when given a chance.  Expertise flourishes when given guidance and goals; as the team becomes comfortable with the integration, explicitly recognize a leader or point person for each vendor relationship.  Having one person explicitly responsible for a relationship lets them pay attention to those vendor emails, updates, and avoid the tragedy of the “but I thought *you* were” commons.  This Integration Lead is also a center of knowledge transfer for your organization — they won’t know everything or help every user come up to speed, but they can help empower the local power users in each team to ramp up their teams on the integration.

As comfort grows you will start to consider ways to change your usage, for example growing into new kinds of data.  This is a good time to revisit that security checklist — does the change increase PII exposure to your vendor?  Would the new data lead to additional requirements such as per-field encryption?  Don’t let these security concerns block you from gaining valuable insight using the new tool, but do take the chance to talk it over with your security experts as appropriate.

Throughout this organic growth, the Integration Lead remains core to managing your changing profile of usage of the vendor they shepherd; as new categories of data are added to the integration, the Lead has responsibility to ensure that the vendor relationship and risk profile are well matched to the needs that the new usage (and presumably, business value) is placing on the relationship.

Documenting the Intergation Lead role and responsibilities is critical. The team should know when to check in, and writing it down helps it happen.  When new code has a security implication, or a new use case potentially amplifies the cost of an integration, bringing the domain expert in will avoid unhappy surprises.  Knowing how to find out who to bring in, and when to bring them in, will keep your team getting the right eyes on their changes.

Security threats and other challenges change over time, too.  Collaborating with your security team so that they know what systems are in use helps your team take note of new information that is relevant to your business. A simple example is noting when your vendors publish a breach announcement, but more complex examples happen too — your vendor transitions cloud providers from AWS to Azure and the security team gets an alert about unexpected data flows from your production cluster; with transparency and trust such events become part of a routine process rather than an emergency.

It’s all operational

Monitoring and alerting is a fact of operations life, and this has to include vendor integrations (even when the vendor integration is a monitoring product.)  All of your operations best practices are needed here — keep your alerts clean and actionable so that you don’t develop pager fatigue, and monitor performance of the integration so that you don’t get blindsided by a creeping latency monster in your APIs.

Authentication and authorization are changing as the threat landscape evolves and industry moves from SMS verification codes to U2F/WebAuthn.  Does your vendor support your SSO integration?  If they can’t support the same SSO that you use everywhere else and can’t add it — or worse, look confused when you mention SSO — that’s probably a sign you should consider a different vendor.

A beautiful sunset

Have a plan beforehand for what needs to be done should you stop using the service.  Got any mobile apps that depend on APIs that will go away or start returning permission errors?  Be sure to test these scenarios ahead of time.

What happens at contract termination to data stored on the service?  Do you need to explicitly delete data when ceasing use?

Do you need to remove integrations from your systems before ending the commercial relationship, or can the technical shutdown and business shutdown run in parallel?

In all likelihood these are contingency plans that will never be needed, and they don’t need to be fully fleshed out to start, but a little bit of forethought can avoid unpleasant surprises.

Year after year

Industry best practice and common sense dictate that you should revisit the security questionnaire annually (if not more frequently). Use this chance to take stock of the last year and check in — are you getting value from the service?  What has changed in your business needs and the competitive landscape? 

It’s entirely possible that a new year brings new challenges, which could make your current vendor even more valuable (time to negotiate a better contract rate!) or could mean you’d do better with a competing service.  Has the vendor gone through any major changes?  They might have new offerings that suit your needs well, or they may have pivoted away from the features you need. 

Check in with your friends on the security team as well; standards evolve, and last year’s sufficient solution might not be good enough for new requirements.


Andy thinks out loud about security, society, and the problems with computers on Twitter.


❤️ Thanks so much reading, folks.  Please feel free to drop any complaints, comments, or additional tips to us in the comments, or direct them to me on twitter.

Have fun!  Stay (a little bit) Paranoid!!

— charity

Outsource Your O11y: Now Roll It Out And Keep Them Happy (part 3/3)

Software Sprawl, The Golden Path, and Scaling Teams With Agency


Stop me if you’ve heard this one before.

The company is growing like crazy, your engineering team keeps rising to the challenge, and you are ferociously proud of them.  But some cracks are beginning to show, and frankly you’re a little worried.  You have always advocated for engineers to have broad latitude in technical decisions, including choosing languages and tools.  This autonomy and culture of ownership is part of how you have successfully hired and retained top talent despite the siren song of the Faceboogles.

But recently you saw something terrifying that you cannot unsee: your company is using all the languages, all the environments, all the databases, all the build tools.  Shit!!!  Your ops team is in full revolt and you can’t really blame them.  It’s grown into an unsupportable nightmare and something MUST be done, but you don’t know what or how — let alone how to solve it while retaining the autonomy and personal agency that you all value so highly.

I hear a version of this everywhere I’ve gone for the past year or two.  It’s crazy how often.  I’ve been meaning to write my answer up for ages, and here it (finally) is.

First of all: you aren’t alone.  This is extremely common among high-performing teams, so congratulations.  Really!

There actually seems to be a direct link between teams that give engineers lots of leeway to own their technical decisions and that team’s ability to hire and retain top-tier talent, particularly senior talent.   Everything is a tradeoff, obviously, but accepting somewhat more chaos in exchange for a stronger sense of individual ownership is usually the right one, and leads to higher-performing teams in the long run.

Second, there is actually already a well-trod path out of this hole to a better place, and it doesn’t involve sacrificing developer agency.  It’s fairly simple!  Just five short steps, which I will describe to you now.

How to build a golden path and reverse software sprawl

  1. Assemble a small council of trusted senior engineers.
  2. Task them with creating a recommended list of default components for developers to use when building out new services.  This will be your Golden Path, the path of convergence (and the path of least resistance).
  3. Tell all your engineers that going forward, the Golden Path will be fully supported by the org.  Upgrades, patches, security fixes; backups, monitoring, build pipeline; deploy tooling, artifact versioning, development environment, even tier 1 on call support.  Pave the path with gold.  Nobody HAS to use these components … but if they don’t, they’re on their own.  They will have to support it themselves.
  4. Work with team leads to draw up an umbrella plan for adopting the Golden Path for their current projects as well as older production services, as much as is reasonable or possible or desirable.  Come up with a timeline for the whole eng org to deprecate as many other tools as possible.  Allocate real engineering time to the effort.  Hell, make a party out of it!
  5. After the cutoff date (and once things have stabilized), establish a regular process for reviewing and incorporating feedback about the blessed Path and considering any proposed changes, additions or removals.

There you go.  That’s it.  Easy, right??

(It’s not easy.  I never said it was easy, I said it was simple.  👼🏼)

Your engineers are currently used to picking the best tool for the job by optimizing locally. gpjon What data store has a data model that is easiest for them to fit to their needs?  Which language is fastest for I/O throughput?  What are they already proficient in?  What you need to do is start building your muscles for optimizing globally.  Not in isolation of other considerations, but in conjunction with them.  It will always be a balancing act between optimizing locally for the problem at hand and optimizing globally for operability and general sanity.

(Oh, incidentally, requiring an engineer to write up a proposal any time they want to use a non-standard component, and then defend their case while the council grills them in person — this will be nothing but good for them, guaran-fucking-teed.)

Let’s go into a bit more detail on each of the five points.  But quick disclaimer: this is not a prescription.  I don’t know your system, your team, your cultural land mines or technical interdependencies or anything else about your situation.  I am just telling stories here.

1. Assemble your council

Three is a good number for a council.  More than that gets unwieldy, and may have trouble reaching consensus.  Less than three and you run into SPOFs.  You never want to have a single person making unilateral decisions because a) the decision-making process will be weaker, b) it sets that person up for too much interpersonal friction, and c) it denies your other engineers the opportunity to practice making these kinds of decisions.

  • Your council members need technical breadth more than depth, and should be widely respected by engineers.
  • At least one member should have a long history with the company so they know lots of stupid little details about what’s been tried before and why it failed.
  • At least one member should be deeply versed in practical data and operability concerns.
  • They should all have enough patience and political skill to drive consensus for their decisions.  Absolutely no bombthrowers.

If you’re super lucky, you just tap the three senior technologists who immediately come to mind … your mind and everyone else’s.  If you don’t have this kind of automatic consensus, you may want to let teams or orgs nominate their own representative so they feel they have some say.


2.  Task the council with defining a Golden Path

Your council cannot vanish for a week and then descend from the mountain lugging lists engraved on stone tablets.  The process of discovery and consensus is what validates the result.

The process must include talking to and gathering feedback from your engineers, talking to experts outside the company, talking to teams at other companies who are farther along using that technology, coming up with detailed pro/con lists and reasons for their choices.  Maybe sometimes it includes prototyping something or investigating the technical depths … but yeah no mostly it’s just the talking.

You need your council members to have enough political skill to handle these conversations deftly, building support and driving consensus through the process.  Everybody doesn’t have to love the outcome, but it shouldn’t be a *surprise* to anyone by the end.

3.  Know where you’re going

Your council should create a detailed written plan describing which technologies are going to be supported … and a stab at what “supported” means.  (Ask the experts in each component what the best practices are for backups, versioning, dependency management, etc.)

You might start with something like this:

* Backend lang: Go 1.11           ## we will no longer be supporting
backend scripting languages
* Frontend lang: ReactJS v 16.5
* Primary db: Aurora v 2.0        ## Yes, we know postgres is "better", 
but we have many mysql experts and 0 pg experts except the one guy 
who is going to complain about this.  You know who you are.
* Deploy pipeline: github -> jenkins + docker -> S3 -> custom k8s 
deploy tooling
* Message broker: kafka v 2.10, confluent build
* Mail: SES
* .... etc

Circulate the draft regularly for feedback, especially with eng managers.  Some team reorganization will probably be necessary to bear the new weight of your support specifications, and managers will need some lead time to wrangle this.

This is also a great time to reconceive of the way on call works at your company.  But I am not going to go into all that here.

4. Set a date, draft a plan: go!

Get approval from leadership to devote a certain amount of time to consolidating your stack and paying down a lump sum of tech debt.  It depends on your stage of decay, but a reasonable amount of time might be “25% of engineering time for three months“.  Whatever you agree to, make sure it’s enough to make the world demonstrably better for the humans who run it; you don’t want to leave them with a tire fire or you’ll blow your credibility.

The council and team leads should come up with a rough outer estimate for how long itgpjava would take to rewrite everything and move the whole stack on to the Golden Stack.  (It’s probably impossible and/or would take years, but that’s okay.)  Next, look for the quick wins or swollen, inflamed pain points.

  • If you are running two pieces of functionally similar software, like postgres and mysql, can you eliminate one?
  • If you are managing something yourself that AWS could manage for you (e.g. postfix instead of SES, or kafka instead of kinesis), can you migrate that?
  • If you are managing anything yourself that is not core to your business value, in fact, you should try to not manage it.
  • If you are running any services by hand on an AWS instance somewhere, could you try using a service?
  • If you are running your own monitoring software, etc … can you not?
  • If you have multiple versions of a piece of software, can you upgrade or consolidate on one version?

The hardest parts are always going to be the ones around migrating data or rewriting components.  Not everything is worth doing or can afford to be done in the time span of your project time, and that’s okay.

Next, brainstorm up some carrots.  Can you write templates so that anybody who writes a service using your approved library, magically gets monitoring checks without having to configure anything?  Can you write a wrapper so they get a bunch of end-to-end tests for free?  Anything you can do to delight people or save them time and effort by using your preferred components is worth considering.

(By the way, if you don’t have any engineers devoted to internal tooling, you’re probably way overdue at this point.)

Pay down as much debt as you can, but be pragmatic: it’s better to get rid of five small things than one large thing, from a support perspective.  Your main goal is to shrink the number of types of software your team has to support, particularly databases.

Do look for ways to make it fun, like … running a competition to see who can move the most tools to AWS in a week, or throwing a hack week party, or giving dorky prizes like trophies that entitle you to put your manager on call instead of you for a day, etc.


5. Make the process sustainable

After your target date has come and gone, you probably want to hold a post mortem retrospective and do lots of listening.  (Well — first might I recommend a bubble bath and a bottle of champagne?  But then a post mortem.)

Nothing is ever fixed forever.  The company’s needs are going to expand and contract, gpdiedand people will come and go, because change is the only constant.  So you need to bake some flex into your system.  How are you going to handle the need for changes to the Golden Path?  Monthly discussions?  An email list?  Quarterly meetings with a formal agenda?  I’ve seen people do all of these and more, it doesn’t really matter afaict.

Nobody likes a cabal, though, so the original council should gradually rotate out.  I recommend replacing one person at a time, one per quarter, and rotating in another senior engineer in their place.  This provides continuity while giving others a chance to learn these technical and political skills.

In the end, engineers are still free to use any tool or component at any time, just like before, only now they are solely responsible for it, which puts pressure on them not to do it unless REALLY necessary.  So if someone wants to propose adding a new tool to the default golden path, they can always add it themselves and gain some experience in it before bringing it to the council to discuss a formal place for it.

That’s all folks

See, wasn’t that simple?

(It’s never simple.)

I dearly wish more people would write up their experiences with this sort of thing in detail.  I think engineering teams are too reluctant to show their warts and struggles to the world — or maybe it’s their executives who are afraid?  Dunno.

Regardless, I think it’s actually a highly effective recruiting tool when teams aren’t afraid to share their struggles.  The companies that brag about how awesome they are are the ones who come off looking weak and fragile.  Whereas you can always trust the ones who are willing to laugh about all the ways they screwed up.  Right?

In conclusion, don’t feel like an asshole for insisting on some process here.  There should be friction around adding new components to your stack.  (Add in haste, repent at leisure, as they say.)  Anybody who argues with you probably needs to be exposed to way, way more of the support load for that software.  That’s my professional opinion.

Anyway.  You win or you die.  Good luck with your sprawl.




Software Sprawl, The Golden Path, and Scaling Teams With Agency

On Engineers and Influence

(Based on yesterday’s tweetstorm and the ensuing conversation,

Let’s talk about influence. As an engineer, how do you get influence? What does influence look like, what is it rooted in, how do you wield it or lose it? How is it different from the power and influence you might have as a manager?[0]

This often comes up in the context of ICs who desperately want to become managers in order to have more access to information and influence over decisions. This is a bad signal, though it’s sadly very common.

When that happens, you need to do some soul-searching. Does your org make space for senior ICs to lead and own decisions? Do you have an IC track that runs parallel to the manager track at least as high as director level? Are they compensated equally? Do you  have a career ladder? Are your decision-making processes mysterious to anyone who isn’t a manager? Don’t assume what’s obvious to you is obvious to others; you have to ask around.

If so, it’s probably their own personal baggage speaking. Maybe they don’t believe you. Maybe they’ve only worked in orgs where managers had all the power. Maybe they’ve even worked in lots of places that said the exact same things as you are saying about how ICs can have great impact, but it was all a lie and now they’re burned. Maybe they aren’t used to feeling powerful for all kinds of reasons.

Regardless, people who want to be managers in order to perpetuate a bad power structure are the last people you want to be managers.[1]

But what does engineering influence look like?  How do your powers manifest?

I am going to avoid discussing the overlapping and interconnected issues of gender, race and class, let’s just acknowledge that it’s much more structurally difficult for some to wield power than for others, ok?

The power to create

Doing is the engineering superpower. We create things with just a laptop and our brain! It’s incredible! We don’t have to constantly convince and cajole and coerce others into building on our behalf, we can just build.

This may seem basic, but it matters. Creation is the ur-power from which all our forms of power flow. Nothing gets built unless we agree to build it (which makes this an ethical issue, too).

Facebook had a poster that said “CODE WINS ARGUMENTS”. Problematic in many ways, absolutely. But how many times have you seen a technical dispute resolved by who was willing to do the work? Or “resolved” one way.. then reversed by doing? Doing ends debates. Doing proves theories. Doing is powerful. (And “doing” doesn’t only mean “write code”.)

Furthermore, building software is a creative activity, and doing it at scale is an intensely communal one. As a creative act, we are better builders when we are motivated and inspired and passionate about our work (as compared to say, chopping wood). And as a collaborative act, we do better work when we have high trust and social cohesion.

Engineering ability and judgment, autonomy and sense of purpose, social trust and cooperative behaviors: this is the raw stuff of great engineering. Everybody has a mode or two that they feel most comfortable and authoritative operating from: we can group these roughly into archetypes.

(Examples drawn from some of the stupendously awesome senior engineers I’ve gotten to work with over the years, as well as the ways I loved to fling my weight around as an engineer.)

Archetypes of influence

  • “Doing the work that is desperately hard and desperately needed — and often desperately dull.” SOC2 compliance, backups and restores, terrifying refactors, any auth integration ever: if it’s moving the business forward, they don’t give a shit how dull the work is. If you are this engineer, you have a deep well of respect and gratitude.
  • Debugger of last resort.” Often the engineer who has been there the longest or originally built the system. If you are helpful and cheerful with your history and context, this is a huge asset. (People tend to wildly overestimate this person’s indispensability, actually; please don’t encourage this.)Image result for engineer software meme manager
  • The “expert” archetype is closely related. If you are the deep subject matter expert in some technology component, you have a shit ton of influence over anything that uses or touches that component. (You should stay up on impending changes to retain your edge.)
  • There are people who deliver a bafflingly powerful firehose of sustained output, sometimes making headway on multiple fronts at once. Some work long hours, others just have an unerring instinct for how to maximize impact (this sometimes maps to junior/senior manifestations). Nobody wants to piss off those people. Their consent is critical for … everything. Their participation will often turbo charge a project or pull a foundering effort over the finish line.

Not all influence is rooted in raw technical strength or output.  Just a few of the wide variety of creative/collaborative/interpersonal strengths:

  • Some engineers are infinitely curious, and have a way of consistently sniffing a few steps ahead of the pack. They might seem to be playing around with something pointless, and you want to scold them; then they save your ass from total catastrophe. You learn to value their playing around.
  • Some engineers solve problems socially, by making friends and trading tips and fixes and favors in the industry. Don’t underestimate social debugging, it’s often the quickest path to the right answer.
  • Some are dazzlingly lazy and blow your mind with their elegant shortcuts and corners correctly cut.
  • Some are recruiting magnets, and it’s worth paying their salary just for all the people who want to work with them again.
  • Some are skilled at driving consensus among stakeholders.
  • Some are killer explainers and educators and storytellers.
  • Some are the senior engineer everyone silently wants to grow up to be.
  • Some can tell such an inspiring story of tomorrow that everyone will run off to make it so.
  • Some teach by turning code reviews into a pedagogical art form.
  • Some make everyone around them somehow more productive and effective. Some create relentless forward momentum. Some are good at saying no.

And there are a few special wells of power that bear calling out as such.

  • Engineers who have been managers are worth their weight in gold.  They can translate business goals for junior engineers in their native language with impeccable credibility (something managers never really have, esp in junior engineers’ eyes.). They make strong tech leads, they can carve up projects into components that challenge but do not overwhelm each contributor while hitting deadlines.
  • Some engineers are a royal pain in the ass because they question and challenge every system and hierarchy. But these are sharp, powerful rocks that can polish great teams. Though they do require a strong manager, to channel their energy towards productive dialogue and improvement and keep them from pissing off the whole team.
  • And let’s not forget engineers who are on call. If you have a healthy on call culture,your ownership over production creates a deep, deep well of power and moral authority — to make demands, drive change, to prioritize. On call should not be a shit salad served up to those who can’t refuse, it should be a badge of honor and seriousness shouldered by every engineer who ships code. (And it should not be miserable or regularly life-impacting.)

… I could go on all day. Engineering is such a powerful role and skill set. It’s definitely worth unpacking where your own influence comes from, and understanding how others perceive your strengths.

Most forms of power boil down to “influence, wielded”.

But just banging out code is not enough. You may have credibility, but having it is not the same as using it. To transform influence into power you have to use it.  And the way you use it is by communicating.

What’s locked up in your head has no impact on the rest of us.  You have to get it out.

You can do this in lots of ways: by writing, in 1x1s, conversations with small groups, openly recruiting allies, convincing someone with explicit authority, broadcasting inpublic, etc.

Because engineering is a creative activity, authoritarian power is actually quite brittle and damaging. The only sustainable forms of power are so-called “soft powers” like influencing and inspiring, which is why good managers use their soft power freely and hard power sparingly/with great reluctance. If your leadership invokes authority on the regular, that’s an antipattern.[2]

If you don’t speak up, you don’t have the right to sit and fume over your lack of influence. And speaking up does mean being vulnerable — and sometimes wrong — in front of other people.

This is not a zero-sum game.

Most of you have far more latent power than you realize or are used to wielding, because you don’t feel powerful or don’t recognize what you do in those terms.

Managers may have hard power and authority, but the real meaty decisions about technical delivery and excellence are more properly made by the engineers closest to them. These belong properly to the doers, in large part because they are the ones who have to support the consequences of these decisions.

Power tends to flow towards managers because they are privy to more information. That makes it important to hire managers who are aware of this and lean against it to push power back to others.

In the same way that submissives have ultimate power in healthy BDSM relationships, engineers actually have the ultimate power in healthy teams. You have the ultimate veto: you can refuse to create.  Demand is high for your skills.  You can usually afford to look for better conditions. Many of you probably should.

And when technical and managerial priorities collide, who wins? Ideally you work together to find the best solution for the business and the people. The teams that feel 🔥on fire🔥 always have tight alignment between the two.

Pick your battles.

One final thought. You can have a lot of say in what gets built and how it gets built, if you cultivate your influence and spend it wisely. But you can’t have a say in everything. It doesn’t work that way.

Think of it like @mcfunley’s famous “innovation tokens”, but for attention and fucks given.
Image result for engineer software meme
The more you use your influence for good outcomes, the more you build up over time, yes … but it’s a precision tool, not background noise. Imagine someone trying to give you a massage by laying down on your whole back instead of pushing their elbow or hand into knots and trigger points. A too-broad target will diffuse your force and limit your potential impact.

Spend your attention tokens wisely.

And once you have influence, don’t forget to use it on behalf of others. Pay attention to those who aren’t being heard, and amplify their voices. Give your time, lend your patronage and credibility, and most of all teach the skills that have made you powerful to others who need them.


P.S. I owe a huge debt to all the awesome senior engineers i’ve gotten to work with.  Mad love to you all.  <3
Image result for influence meme

  • [0] I successfully answered one (1) of these questions before running out of steam.  Later. 
  • [1] Sheepish confession: this is why I became a manager.
  • [2] It’s also a bad sign if they won’t grant any explicit authority to the people they hold responsible for outcomes. I’m talking about relatively healthy orgs here, not pathological ones where people (often women) are told they don’t need promotions or explicit authority, they should just use their “soft power” — esp when the hard forms of power aligned against with them. That’s setting you up for failure.
  • [3] Some people seem caught off guard by my use of “power” to signal anything other than explicit granted powers by the org. This doesn’t make any sense to me. I find it too depressing and disempowering to think of power as merely granted authority. It doesn’t map to how I experience the world, either. Individual clout is a thing that waxes and wanes and only exists in relation to others’. I’ve seen plenty of weak managers pushed around by strong personalities (which is terrible too).
On Engineers and Influence

An Engineer’s Bill of Rights (and Responsibilities)

Power has a way of flowing towards people managers over time, no matter how many times you repeat “management is not a promotion, it’s a career change.”

It’s natural, like water flowing downhill.  Managers are privy to performance reviews and other personal information that they need to do their jobs, and they tend to be more practiced communicators.  Managers facilitate a lot of decision-making and routing of people and data and things, and it’s very easy to slip into making the all decisions rather than empowering people to make them.  Sometimes you want to just hand out assignments and order everyone to do as told.  (er, just me??)

izengBut if you let all the power drift over to the engineering managers, pretty soon it doesn’t look so great to be an engineer.  Now you have people becoming managers for all the wrong reasons, or everyone saying they want to be a manager, or engineers just tuning out and turning in their homework (or quitting).  We all want autonomy and impact, we all crave a seat at the table.  You need to work harder to save those seats for non-managers.

So, in the spirit of the enumerated rights and responsibilities of our musty Constitution, here are some of the commitments we make to our engineers at Honeycomb — and some of the expectations we have for managering and engineering roles.  Some of them mirror each other, and others are very different.

(Incidentally, I find it helpful to practice visualizing the org chart hierarchies upside down — placing managers below their teams as support structure rather than perched atop.)

Engineer’s Bill of Rights

  1. You should be free to go heads down and focus, and trust that your manager will tap you when you are needed (or would want to be included).
  2. We will invest in you as a leader, just like we invest in managers.  Everybody will have opportunities to develop their leadership and interpersonal skills.
  3. Technical decisions must remain the provenance of engineers, not managers.
  4. You deserve to know how well you are performing, and to hear it early and often if you aren’t meeting expectations.
  5. On call should not substantially impact your life, sleep, or health (other than carrying your devices around).  If it does, we will fix it.
  6. Your code reviews should be turned around in 24 hours or less, under ordinary circumstances.
  7. You should have a career path that challenges you and contributes to your personal life goals, with the coaching and support you need to get there.
  8. You should substantially choose your own work, in consultation with your manager and based on our business goals.  This is not a democracy, but you will have a voice in our planning process.
  9. You should be able to do your work whether in or out of the office. When you’re working remotely, your team will loop you in and have your back.

Engineer’s responsibilities

  • Make forward progress on your projects every week. Be transparent.
  • Make forward progress on your career every quarter.  Push your limits.
  • Build a relationship of trust and mutual vulnerability with your manager andcateng team, and invest in those relationships.
  • Know where you stand: how well are you performing, how quickly are you growing?
  • Develop your technical judgment and leadership skills.  Own and be accountable for engineering outcomes.  Ask for help when you need it, give help when asked.
  • Give feedback early and often, receive feedback gracefully.  Practice both saying no and hearing no.  Let people retract and try again if it doesn’t come out quite right.
  • Own your time and actively manage your calendar.  Spend your attention tokens mindfully.

Manager’s responsibilities

  • Recruit and hire and train your team.  Foster a sense of solidarity and “teaminess” as well as real emotional safety.
  • Care for every engineer on your team.  Support them in their career trajectory, personal goals, work/life balance, and inter- and intra-team dynamics.
  • Give feedback early and often. Receive feedback gracefully. Always say the hard things, but say them with love.
  • Move us relentlessly forward, watching out for overengineering and work that doeasshatsn’t contribute to our goals.  Ensure redundancy/coverage of critical areas.
  • Own the quarterly planning process for your team, be accountable for the goals you set.  Allocate resources by communicating priorities and recruiting eng leads.  Add focus or urgency where needed.
  • Own your time and attention. Be accessible. Actively manage your calendar.  Try not to make your emotions everyone else’s problems (but do lean on your own manager and your peers for support).
  • Make your own personal growth and self-care a priority. Model the values and traits we want our engineers to pattern themselves after.
  • Stay vulnerable.

I’d love to hear from anyone else who has a list like this.





An Engineer’s Bill of Rights (and Responsibilities)

DevOps vs SRE: delayed coverage of the dumbest war

Last week was the West Coast Velocity conference.  I had a terrific time — I think it’s the
best Velocity I’ve been to yet.  I also slipped in quite late, the evening before last, to catch Gareth’s session on DevOps vs SRE.

had to catch it, because Gareth Rushgrove (of DevOps Weekly glory) was taunting @lusis and me about it on the Internet.

rainbowdropletAnd it was worth it!   Holy crap, this was such a fun barnburner of a talk, with Gareth schizophrenically arguing both for and against the key premise of the talk, which was about “Google Infrastructure for Everyone Else (GIFEE)” and whether SRE is a) the highest, noblest goal that we should all aspire towards, or b) mostly irrelevant to anyone outside the Google confines.

Which Gareth won?  Check out the slides and judge for yourself.  🙃


At some point in his talk, though, Gareth tossed out something like “Charity probably already has a blog post on this drafted up somewhere.”  And I suddenly remembered “Fuck!  I DO!”  it’s been sitting in my Drafts for months god dammit.

So this is actually a thing I dashed off back in April, after CraftConf.  Somebody asked me for my opinion on the internet — always a dangerous proposition — and I went off on a bit of a rant about the differences and similarities between DevOps and SRE, as philosophies and practices.

Time passed and I forgot about it, and then decided it was too stale.  I mean who really wants to read a rehash of someone’s tweetstorm from two months ago?

Well Gareth, apparently.

Anyway: enjoy.



So in case you haven’t noticed, Google recently published a book about Site Reliability Engineering: How Google Runs Production Systems.  It contains some really terrific wisdom on how to scale both systems and orgs.  It contains chapters written by dear friends of mine.  It’s a great book, and you should buy it and read it!

Rainbow-Umbrella-Z-5_5It also has some really fucking obnoxious blurbs.  Things like about how “ONLY GOOGLE COULD HAVE DONE THIS”, and an whiff of snobbery throughout the book as though they actually believe this (which is far worse if true).

You can’t really blame the poor blurb’ers, but you can certainly look askance at a massive systems engineering org when it seems as though they’ve never heard of DevOps, or considered how it relates to SRE practices, and may even be completely unaware of what the rest of the industry has been up to for the past 10-plus years.  It’s just a little weird.

So here, for the record, is what I said about it.


Google is a great company with lots of terrific engineers, but you can only say they are THE

The Google SRE Bible

BEST at what they do if you’re defining what they do tautologically, i.e. “they are the best at making Google run.”  Etsyans are THE BEST at running Etsy, Chefs are THE BEST at building Chef, because … that’s what they do with their lives.

Context is everything here.  People who are THE BEST at Googling often flail and flame out in early startups, and vice versa.  People who are THE BEST at early-stage startup engineering are rarely as happy or impactful at large, lumbering, more bureaucratic companies like Google.  People who can operate equally well and be equally happy at startups and behemoths are fairly rare.

And large companies tend to get snobby and forget this.  They stop hiring for uniquerainbow-swirl strengths and start hiring for lack of weaknesses or “Excellence in Whiteboard Coding Techniques,” and congratulate themselves alot about being The Best.  This becomes harmful when it translates into to less innovation, abysmal diversity numbers, and a slow but inexorable drift into dinosaurdom.

Everybody thinks their problems are hard, but to a seasoned engineer, most startup problems are not technically all that hard.  They’re tedious, and they are infinite, but anyone can figure this shit out.  The hard stuff is the rest of it: feverish pace, the need to reevaluate and reprioritize and reorient constantly, the total responsibility, the terror and uncertainty of trying to find product/market fit and perform ten jobs at once and personally deliver to your promises to your customers.

rainbow-cloud-dropletAt a large company, most of the hardest problems are bureaucratic.  You have to come to terms with being a very tiny cog in a very large wheel where the org has a huge vested interest in literally making you as replicable and replaceable as possible.  The pace is excruciatingly slow if you’re used to a startup.  The autonony is … well, did I mention the politics?  If you want autonomy, you have to master the politics.


Everyone.  Operational excellence is everyone’s job.  Dude, if you have a candidate come in and they’re a jerk to your office manager or your cleaning person, don’t fucking hire that person because having jerks on  your team is an operational risk (not to mention, you know, like moral issues and stuff).

But the more engineering-focused your role is, the more direct your impact will be on operational outcomes.

As a software engineer, developing strong ops chops makes you powerful.  It makes you better at debugging and instrumentation, building resiliency and observability into your own systems and interdependent systems, and building systems that other people can come along and understand and maintain long after you’re gone.rainbow-shade

As an operations engineer, those skills are already your bread and butter.  You can increase your power in other ways, like by leveling up at software engineering skills like test coverage and automation, or DBA stuff like query optimization and storage engine internals, or by helping the other teams around you level up on their skills (communication and persuasion are chronically underrecognized as core operations engineering skills).

This doesn’t mean that everyone can or should be able to do everything.  (I can’t even SAYrainbow-dot-ball the words “full stack engineer” without rolling my eyes.)  Generalists are awesome!  But past a certain inflection point, specialization is the only way an org can scale.

It’s the only way you make room for those engineering archetypes who only want to dive deep, or who really really love refactoring, or who will save the world then disappear for weeks.  Those engineers can be incredibly valuable as part of a team … but they are most valuable in a large org where you have enough generalists to keep the oars rowing along in the meantime.

So, back to Google.  They’ve done, ahem, rather  well for themselves.  Made shitbuckets of money, pushed the boundaries of tech, service hardly ever goes down.  They have operational demands that most of us never have seen and never will, and their engineers are definitely to be applauded for doing a lot of hard technical and cultural labor to get there.

So why did this SRE book ruffle a few feathers?

Mostly because it comes off a little tone deaf in places.  I’m not personally pissed off by
the google SRE book, actually, just a little bemused at how legitimately unaware they seem to be about … anything else that the industry has been doing over the past 10 years, in terms of cultural transformation, turning sysadmins into better engineers, sharing on-call rotations, developing processes around empathy and cross-functionality, engineering best practices, etc.

DevOps for the rest of us

If you try and just apply Google SRE principles to your own org according to their prescriptive model, you’re gonna be in for a really, really bad time.

However, it happens that Jen Davis and Katherine Daniels just published a book called Effective DevOps, which covers a lot of the same ground with a much more varied and inclusive approach.  And one of the things they return to over and over again is the power of context, and how one-size-fits-all solutions simply don’t exist, just like unisex OSFA t-shirts are a dirty fucking lie.

Google insularity is … a thing.  On the one hand it’s great that they’re opening up a bit!rainbow-umbrella-clipart-1 On the other hand it’s a little bit like when somebody barges onto a mailing list and starts spouting without skimming any of the archives.  And don’t even get me started on what happens when you hire long, longterm ex-Googlers back into to the real world.

So, so many of us have had this experience of hiring ex-Googlers who automatically assume that the way Google does a thing is CORRECT, not just contextually appropriate.   Not just right for Google, but right for everyone, always.  Which is just obviously untrue.  But the reassimilation process can be quite long and exhausting when the Kool-Aid is so strong.

Because yeah, this is a conversation and a transformation that the industry has been having for a long time now.  Compared with the SRE manifesto, the DevOps philosophy is much more crowd-sourced, more flexible, and adaptable to organizations of all stages of developments, with all different requirements and key business differentiators, because it’s benefited from loud, mouthy contributors who aren’t all working in the same bubble all along.

And it’s like Google isn’t even aware this was happening, which is weird.

Orrrrrr, maybe I’m just a wee bit annoyed that I’ve been drawn into this position rainbow-dot-ballof having to defend “DevOps”, after many excellent years spent being grumpy about the word and the 10000010101 ways it is used and abused.

(Tell me again about your “DevOps Engineering Team”, I dare you.)

P.S. I highly encourage you to go read the epic hours-long rant by @matthiasr that kicked off the whole thing.  some of which I definitely endorse and some of which not, but I think we could go drink whiskey and yell about this for a week or two easy breezy  <3

Anyway what the fuck do I know, I’ve never worked in the Google lair, so maybe I am just under-equipped to grasp the true glory, majesty and superiority of their achievements over us all.

Or maybe they should go read Katherine and Jen’s book and interact with the “UnGoogled” once in a while.  ☺️



DevOps vs SRE: delayed coverage of the dumbest war