The Future of Ops is Platform Engineering

First published on 2022-09-30 at https://www.honeycomb.io/blog/future-ops-platform-engineering.

Two years ago I wrote a piece in The New Stack about the Future of Ops Careers. Towards the end, I wrote:

The reality is that jack-of-all-trades systems infrastructure jobs are slowly vanishing: the world doesn’t need thousands of people who can expertly tune postfix, SpamAssassin, and ClamAV—the world has Gmail. (…)

Building infrastructure and operational expertise used to be bundled together into a single role. But the industry is now bifurcating along an infrastructure fault line, and the overlap between infrastructure-oriented engineers and operationally-minded engineers is swiftly eroding. Engineers who love this work increasingly have a choice to make. Either you can 1) go deep on infrastructure by joining a company that does infrastructure as a service, or 2) go broad on operability by joining a company to help them do as little infrastructure as possible.

I described the second category as “operations engineering minus the infrastructure,” dedicated to evaluating and assembling a production stack of third-party platform providers, enabling software engineers to self-serve their services and own their own code in production. I said:

  • Your job will be to aggressively minimize the cycles your org devotes to infrastructure by finding effective ways to outsource or minimize infra labor. Your job is to NOT go deep if there is any workable alternative.
  • Your job will be to work cross-functionally with all the other software engineering teams, looking for ways to speed up their time to value and helping them own their own code in production.
  • Your job will be to move past the kludgey old models of “outsourcing” to sophisticated understandings of how and where to leverage abstractions that can radically accelerate development.

That second category I was describing now has a name. We call those teams “platform engineering.”

The fifty-year arc of software careers

In the beginning, there were people who wrote and ran software. At some point, we spun away ops skills from dev skills into two different professions, but that turned out to be a ginormous mistake, so along came DevOps to reunify them. Nowadays, ops as an independent profession is in the process of fading out. Companies are spinning down their ops teams left and right. Engineers who formerly identified as sysadmins or operations have turned into DevOps engineers, and soon there will just be “software people” again. This is the way of things.

Please note that this is NOT the same thing as saying “ops is dead,” or “ops skills are no longer valuable or needed1.” Our systems are only getting more complex, more difficult to operate, and simultaneously more critical to life on earth, which means that operational excellence has never been more desperately needed (and if you don’t respect that, 🌈 you deserve to suffer 🌈).

The industry story of the past three to five years has been us trying to figure out how to help software engineers own their own code in production2, phasing out dedicated ops teams, and aggressively outsourcing as much infrastructure as possible.

As we should. Developer cycles are the scarcest resource in your company, and you want to spend as many of those as possible on your core product: the crown jewel, the code that makes you a business. Money is cheaper than engineering cycles, and teams that are focused on their core business will always outperform teams whose focus is spread across dozens of non-revenue-generating projects. Let someone else build and run all the dependencies and adjacencies.

Before: some engineers wrote code, and some engineers ran code.

Now: all engineers write code, and all engineers run the code they write.

Platform engineering is what stands between you and darkness

When you start talking about putting software engineers on call for their own code, and generally being more involved in production, some percentage of the time you will hear back a guttural wail of despair: “You can’t expect me to know EVERYTHING about EVERYTHING!”

Quite right; we can’t. Platform engineering teams are part of the answer to this perfectly reasonable complaint. It’s not that you’re being asked to do or understand more in toto, but the distribution of labor and responsibility is shifting:

Before: some engineers wrote code, and some engineers ran code.

Now: all engineers write code, and all engineers run the code they write—but we divide the areas of responsibility by layer or function.

The emergence of a minimum viable self-serve tier

In the earliest days of a company, your first few engineers end up bootstrapping an infrastructure by reading AWS docs or blog posts, or asking a friend for recommendations to get started. They might start by setting up a managed container service, or configuring Terraform, and for a while everybody deploys and owns their own code, just as god intended.

But cognitive limits kick in pretty quickly. The maze of APIs and SDKs and components out there is simply bewildering, even for an experienced ops hand. Before long, it becomes someone’s job to make good decisions, pick a suite of compute and storage options that serve the team’s needs, and write some tooling that pulls everything into a coherent whole—which, at a minimum, lets you:

  1. Run tests and generate new artifacts
  2. Deploy artifacts, version them, and roll back
  3. Instrument, monitor, and debug
  4. Store data somewhere, manage schemas and migrations
  5. Adjust capacity as needed
  6. Define and commit all components (and their relationships) as code

Once these are built, it should be trivial for an engineer to come along and spin up a new service using templates and components from existing services. It should be much simpler and easier to use the blessed paths than anything else, and there should be friction if you go off the beaten path.

Congratulations! You’ve just been platformed 🎉. One of the key principles of any developer platform is that it should be easy to do the right things, and hard to do the wrong things.

The differences between platform engineering and traditional ops

Platform teams are typically staffed by engineers who are comfortable writing software. Not just scripting and automation, but writing tests and doing code reviews. Platform teams also operate much more like product development teams do, with product managers (and occasionally, designers, developer advocates, or UX researchers).

This doesn’t mean that everybody on a platform team has to have originally been a software engineer; in fact, a super common failure condition for platform teams is simply thinking all they need to do is hire software engineers to build developer tools. A strong platform team has an equally deep grounding in operations experience and software development. Individuals who are experts in both areas are fairly rare, but you can pull together a strong, well-rounded team by assembling a mix of SWEs (with some ops experience) and ops or DevOps engineers (with some software experience) and having them learn and grow from each other.

Platform teams are decidedly cloud-native; they actually mostly involve platforms built atop the cloud itself—PaaS, IaaS, everything-aaS, serverless, and so forth.

Ops/DevOps teams are oriented around managing infrastructure, often several generations of infrastructure. Their turf is everything from data centers and bare metal up through virtualization, containers, and the cloud (they aren’t so much cloud-native as cloud-enabled). They measure themselves on things like SLOs and the DORA metrics. You know they’re doing a good job if the system is up/available and users are happy.

Platform teams are oriented around providing a good experience for developers to self-serve and self-manage their code. The more swiftly and easily developers can move, the better your platform team. Operational excellence, in the platform model, is actually more the responsibility of the other engineering teams (and/or an adjacent SRE team) than that of the platform team.

Platform teams typically work higher up the stack than operations, DevOps, or SRE teams do, and they involve a great deal less infrastructure. On the contrary, platform teams are bent on paying other people to run as much shit as possible, preserving their own scarce development cycles for their core product.

Here is a somewhat tongue-in-cheek table of the similarities and differences between the archetypes.

Platform engineers vs. DevOps engineers

Platform Engineer Ops (or DevOps) Engineer
% of job spent writing code > 50% < 50%
Rest of time spent Gathering product requirements, doing user research, architecture discussions, optimizing internal workflows, researching new tools and developer productivity ideas, reviewing other teams’ diffs for impact, performance tuning, helping other engineers own & scale their code, fixing CI/CD pipelines. Fixing cron jobs, automating old setup docs, converting PXE/rsync to Chef/Puppet, converting Chef/Puppet to Terraform, converting VMs to containers, deploying software, debugging broken deploys, writing monitoring checks, doing retros, building out new services, pairing with software engineers to understand and debug their code, investigating weird shit, documentation, etc.
Responsible for Enabling internal teams to self-serve their ability to run and own their code in production. Creating standard, reusable components and processes. Defining golden paths. Infrastructure capacity planning, scaling, performance tuning, upgrading. Reliability and resiliency, SLOs and monitoring/alerting. Delivering quality experience to customers.
Builds for Internal developer teams Customers
Development style Infrastructure as a product Infrastructure as code
Works with product managers Yes No
Works with UX researchers or designers Sometimes No
Dashboards & graphs Uses APM, observability, tracing. Cares a lot about instrumentation and OpenTelemetry. Uses metrics, logs, dashboards; monitoring, alerting, and agent/sidecar/blackbox telemetry.
What ‘coding’ means to them Developing new features & services, writing tests. These are (primarily) software people who do systems. Automation, configuration, DSLs, extending and debugging existing code. These are systems people who do software.
Preferred language Go, Rust Python, Ruby
Time spent in Linux Hardly any A lot
Succeeds when Developers can easily choose good defaults, self-serve their infra, and own their own code in production. Infrastructure is scalable, secure, cost-effective, reliable, and customers are happy.
Native terrain Serverless, *aaS, APIs for everything (cloud-native and above). Instances, VMs, containers, regions, multi-cloud (everything “below,” but up to and including the cloud).
Databases Uses hosted DBs Runs their own, blending automation & DBA expertise
SSH No Yes
Shell REPL bash/zsh
Mantra “Run Less Software” “Cattle, Not Pets”

What about DevOps vs. SRE?

Countless words have been spilled on the difference between DevOps and SRE3, which I won’t rehash.

Here’s what I’ll say: DevOps, to me, feels like a relevant concept for companies that have a lot of infrastructure to wrangle. Companies that do in fact have dev teams and ops teams, or dev teams and DevOps teams (🙄), tend to have a lot of operational shit to automate, test, and run. They use config management, virtualization, and containers, often managing several generations worth of technology, possibly even down to data centers and bare metal. DevOps is for companies that have some combination of bare metal, VMs, regions, AZs, multi-cloud, networking devices, self-managed databases, etc.

DevOps is capacious. It contains multitudes. DevOps writes code, and DevOps has a fuckload of code to manage.

It is also on its way to becoming irrelevant. We are swiftly entering a post-DevOps world.

SRE, to me, feels different. I associate SRE with very large companies, where they mostly have software engineers owning their own code in production, but maybe still struggle with it a bit. SREs are often embedded within software engineering teams or product groups, and they focus a lot on, well, reliability, as the name suggests.

This means they do less infrastructure jockeying or automating (although they still do some coding). They typically have a lot to say about instrumentation, monitoring and observability, and cross-functional coordination. They run incident response and do blameless retros, and they tend to be experts at scaling.

If a company has both a DevOps team and SRE, typically I expect to see the SRE team more on the frontlines, involved with incidents, telemetry, etc., and DevOps teams more on the backburner, slinging pipes and plumbing.

Observability engineering as a case study

In the same piece I referenced earlier, I also wrote about the role of observability teams. I said they should largely no longer be running their own monitoring and graphing software in-house. Yet there is still a place for observability teams to exist: they remain a critical link between outsourced solutions and internal developer needs.

That team should write libraries, generate examples, and drive standardization; ushering in consistency, predictability, and usability. They should partner with internal teams to evaluate use cases. They should partner with your vendors as roadmap stakeholders. They might also write glue code and helper modules to connect disparate data sources and create cohesive visualizations. Basically, that team becomes an integration point between your organization and the outsourced work.

I originally wrote this about observability, but it could just as easily be used to describe platform engineering as a whole. This is the role—being the bridge between other vendors and your own core software. It’s a very high-leverage place to sit.

Ops is dead, long live ops

I’ve spent a lot of time thinking about this because we’ve had such a hard time nailing down exactly who the Honeycomb customer is. Sometimes our buyer is an ops team buying it for their SWEs, sometimes it’s SREs in the midst of an outage, sometimes it’s a VP or director of engineering, or an architect, or a CTO, or a “full stack” engineering team, or even a product manager. It is hard to form a snappy answer out of that list.

The first couple questions every new go-to-market candidate asks us are “who is your buyer?” and “how do we help them?” To which I respond with a five minute ramble where I list every above persona and each of their pain points. Hardly the concrete answer they would like to receive.

As it goes, sociotechnical trends come and go. A year ago, Christine and I were speculating that platform engineering might be on the verge of consolidating the necessary ingredients that makes up our ideal buyer:

  1. Writing and shipping code, and needing to understand their own code
  2. Positioned to help other teams with their instrumentation patterns and tooling
  3. Firmly cloud-native+ and untethered to hardware or traditional infrastructure

To my delight, since that conversation, these trends have only accelerated—and I, for one, welcome our new platform engineering overlords to the observability table. ☺️

If you’d like to learn more about platform engineering, we’ll be running a Twitter space on ✨ October 20th ✨ at 12:00 p.m. PT. Come join us! I’ll be there along with two colleagues and we’ll be answering your questions and shedding more light on the topic.


1  I do hear people saying that, and it used to make me fucking furious, but now I just smugly remind myself how much self-inflicted suffering they are in for. Disrespecting operational expertise is the shortest path to never again sleeping through the night.

2 It is rather incredible how rapidly this idea has taken off. When we started talking about putting developers on call for their code in 2016, people got seriously angry with us. Before that, the only twitter mention I could find of putting devs on call was one by (of course) Adrian Cockcroft, but by 2019-2020 it had stopped being controversial and soon became common wisdom.

3 I actually wrote one of those myself: DevOps vs SRE: Delayed Coverage of the Dumbest War). LMAO. I think Liz had the final word on this back in … 2017? 2018? … when she said something like class SRE implements DevOps. And yes, DevOps is a philosophy or a methodology and not a job title, etc.

The Future of Ops is Platform Engineering

Rituals for Engineering Teams

Last weekend I happened to pick up a book called “Rituals For Work: 50 Ways To Create Engagement, Shared Purpose, And A Culture That Can Adapt To Change.” It’s a super quick read, more comic book than textbook, but I liked it.

It got me thinking about the many rituals I have initiated and/or participated in over the course of my career. Of course, I never thought of them as such — I thought of them as “having fun at work” 🙃 — but now I realize these rituals disproportionately contribute to my favorite moments and the most precious memories of my career.

Rituals (a definition): Actions that a person or group does repeatedly, following a similar pattern or script, in which they’ve imbued symbolism and meaning.

I think it is extremely worth reading the first 27 pages of the book — the Introduction and Part One. To briefly sum up the first couple chapters: the power of creative rituals comes from their ability to link the physical with the psychological and emotional, all with the benefit of “regulation” and intentionality. Physically going through the process of a ritual helps people feel satisfied and in control, with better emotional regulation and the ability to act in a steadier and more focused way. Rituals also powerfully increase people’s sense of belonging, giving them a stable feeling of social connection. (p. 5-6)

The thing that grabbed me here is that rituals create a sense of belonging. You show that you belong to the group by participating in the ritual. You feel like you belong to the group by participating in the ritual. This is powerful shit!

It seems especially relevant these days when so many of us are atomized and physically separated from our teammates. That ineffable sense of belonging can make all the difference between a job that you do and a role that feeds your soul. Rituals are a way to create that sense of belonging. Hot damn.

So I thought I’d write up some of the rituals for engineering teams I remember from jobs past. I would love to hear about your favorite rituals, or your experience with them (good or bad). Tell me your stories at @mipsytipsy. 🙃

Rituals at Linden Lab

Feature Fish Freeze

At Linden Lab, in the ancient era of SVN, we had something called the “Feature Fish”. It was a rubber fish that we kept in the freezer, frozen in a block of ice. We would periodically cut a branch for testing and deployment and call a feature freeze. Merging code into the branch was painful and time consuming, so If you wanted to get a feature in after the code freeze, you had to first take the fish out of the freezer and unfreeze it.

This took a while, so you would have to sit there and consider your sins as it slowly thawed. Subtext: Do you really need to break code freeze?

Stuffy the Code Reviewer

You were supposed to pair with another engineer for code review. In your commit message, you had to include the name of your reviewer or your merge would be rejected. But the template would also accept the name “Stuffy”, to confess that your only reviewer had been…Stuffy, the stuffed animal.

However if your review partner was Stuffy, you would have to narrate the full explanation of Stuffy’s code review (i.e., what questions Stuffy asked, what changes he suggested and what he thought of your code) at the next engineering meeting. Out loud.

Shrek Ears

We had a matted green felt headband with ogre ears on it, called the Shrek Ears. The first time an engineer broke production, they would put on the Ears for a day. This might sound unpleasant, like a dunce cap, but no — it was a rite of passage. It was a badge of honor! Everyone breaks production eventually, if they’re working on something meaningful.

If you were wearing the Shrek Ears, people would stop you throughout the day and excitedly ask what happened, and reminisce about the first time they broke production. It became a way for 1) new engineers to meet lots of their teammates, 2) to socialize lots of production wisdom and risk factors, and 3) to normalize the fact that yes, things break sometimes, and it’s okay — nobody is going to yell at you. ☺️

This is probably the number one ritual that everybody remembers about Linden Lab. “Congratulations on breaking production — you’re really one of us now!”

Vorpal Bunny

vorpal bunny

We had a stuffed Vorpal Bunny, duct taped to a 3″ high speaker stand, and the operations engineer on call would put the bunny on their desk so people knew who it was safe to interrupt with questions or problems.

At some point we lost the bunny (and added more offices), but it lingered on in company lore since the engineers kept on changing their IRC nick to “$name-bunny” when they went on call.

There was also a monstrous, 4-foot-long stuffed rainbow trout that was the source of endless IRC bot humor… I am just now noticing what a large number of Linden memories involve stuffed animals. Perhaps not surprising, given how many furries were on our platform ☺️

Rituals at Parse

The Tiara of Technical Debt

Whenever an engineer really took one for the team and dove headfirst into a spaghetti mess of tech debt, we would award them the “Tiara of Technical Debt” at the weekly all hands. (It was a very sparkly rhinestone wedding tiara, and every engineer looked simply gorgeous in it.)

Examples included refactoring our golang rewrite code to support injection, converting our entire jenkins fleet from AWS instances to containers, and writing a new log parser for the gnarliest logs anyone had ever seen (for the MongoDB pluggable storage engine update).

Bonfire of the Unicorns

We spent nearly 2.5 years rewriting our entire ruby/rails API codebase to golang. Then there was an extremely long tail of getting rid of everything that used the ruby unicorn HTTP server, endpoint by endpoint, site by site, service by service.

When we finally spun down the last unicorn workers, I brought in a bunch of rainbow unicorn paper sculptures and a jug of lighter fluid, and we ceremonially set fire to them in the Facebook courtyard, while many of the engineers in attendance gave their own (short but profane) eulogies.

Mission Accomplished

This one requires a bit of backstory.

For two solid years after the acquisiton, Facebook leadership kept pressuring us to move off of AWS and on to FB infra. We kept saying “no, this is a bad idea; you have a flat network, and we allow developers all over the world to upload and execute random snippets of javascript,” and “no, this isn’t cost effective, because we run large multi-terabyte MongoDB replica sets by RAIDing together multiple EBS volumes, and you only have 2.5TB FusionIO (for extremely high-perf mysql/RocksDB) and 40 TB spinning rust volumes (for Hadoop), and also it’s impossible to shrink or slice up replsets”, and so forth. But they were adamant. “You don’t understand. We’re Facebook. We can do anything.” (Literal quote)

Finally we caved and got on board. We were excited! I announced the migration and started providing biweekly updates to the infra leadership groups. Four months later, when the  migration was half done, I get a ping from the same exact members of Facebook leadership:

“What are you doing?!?”
“Migrating!”
“You can’t do that, there are security issues!”
“No it’s fine, we have a fix for it.”
“There are hardware issues!”
“No it’s cool, we got it.”
You can’t do this!!!”

ANYWAY. To make an EXTREMELY long and infuriating story short, they pulled the plug and canned the whole project. So I printed up a ten foot long “Mission Accomplished” banner (courtesy of George W Bush on the aircraft carrier), used Zuck’s credit card to buy $800 of top-shelf whiskey delivered straight to my desk (and cupcakes), and we threw an angry, ranty party until we all got it out of our systems.

Blue Hair

I honestly don’t remember what this one was about, but I have extensive photographic evidence to prove that I shaved the heads of and/or dyed the hair blue of at least seven members of engineering. I wish I could remember why! but all I remember is that it was fucking hilarious.

In Conclusion

Coincidentally (or not), I have no memories of participating in any rituals at the jobs I didn’t like, only the jobs I loved. Huh.

One thing that stands out in my mind is that all the fun rituals tend to come bottoms-up. A ritual that comes from your VP can run the risk of feeling like forced fun, in a way it doesn’t if it’s coming from your peer or even your manager. I actually had the MOST fun with this shit as a line manager, because 1) I had budget and 2) it was my job to care about teaminess.

There are other rituals that it does make sense for executives to create, but they are less about hilarious fun and more about reinforcing values. Like Amazon’s infamous door desks are basically just a ritual to remind people to be frugal.

Rituals tend to accrue mutations and layers of meaning as time goes on. Great rituals often make no sense to anybody who isn’t in the know — that’s part of the magic of belonging. 🥰

Now, go tell me about yours!

charity

Rituals for Engineering Teams

Software deploys and cognitive biases

There exist some wonderful teams out there who have valid, well thought through, legitimate reasons for enforcing “NO FRIDAY DEPLOYS” week in and week out, for not hooking CI/CD up to autodeploy, and for not shipping one person’s changes at a time.

And then there are the reasons most people have.

Bad decisions, and the biases they came from

 

We’re humans. 💜  We leap to conclusions with the wetware we have doing the best it can based on heuristics that feel objectively true, but are ultimately just emotional reactions based on past lived experience. And then we retroactively enshrine those goofy gut feelings with the language of noble motive and moral values.

“I tell people not to deploy to production … because I care so deeply about my team and their ability to have a quiet weekend.”

Barf. 🙄  That’s just like saying you tell your kid not to brush his teeth at night, because you care SO DEEPLY about him and his ability to go to bed calm and happy.

Once the retcon engine in your brain gets running, it comes up with all sorts of reasons. Plausible-sounding reasons! But every single argument of the items in the list above is materially false.

Deploy myths are never going away for good; they appeal to too many of our cognitive biases. But what if there was one simple thing you could do that would invert many of these cognitive biases and cause people to grapple with the question in a new way? What if you could kickstart a recalculation?

My next post will pick up right here. I’ll tell you all about the One Simple Trick you can do to fix your deploys and set you on the virtuous path of high-performing teams.

Til then, here’s what I’ve previously written on the topic.

 

Footnotes

 

Availability bias: The tendency to overestimate the likelihood of events with greater “availability” in memory, which can be influenced by how recent the memories are or how unusual or emotionally charged they may be.

Continued influence effect: The tendency to believe previously learned misinformation even after it has been corrected. Misinformation can still influence inferences one generates after a correction has occurred.

Conservatism bias: The tendency to revise one’s belief insufficiently when presented with new evidence.

Default effect: When given a choice between several options, the tendency to favor the default one.

Dread aversion: Just as losses yield double the emotional impact of gains, dread yields double the emotional impact of savouring

False-uniqueness bias: The tendency of people to see their projects and themselves as more singular than they actually are.

Functional fixedness: Limits a person to using an object only in the way it is traditionally used

Hyperbolic discounting: Discounting is the tendency for people to have a stronger preference for more immediate payoffs relative to later payoffs. Hyperbolic discounting leads to choices that are inconsistent over time – people make choices today that their future selves would prefer not to have made, despite using the same reasoning

IKEA effect: The tendency for people to place a disproportionately high value on objects that they partially assembled themselves, such as furniture from IKEA, regardless of the quality of the end product

Illusory truth effect: A tendency to believe that a statement is true if it is easier to process, or if it has been stated multiple times, regardless of its actual veracity.

Irrational escalation: The phenomenon where people justify increased investment in a decision, based on the cumulative prior investment, despite new evidence suggesting that the decision was probably wrong. Also known as the sunk cost fallacy

Law of the instrument: An over-reliance on a familiar tool or methods, ignoring or under-valuing alternative approaches. “If all you have is a hammer, everything looks like a nail”

Mere exposure effect: The tendency to express undue liking for things merely because of familiarity with them

Negativity bias: Psychological phenomenon by which humans have a greater recall of unpleasant memories compared with positive memories

Non-adaptive choice switching: After experiencing a bad outcome with a decision problem, the tendency to avoid the choice previously made when faced with the same decision problem again, even though the choice was optimal

Omission bias: The tendency to judge harmful actions (commissions) as worse, or less moral, than equally harmful inactions (omissions).

Ostrich effect: Ignoring an obvious (negative) situation

Plan continuation bias: Failure to recognize that the original plan of action is no longer appropriate for a changing situation or for a situation that is different than anticipated

Prevention bias: When investing money to protect against risks, decision makers perceive that a dollar spent on prevention buys more security than a dollar spent on timely detection and response, even when investing in either option is equally effective

Pseudocertainty effect: The tendency to make risk-averse choices if the expected outcome is positive, but make risk-seeking choices to avoid negative outcomes

Salience bias: The tendency to focus on items that are more prominent or emotionally striking and ignore those that are unremarkable, even though this difference is often irrelevant by objective standards

Selective perception bias: The tendency for expectations to affect perception

Status-quo bias: If no special action is taken, the default action that will happen is that the code will go live. You will need an especially compelling reason to override this bias and manually stop the code from going live, as it would by default.

Slow-motion bias: We feel certain that we are more careful and less risky when we slow down. This is precisely the opposite of the real world risk factors for shipping software. Slow is dangerous for software; speed is safety. The more frequently you ship code, the smaller the diffs you ship, the less dangerous each one actually becomes. This is the most powerful and difficult to overcome of all of our biases, because there is no readily available counter-metaphor for us to use. (Riding a bike is the best I’ve come up with. 😔)

Surrogation: Losing sight of the strategic construct that a measure is intended to represent, and subsequently acting as though the measure is the construct of interest

Time-saving bias: Underestimations of the time that could be saved (or lost) when increasing (or decreasing) from a relatively low speed and overestimations of the time that could be saved (or lost) when increasing (or decreasing) from a relatively high speed.

Zero-risk bias: Preference for reducing a small risk to zero over a greater reduction in a larger risk.

Software deploys and cognitive biases

Why every software engineering interview should include ops questions

I’ve fallen way behind on my blog posts — my goal was to write one per month, and I haven’t published anything since MAY. Egads. So here I am dipping into the drafts archives! This one was written in April of 2016, when I was noodling over my CraftConf 2016 talk on “DevOps for Developers (see slides).”

So I got to the part in my talk where I’m talking about how to interview and hire software engineers who aren’t going to burn the fucking house down, and realized I could spend a solid hour on that question alone. That’s why I decided to turn it into a blog post instead.

Stop telling ops people to code better, start telling SWEs to ops better

Our industry has gotten very good at pressing operations engineers to get better at writing code, writing tests, and software engineering in general these past few years. Which is great! But we have not been nearly so good at pushing software engineers to level up their systems skills. Which is unfortunate, because it is just as important.

Most systems suffer from the syndrome of running too much software. Tossing more software into the heap is as likely to cause more problems as often as it solves them.

We see this play out at companies stacked with good software engineers who have built horrifying spaghetti messes of their infrastructure, and then commence paging themselves to death.

The only way to unwind this is to reset expectations, and make it clear that

  1. you are still responsible for your code after it’s been deployed to production, and 
  2. operational excellence is everyone’s job.

Operations is the constellation of tools, practices, policies, habits, and docs around shipping value to users, and every single one of us needs to participate in order to do this swiftly and safely.

Every software engineering interviewing loop should have an ops component.

Nobody interviews candidates for SRE or ops nowadays without asking some coding questions. You don’t have to be the greatest programmer in the world, but you can’t be functionally illiterate. The reverse is less common: asking software engineers basic, stupid questions about the lifecycle of their code, instrumentation best practices, etc. 

It’s common practice at lots of companies now to have a software engineer in the loop for hiring SREs to evaluate their coding abilities. It should be just as common to have an ops engineer in the loop for a SWE hire, especially for any SWE who is being considered for a key senior position. Those are the people you most rely on to be mentors and role models for junior hires. All engineers should embrace the ethos of owning their code in production, and nobody should be promoted or hired into a senior role if they don’t.

And yes, that means all engineers!  Even your iOS/Android engineers and website developers should be interested in what happens to their code after they hit deploy.  They should care about things like instrumentation, and what kind of data they may need later to debug their problems, and how their features may impact other infrastructure components.

You need to balance out your software engineers with engineers who don’t react to every problem by writing more code. You need engineers who write code begrudgingly, as a last resort. You’ll find these priceless gems in ops and SRE.

ops questions for software engineers

The best questions are broad and start off easy, with plenty of reasonable answers and pathways to explore. Even beginners can give a reasonable answer, while experts can go on talking for hours.

For example: give them the specs for a new feature, and ask them to talk through the infrastructure choices and dependencies to support that feature. Do they ask about things like which languages, databases, and frameworks are already supported by the team? Do they understand what kind of monitoring and observability tools to use, do they ask about local instrumentation best practices?

Or design a full deployment pipeline together. Probe what they know about generating artifacts, versioning, rollbacks, branching vs master, canarying, rolling restarts, green/blue deploys, etc. How might they design a deploy tool? Talk through the tradeoffs.

Some other good starting points:

  • “Tell me about the last time you caused a production outage. What happened, how did you find out, how was it resolved, and what did you learn?”
  • “What are some of your favorite tools for visibility, instrumentation, and debugging?
  • “Latency seems to have doubled over the last 6 hours. Where do you start looking, how do you start debugging?”
  • And this chestnut: “What happens when you type ‘google.com’ into a web browser?” You would be fucking *astonished* how many senior software engineers don’t know a thing about DNS, HTTP, SSL/TLS, cookies, TCP/IP, routing, load balancers, web servers, proxies, and on and on.

Another question I really like is: “what’s your favorite API (or database, or language) and why?” followed up by “… and what are the worst things about it?” (True love doesn’t mean blind worship.)

Remember, you’re exploring someone’s experience and depth here, not giving them a pass-fail quiz. It’s okay if they don’t know it all. You’re also evaluating them on communication skills, which is severely underrated by most people but is actually as a key technical skill.

Signals to look for

You’re not looking for perfection. You are teasing out signals for things like, how will this person perform on a team where software engineers are expected to own their code? How much do they know about the world outside the code they write themselves? Are they curious, eager, and willing to learn, or fearful, incurious and begrudging?

Do they expect networks to be reliable? Do they expect databases to respond, retries to succeed? Are they offended by the idea of being on call? Are they overly clever or do they look to simplify? (God, I hate clever software engineers 🙃.)

It’s valuable to get a feel for an engineer’s operational chops, but let’s be clear, you’re doing this for one big reason: to set expectations. By making ops questions part of the interview, you’re establishing from the start that you run an org where operations is valued, where ownership is non-optional. This is not an ivory tower where software engineers can merrily git push and go home for the day and let other people handle the fallout

It can be toxic when you have an engineer who thinks all ops work is toil and operations engineering is lesser-than. It tends to result in operations work being done very poorly. This is your best chance to let those people self-select out.

You know what, I’m actually feeling uncharacteristically optimistic right now. I’m remembering how controversial some of this stuff was when I first wrote it, five years ago in 2016. Nowadays it just sounds obvious. Like table stakes.

Hell yeah. 🤘

Why every software engineering interview should include ops questions

Notes on the Perfidy of Dashboards

The other day I said this on twitter —

… which stirred up some Feelings for many people. 🙃  So I would like to explain my opinions in more detail.

Static vs dynamic dashboards

First, let’s define the term. When I say “dashboard”, I mean STATIC dashboards, i.e. collections of metrics-based graphs that you cannot click on to dive deeper or break down or pivot. If your dashboard supports this sort of responsive querying and exploration, where you can click on any graph to drill down and slice and dice the data arbitrarily, then breathe easy — that’s not what I’m talking about. Those are great. (I don’t really consider them dashboards, but I have heard a few people refer to them as “dynamic dashboards”.)

Actually, I’m not even “against” static dashboards. Every company has them, including Honeycomb. They’re great for getting a high level sense of system functioning, and tracking important stats over long intervals. They are a good starting point for investigations. Every company should have a small, tractable number of these which are easily accessible and shared by everyone.

Debugging with dashboards: it’s a trap

What dashboards are NOT good at is debugging, or understanding or describing novel system states.

I can hear some of you now: “But I’ve debugged countless super-hard unknown problems using only static dashboards!” Yes, I’m sure you have. If all you have is a hammer, you CAN use it to drive screws into the wall, but that doesn’t mean it’s the best tool. And It takes an extraordinary amount of knowledge and experience to be able to piece together a narrative that translates low-level system statistics into bugs in your software and back. Most software engineers don’t have that kind of systems experience or intuition…and they shouldn’t have to.

Why are dashboards bad for debugging? Think of it this way: every dashboard is an answer to a question someone asked at some point. Your monitoring system is probably littered with dashboards, thousands and thousands of them, most of whose questions have been long forgotten and many of whose source data streams have long since gone silent.

So you come along trying to investigate something, and what do you do? You start skimming through dashboards, eyes scanning furiously, looking for visual patterns — e.g. any spikes that happened around the same time as your incident. That’s not debugging, that’s pattern-matching. That’s … eyeball racing.

if we did math like we do dashboards

Imagine you’re in a math competition, and you get handed a problem to solve. But instead of pulling out your pencil and solving the equation, step by step, you start hollering out guesses.

“27!”
“19992.41!”
“1/4325!”

That’s what flipping through dashboards feels like to me. You’re riffling through a bunch of graphs that were relevant to some long-ago situation, without context or history, without showing their work. Sometimes you’ll spot the exact scenario, and — huzzah! — the number you shout is correct! But when it comes to unknown scenarios, the odds are not in your favor.

Debugging looks and feels very different from flipping through answers. You ask a question, examine the answer, and ask another question based on the result. (“Which endpoints were erroring? Are all of the requests erroring, or only some? What did they have in common?”, etc.)

You methodically put one foot in front of the other, following the trail of bread crumbs, until the data itself leads you to the answer.

The limitations of metrics and dashboards

Unfortunately, you cannot do that with metrics-based dashboards, because you stripped away the connective tissue of the event back when you wrote the metrics out to disk.

If you happened to notice while skimming through dashboards that your 404 errors spiked at 14:03, and your /payment and /import endpoints started erroring at 14.03, and your database started returning a bunch of mysql errors shortly after 14:00, you’ll probably assume that they’re all related and leap to find more evidence that confirms it.

But you cannot actually confirm that those events are the same ones, not with your metrics dashboards. You cannot drill down from errors to endpoints to error strings; for that, you’d need a wide structured data blob per request. Those might in fact be two or three separate outages or anomalies happening at the same time, or just the tip of the iceberg of a much larger event, and your hasty assumptions might extend the outage for much longer than was necessary.

With metrics, you tend to find what you’re looking for. You have no way to correlate attributes between requests or ask “what are all of the dimensions these requests have in common?”, or to flip back and forth and look at the request as a trace. Dashboards can be fairly effective at surfacing the causes of problems you’ve seen before (raise your hand if you’ve ever been in an incident review where one of the follow up tasks was, “create a dashboard that will help us find this next time”), but they’re all but useless for novel problems, your unknown-unknowns.

Other complaints about dashboards:

They tend to have percentiles like 95th, 99th, 99.9th, 99.99th, etc. Which can cover over a multitude of sins. You really want a tool that allows you to see MAX and MIN, and heatmap distributions.

A lot of dashboards end up getting created that are overly specific to the incident you just had — naming specific hosts, etc — which just creates clutter and toil. This is how your dashboards become that graveyard of past outages.

The most useful approach to dashboards is to maintain a small set of them; cull regularly, and think of them as a list of starter queries for your investigations.

Fred Hebert has this analogy, which I like:

“I like to compare the dashboards to the big display in a hospital room: heartbeat, pressure, oxygenation, etc. Those can tell you when a thing is wrong, but the context around the patient chart (and the patient themselves) is what allows interpretation to be effective. If all we have is the display but none of the rest, we’re not getting anywhere close to an accurate picture. The risk with the dashboard is having the metrics but not seeing or knowing about the rest changing.”

In conclusion

Dashboards aren’t universally awful. The overuse of them just encourages sloppy thinking, and static ones make it impossible for you to follow the plot of an outage, or validate your hypotheses. 🤒  There’s too many of them, and not enough shared consensus. (It would help if, like, new dashboards expired within a month if nobody looked at them again.)

If what you have is “nothing”, even shitty dashboards are far better than no dashboards. But shitty dashboards have been the only game in town for far too long. We need more vendors to think about building for queryability, explorability, and the ability to follow a trail of breadcrumbs. Modern systems are going to demand more and more of this approach.

Nothing < Dashboards < a Queryable, Exploratory Interface

If everyone out there who slaps “observability” on their web page also felt the responsibility to add an observability-enabling interface to their tool, one that would let users explore and identify unknown-unknowns, we would all be in a far better place. 🙂

 

 

 

 

 

Notes on the Perfidy of Dashboards

On Call Shouldn’t Suck: A Guide For Managers

There are few engineering topics that provoke as much heated commentary as oncall. Everybody has a strong opinion. So let me say straight up that there are few if any absolutes when it comes to doing this well; context is everything. What’s appropriate for a startup may not suit a larger team. Rules are made to be broken.

That said, I do have some feelings on the matter. Especially when it comes to the compact between engineering and management. Which is simply this:

It is engineering’s responsibility to be on call and own their code. It is management’s responsibility to make sure that on call does not suck. This is a handshake, it goes both ways, and if you do not hold up your end they should quit and leave you.

As for engineers who write code for 24×7 highly available services, it is a core part of their job is to support those services in production. (There are plenty of software jobs that do not involve building highly available services, for those who are offended by this.) Tossing it off to ops after tests pass is nothing but a thinly veiled form of engineering classism, and you can’t build high-performing systems by breaking up your feedback loops this way.

Someone needs to be responsible for your services in the off-hours. This cannot be an afterthought; it should play a prominent role in your hiring, team structure, and compensation decisions from the very start. These are decisions that define who you are and what you value as a team.

Some advice on how to organize your on call efforts, in no particular order.

  • It is easier to keep yourself from falling into an operational pit of doom than it is to claw your way out of one. Make good operational hygiene a priority from the start. Value good, clean, high-level abstractions that allow you to delegate large swaths of your infrastructure and operational burden to third parties who can do it better than you — serverless, AWS, *aaS, etc. Don’t fall into the trap of disrespecting operations engineering labor, it’s the only thing that can save you.
  • Invest in good release and deploy tooling. Make this part of your engineering roadmap, not something you find in the couch cushions. Get code into production within minutes after merging, and watch how many of your nightmares melt away or never happen.
  • Invest in good instrumentation and observability. Impress upon your engineers that their job is not done when tests pass; it is not done until they have watched users using their code in production. Promote an ownership mentality over the full software life cycle. This is how dev.to did it.
  • Construct your feedback loops thoughtfully. Try to alert the person who made the broken change directly. Never send an alert to someone who isn’t fully equipped and empowered to fix it.
  • When an engineer is on call, they are not responsible for normal project work — period. That time is sacred and devoted to fixing things, building tooling, and creating guard-rails to protect people from themselves. If nothing is on fire, the engineer can take the opportunity to fix whatever has been annoying them. Allow for plenty of agency and following one’s curiosity, wherever it may lead, and it will be a special treat.
  • Closely track how often your team gets alerted. Take ANY out-of-hours-alert seriously, and prioritize the work to fix it. Night time pages are heart attacks, not diabetes.
  • Consider joining the on call rotation yourself! If nothing else, generously pinch hit and be an eager and enthusiastic backup on the regular.
  • Reliability work and technical debt are not secondary to product work. Budget them into your roadmap, right alongside your features and fixes. Don’t plan so tightly that you have no flex for the unexpected. Don’t be afraid to push back on product and don’t neglect to sell it to your own bosses. People’s lives are in your hands; this is what you get paid to do.
  • Consider making after-hours on call fully-elective. Why not? What is keeping you from it? Fix those things. This is how Intercom did it.
  • Depending on your stage and available resources, consider compensating for it.This doesn’t have to be cash, it could be a Friday off the week after every on call rotation. The more established and funded a company you are, the more likely you should do this in order to surface the right incentives up the org chart.
  • Once you’ve dug yourself out of firefighting mode, invest in SLOs (Service Level Objectives). SLOs and observability are the mature way to get out of reactive mode and plan your engineering work based on tradeoffs and user impact.

I believe it is thoroughly possible to construct an on call rotation that is 100% opt-in, a badge of pride and accomplishment, something that brings meaning and mastery to people’s engineering roles and ties them emotionally to their users. I believe that being on call is something that you can genuinely look forward to.

But every single company is a unique complex sociotechnical snowflake. Flipping the script on whether on call is a burden or a blessing will require a unique solution, crafted to meet your specific needs and drawing on your specific history. It will require tinkering. It will take maintenance.

Above all: ✨RAISE YOUR STANDARDS✨ for what you expect from yourselves. Your greatest enemy is how easily you accept the status quo, and then make up excuses for why it is necessarily this way. You can do better. I know you can.

There is lots and lots of prior art out there when it comes to making on call work for you, and you should research it deeply. Watch some talks, read some pieces, talk to some people. But then you’ll have to strike out on your own and try something. Cargo-culting someone else’s solution is always the wrong answer.

Any asshole can write some code; owning and tending complex systems for the long run is the hard part. How you choose to shoulder this burden will be a deep reflection of your values and who you are as a team.

And if your on call experience is mandatory and severely life-impacting, and if you don’t take this dead seriously and fix it ASAP? I hope your team will leave you, and go find a place that truly values their time and sleep.

 

On Call Shouldn’t Suck: A Guide For Managers

Questionable Advice: War Rooms? Really?!?

My company has recently begun pushing for us to build and staff out what I can only describe as “command centers”. They’re picturing graphs, dashboards…people sitting around watching their monitors all day just to find out which apps or teams are having issues. With your experience in monitoring and observability, and your opinions on teams supporting their own applications…do you think this sounds like a bad idea? What are things to watch out for, or some ways this might all go sideways?

— Anonymous

Jesus motherfucking Christ on a stick. Is it 1995 where you work? That’s the only way I can try and read this plan like it makes sense.

It’s a giant waste of money and no, it won’t work. This path leads into a death spiral where alarms are going off constantly (yet somehow never actually catching the real problems), people getting burned out, and anyone competent will either a) leave or b) refuse to be on call. Sideways enough for you yet?

Snark aside, there are two foundational flaws with this plan.

1) watching graphs is pointless. You can automate that shit, remember?  ✨Computers!✨ Furthermore, this whole monitoring-based approach will only ever help you find the known unknowns, the problems you already know to look for. But most of your actual problems will be unknown unknowns, the ones you don’t know about yet.

2) those people watching the graphs… When something goes wrong, what exactly can they do about it? The answer, unfortunately, is “not much”. The only people who can swiftly diagnose and fix complex systems issues are the people who build and maintain those systems, and those people are busy building and maintaining, not watching graphs.

That extra human layer is worse than useless; it is actively harmful. By insulating developers from the consequences of their actions, you are concealing from them the information they need to understand the consequences of their actions. You are interfering with the most basic of feedback loops and causing it to malfunction.

The best time to find a bug is as soon as possible after writing it, while it’s all fresh in your head. If you let it fester for days, weeks, or months, it will be exponentially more challenging to find and solve. And the best people to find those bugs are the people who wrote them

Helpful? Hope so. Good luck. And if they implement this anyway — leave. You deserve to work for a company that won’t waste your fucking time.

with love, charity.

selfie - 4

Questionable Advice: War Rooms? Really?!?

Questionable Advice: “What’s the critical path?”

Dan Golant asked a great question today: “Any advice/reading on how to establish a team’s critical path?”

I repeated back: “establish a critical path?” and he clarified:

Yea, like, you talk about buttoning up your “critical path”, making sure it’s well-monitored etc. I think that the right first step to really improving Observability is establishing what business processes *must* happen, what our “critical paths” are. I’m trying to figure out whether there are particularly good questions to ask that can help us document what these paths are for my team/group in Eng.

“Critical path” is one of those phrases that I think I probably use a lot. Possibly because the very first real job I ever had was when I took a break from college and worked at criticalpath.net (“we handle the world’s email”) — and by “work” I mean, “lived in SF for a year when I was 18 and went to a lot of raves and did a lot of drugs with people way cooler than me”. Then I went back to college, the dotcom boom crashed, and the CP CFO and CEO actually went to jail for cooking the books, becoming the only tech execs I am aware of who actually went to jail.

Where was I.

Right, critical path. What I said to Dan is this: “What makes you money?”

Like, if you could only deploy three end-to-end checks that would perform entire operations on your site and ensure they work at all times, what would they be? what would they do? “Submit a payment” is a super common one; another is new user signups.

The idea here is to draw up a list of the things that are absolutely worth waking someone up to fix immediately, night or day, rain or shine. That list should be as compact and well-defined as possible. This allows you to be explicit about the fact that anything else can wait til morning, or some other less-demanding service level agreement.

And typically the right place to start on this list is by asking yourselves: “what makes us money?” as a proxy for the real questions, which are: “what actions allow us to survive as a business? What do our customers care the absolute most about? What makes us us?” That’s your critical path.

Someone will usually seize this opportunity to argue that absolutely any deterioration in service is worth paging someone immediately to fix it, day or night. They are wrong, but it’s good to flush these assumptions out and have this argument kindly out in the open.

(Also, this is really a question about service level objectives. So if you’re asking yourself about the critical path, you should probably consider buying Alex Hidalgo’s book on SLOs, and you may want to look into the Honeycomb SLO product, the only one in the industry that actually implements SLOs as the Google SRE book defines them (thanks Liz!) and lets you jump straight from “what are our customers experiencing?” to “WHY are they experiencing it”, without bouncing awkwardly from aggregate metrics to logs and back and just … hoping … the spikes line up according to your visual approximations.)

charity.
Questionable Advice: “What’s the critical path?”

Deploys: It’s Not Actually About Fridays

I just read this piece, which is basically a very long subtweet about my Friday deploy threads.  Go on and read it: I’ll wait.

Here’s the thing.  After getting over some of the personal gibes (smug optimism?  literally no one has ever accused me of being an optimist, kind sir), you may be expecting me to issue a vigorous rebuttal.  But I shan’t.  Because we are actually in violent agreement, almost entirely.

I have repeatedly stressed the following points:

  1. I want to make engineers’ lives better, by giving them more uninterrupted weekends and nights of sleep.  This is the goal that underpins everything I do.
  2. Anyone who ships code should develop and exercise good engineering judgment about when to deploy, every day of the week
  3. Every team has to make their own determination about which policies and norms are right given their circumstances and risk tolerance
  4. A policy of “no Friday deploys” may be reasonable for now but should be seen as a smell, a sign that your deploys are risky.  It is also likely to make things WORSE for you, not better, by causing you to adopt other risky practices (e.g. elongating the interval between merge and deploy, batching changes up in a single deploy)

This has been the most frustrating thing about this conversation: that a) I am not in fact the absolutist y’all are arguing against, and b) MY number one priority is engineers and their work/life balance.  Which makes this particularly aggravating:

Lastly there is some strange argument that choosing not to deploy on Friday “Shouldn’t be a source of glee and pride”. That one I haven’t figured out yet, because I have always had a lot of glee and pride in being extremely (overly?) protective of the work/life balance of the engineers who either work for me, or with me.  I don’t expect that to change.

Hold up.  Did you catch that clever little logic switcheroo?  You defined “not deploying on Friday” as being a priori synonymous with “protecting the work/life balance of engineers”.  This is how I know you haven’t actually grasped my point, and are arguing against a straw man.  My entire point is that the behaviors and practices associated with blocking Friday deploys are in fact hurting your engineers.

I, too, take a lot of glee and pride in being extremely, massively, yes even OVERLY protective of the work/life balance of the engineers who either work for me, or with me.

AND THAT IS WHY WE DEPLOY ON FRIDAYS.

Because it is BETTER for them.  Because it is part of a deploy ecosystem which results in them being woken up less and having fewer weekends interrupted overall than if I had blocked deploys on Fridays.fire_burn

It’s not about Fridays.  It’s about having a healthy ecosystem and feedback loop where you trust your deploys, where deploys aren’t a big deal, and they never cause engineers to have to work outside working hours.  And part of how you get there is by not artificially blocking off a big bunch of the week and not deploying during that time, because that breaks up your virtuous feedback loop and causes your deploys to be much more likely to fail in terrible ways.

The other thing that annoys me is when people say, primly, “you can’t guarantee any deploy is safe, but you can guarantee people have plans for the weekend.”

Know what else you can guarantee?  That people would like to sleep through the fucking night, even on weeknights.

When I hear people say this all I hear is that they don’t care enough to invest the time to actually fix their shit so it won’t wake people up or interrupt their off time, seven days a week.  Enough with the virtue signaling already.

You cannot have it both ways, where you block off a bunch of undeployable time AND you have robust, resilient, swift deploys.  Somehow I keep not getting this core point across to a substantial number of very intelligent people.  So let me try a different way.

Let’s try telling a story.

A tale of two startups

Here are two case studies.

Company X

Company X is a three-year-old startup.  It is a large, fast-growing multi-tenant platform on a large distributed system with spiky traffic, lots of user-submitted data, and a very green database.  Company X deploys the API about once per day, and does a global deploy of all services every Tuesday.  Deploys often involve some firefighting and a rollback or two, and Tuesdays often involve deploying and reverting all day (sigh).

Pager volume at Company X isn’t the worst, but usually involves getting woken up a couple times a week, and there are deploy-related alerts after maybe a third of deploys, which then need to be triaged to figure out whose diff was the cause.

Company Z

Company Z is a three-year-old startup.  It is a large, fast-growing multi-tenant platform on a large distributed system with spiky traffic, lots of user-submitted data, and a very green house-built distributed storage engine.  Company Z automatically triggers a deploy within 30 minutes of a merge to master, for all services impacted by that merge.  Developers at company Z practice observability-driven deployment, where they instrument all changes, ask “how will I know if this change doesn’t work?” during code review, and have a muscle memory habit of checking to see if their changes are working as intended or not after they merge to master.

Deploys rarely result in the pager going off at Company Z; most problems are caught visually by the engineer and reverted or fixed before any paging alert can fire.  Pager volume consists of roughly one alert per week outside of working hours, and no one is woken up more than a couple times per year.

Same damn problem, better damn solutions.

If it wasn’t extremely obvious, these companies are my last two jobs, Parse (company X, from 2012-2016) and Honeycomb (company Z, from 2016-present).

They have a LOT in common.  Both are services for developers, both are platforms, both are running highly elastic microservices written in golang, both get lots of spiky traffic and store lots of user-defined data in a young, homebrewed columnar storage engine.  They were even built by some of the same people (I built infra for both, and they share four more of the same developers).

At Parse, deploys were run by ops engineers because of how common it was for there to be some firefighting involved.  We discouraged people from deploying on Fridays, we locked deploys around holidays and big launches.  At Honeycomb, none of these things are true.  In fact, we literally can’t remember a time when it was hard to debug a deploy-related change.

Screen Shot 2019-10-28 at 12.04.48 AM

Screen Shot 2019-10-28 at 12.05.04 AM

What’s the difference between Company X and Company Z?

So: what’s the difference?  Why are the two companies so dramatically different in the riskiness of their deploys, and the amount of human toil it takes to keep them up?

I’ve thought about this a lot.  It comes down to three main things.

  1. Observability
  2. Observability-driven development
  3. Single merge per deploy

1. Observability. 

I think that I’ve been reluctant to hammer this home as much as I ought to, because I’m exquisitely sensitive about sounding like an obnoxious vendor trying to sell you things.  😛  (Which has absolutely been detrimental to my argument.)

When I say observability, I mean in the precise technical definition as I laid out in this piece: with high cardinality, arbitrarily wide structured events, etc.   Metrics and other generic telemetry will not give you the ability to do the necessary things, e.g. break down by build id in combination with all your other dimensions to see the world through the lens of your instrumentation.  Here, for example, are all the deploys for a particular service last Friday:

Screen Shot 2019-10-28 at 12.31.12 AM

Each shaded area is the duration of an individual deploy: you can see the counters for each build id, as the new versions replace the old ones,

2. Observability-driven development.

This is cultural as well as technical.  By this I mean instrumenting a couple steps ahead of yourself as you are developing and shipping code.  I mean making a cultural practice of asking each other “how will you know if this is broken?” during code review.  I mean always going and looking at your service through the lens of your instrumentation after every diff you ship.  Like muscle memory.

3.  Single merge per deploy.

The number one thing you can do to make your deploys intelligible, other than observability and instrumentation, is this: deploy one changeset at a time, as swiftly as possible after it is merged to master.  NEVER glom multiple changesets into a single deploy — that’s how you get into a state where you aren’t sure which change is at fault, or who to escalate to, or if it’s an intersection of multiple changes, or if you should just start bisecting blindly to try and isolate the source of the problem.  THIS is what turns deploys into long, painful marathons.

headlamps, illuminating whatever’s in front of my face: this is the image in my mind when i think about instrumenting my code

And NEVER wait hours or days to deploy after the change is merged.  As a developer, you know full well how this goes.  After you merge to master one of two things will happen.  Either:

  • you promptly pull up a window to watch your changes roll out, checking on your instrumentation to see if it’s doing what you intended it to or if anything looks weird, OR
  • you close the project and open a new one.

When you switch to a new project, your brain starts rapidly evicting all the rich context about what you had intended to do and and overwriting it with all the new details about the new project.

Whereas if you shipped that changeset right after merging, then you can WATCH it roll out.  And 80-90% of all problems can be, should be caught right here, before your users ever notice —  before alerts can fire off and page you.  If you have the ability to break down by build id, zoom in on any errors that happen to arise, see exactly which dimensions all the errors have in common and how they differ from the healthy requests, see exactly what the context is for any erroring requests.

Healthy feedback loops == healthy systems.

That tight, short feedback loop of build/ship/observe is the beating heart of a healthy, observable distributed system that can be run and maintained by human beings, without it sucking your life force or ruining your sleep schedule or will to live.

Most engineers have never worked on a system like this.  Most engineers have no idea what a yawning chasm exists between a healthy, tractable system and where they are now.  Most engineers have no idea what a difference observability can make.  Most engineers are far more familiar with spending 40-50% of their week fumbling around in the dark, trying to figure out where in the system is the problem they are trying to fix, and what kind of context do they need to reproduce.

Most engineers are dealing with systems where they blindly shipped bugs with no observability, and reports about those bugs started to trickle in over the next hours, days, weeks, months, or years.  Most engineers are dealing with systems that are obfuscated and obscure, systems which are tangled heaps of bugs and poorly understood behavior for years compounding upon years on end.

That’s why it doesn’t seem like such a big deal to you break up that tight, short feedback loop.  That’s why it doesn’t fill you with horror to think of merging on Friday morning and deploying on Monday.  That’s why it doesn’t appall you to clump together all the changes that happen to get merged between Friday and Monday and push them out in a single deploy.

It just doesn’t seem that much worse than what you normally deal with.  You think this raging trash fire is, unfortunately … normal.

How realistic is this, though, really?

Maybe you’re rolling your eyes at me now.  “Sure, Charity, that’s nice for you, on your brand new shiny system.  Ours has years of technical debt,  It’s unrealistic to hold us to the same standard.”

Yeah, I know.  It is much harder to dig yourself out of a hole than it is to not create a hole in the first place.  No doubt about that.

Harder, yes.  But not impossible.

I have done it.

Parse in 2013 was a trash fire.  It woke us up every night, we spent a lot of time stabbing around in the dark after every deploy.  But after we got acquired by Facebook, after we started shipping some data sets into Scuba, after (in retrospect, I can say) we had event-level observability for our systems, we were able to start paying down that debt and fixing our deploy systems.

We started hooking up that virtuous feedback loop, step by step.

  1. We reworked our CI/CD system so that it built a new artifact after every single merge.
  2. We put developers at the steering wheel so they could push their own changes out.
  3. We got better at instrumentation, and we made a habit of going to look at it during or after each deploy.
  4. We hooked up the pager so it would alert the person who merged the last diff, if an alert was generated within an hour after that service was deployed.

We started finding bugs quicker, faster, and paying down the tech debt we had amassed from shipping code without observability/visibility for many years.

Developers got in the habit of shipping their own changes, and watching them as they rolled out, and finding/fixing their bugs immediately.

It took some time.  But after a year of this, our formerly flaky, obscure, mysterious, massively multi-tenant service that was going down every day and wreaking havoc on our sleep schedules was tamed.  Deploys were swift and drama-free.  We stopped blocking deploys on Fridays, holidays, or any other days, because we realized our systems were more stable when we always shipped consistently and quickly.  

Allow me to repeat.  Our systems were more stable when we always shipped right after the changes were merged.  Our systems were less stable when we carved out times to pause deployments.  This was not common wisdom at the time, so it surprised me; yet I found it to be true over and over and over again.

This is literally why I started Honeycomb.

When I was leaving Facebook, I suddenly realized that this meant going back to the Dark Ages in terms of tooling.  I had become so accustomed to having the Parse+scuba tooling and being able to iteratively explore and ask any question without having to predict it in advance.  I couldn’t fathom giving it up.

The idea of going back to a world without observability, a world where one deployed and then stared anxiously at dashboards — it was unthinkable.  It was like I was being asked to give up my five senses for production — like I was going to be blind, deaf, dumb, without taste or touch.

Look, I agree with nearly everything in the author’s piece.  I could have written that piece myself five years ago.

But since then, I’ve learned that systems can be better.  They MUST be better.  Our systems are getting so rapidly more complex, they are outstripping our ability to understand and manage them using the past generation of tools.  If we don’t change our ways, it will chew up another generation of engineering lives, sleep schedules, relationships.

Observability isn’t the whole story.  But it’s certainly where it starts.  If you can’t see where you’re going, you can’t go very far.

Get you some observability.

And then raise your standards for how systems should feel, and how much of your human life they should consume.  Do better. 

Because I couldn’t agree with that other post more: it really is all about people and their real lives.

Listen, if you can swing a four day work week, more power to you (most of us can’t).  Any day you aren’t merging code to master, you have no need to deploy either.  It’s not about Fridays; it’s about the swift, virtuous feedback loop.

And nobody should be shamed for what they need to do to survive, given the state of their systems today.

But things aren’t gonna get better unless you see clearly how you are contributing to your present pain.  And congratulating ourselves for blocking Friday deploys is like congratulating ourselves for swatting ourselves in the face with the flyswatter.  It’s a gross hack.

Maybe you had a good reason.  Sure.  But I’m telling you, if you truly do care about people and their work/life balance: we can do a lot better.

charity.

IMG_9017

Deploys: It’s Not Actually About Fridays

Love (and Alerting) in the Time of Cholera (and Observability)

I made a vow this year to post one blog post a month, then I didn’t post anything at all from May to September.  I have some catching up to do.  😑   I’ve also been meaning to transcribe some of the twitter rants that I end up linking back to into blog posts, so if Graph Everything, Kittensthere’s anything you especially want me to write about, tell me now while I’m in repentance mode.

This is one request I happened to make a note of because I can’t believe I haven’t already written it up!  I’ve been saying the same thing over and over in talks and on twitter for years, but apparently never a blog post.

The question is: what is the proper role of alerting in the modern era of distributed systems?  Has it changed?  What are the updated best practices for alerting?

It’s a great question.  I want to wax philosophically about some stuff, but first let me briefly outline the way to modernize your alerting best practices:

  1. implement observability
  2. implement SLOs and/or end-to-end checks that traverse key code paths and correlate to user-impacting events
  3. create a secondary channel (tasks, ticketing system, whatever) for “things that on call should look at soon, but are not impacting users yet” which does not page anyone, but which on call is expected to look at (at least) first thing in the morning, last thing in the evening, and midday
  4. move as many paging alerts as possible to the secondary channel, by engineering your services to auto-remediate or run in degraded mode until they can be patched up
  5. wake people up only for SLOs and health checks that correlate to user-impacting events

Or, in an even shorter formulation: delete all your paging alerts, then page only on e2e alerts that mean users are in pain.  Rely on debugging tools for debugging, and paging only when users are in pain.

To understand why I advocate deleting all your paging alerts, and when it’s safe to delete them, first we need to understand why have we accumulated so many crappy paging alerts over the years.

Monoliths, LAMP stacks, and death by pagebomb

Here, let’s crib a couple of slides from one of my talks on observability.  Here are the characteristics of older monolithic LAMP-stack style systems, and best practices for running them:

 

The sad truth is, that when all you have is time series aggregates and traditional monitoring dashboards, you aren’t really debugging with science so much as you are relying on your gut and a handful of dashboards, using intuition and scraps of data to try and reconstruct an impossibly complex system state.

This works ok, as long as you have a relatively limited set of failure scenarios that happen over and over again.  You can just pattern match from past failures to current data, and most of the time your intuition can bridge the gap correctly.  Every time there’s Graph Everything Unicorn 2x2an outage, you post mortem the incident, figure out what happened, build a dashboard “to help us find the problem immediately next time”, create a detailed runbook for how to respond to it, and (often) configure a paging alert to detect that scenario.

Over time you build up a rich library of these responses.  So most of the time when you get paged you get a cluster of pages that actually serves to help you debug what’s happening.  For example, at Parse, if the error graph had a particular shape I immediately knew it was a redis outage.  Or, if I got paged about a high % of app servers all timing out in a short period of time, I could be almost certain the problem was due to mysql connections.  And so forth.

Things fall apart; the pagebomb cannot stand

However, this model falls apart fast with distributed systems.  There are just too many failures.  Failure is constant, continuous, eternal.  Failure stops being interesting.  It has to stop being interesting, or you will die.

 

 

 

Instead of a limited set of recurring error conditions, you have an infinitely long list of things that almost never happen …. except that one time they do.  If you invest your time into runbooks and monitoring checks, it’s wasted time if that edge case never happens again.

Frankly, any time you get paged about a distributed system, it should be a genuinely new failure that requires your full creative attention.  You shouldn’t just be checking your phone, going “oh THAT again”, and flipping through a runbook.  Every time you get paged it should be genuinely new and interesting.

And thus you should actually have drastically fewer paging alerts than you used to.

A better way: observability and SLOs.

Instead of paging alerts for every specific failure scenario, the technically correct answer is to define your SLOs (service level objectives) and page only on those, i.e. when you are going to run out of budget ahead of schedule.  But most people aren’t yet operating at this level of sophistication.  (SLOs sound easy, but are unbelievably challenging to do well; many great teams have tried and failed.  This is why we have built an SLO feature into Honeycomb that does the heavy lifting for you.  Currently alpha testing with users.)

If you haven’t yet caught the SLO religion, the alternate answer is that “you should only page on high level end-to-end alerts, the ones which traverse the code paths that make you money and correspond to user pain”.  Alert on the three golden signals: request rate, latency, and errors, and make sure to traverse every shard and/or storage type in your critical path.

That’s it.  Don’t alert on the state of individual storage instances, or replication, or anything that isn’t user-visible.

(To be clear: by “alert” I mean “paging humans at any time of day or night”.  You might reasonably choose to page people during normal work hours, but during sleepy hours most errors should be routed to a non-paging address.  Only wake people up for actual user-visible problems.)

Here’s the thing.  The reason we had all those paging alerts was because we depended on them to understand our systems.

Once you make the shift to observability, once you have rich instrumentation and the ability to swiftly zoom in from high level “there might be a problem” to identifying specifically what the errors have in common, or the source of the problem — you no longer need to lean on that scattershot bunch of pagebombs to understand your systems.  You should be able to confidently ask any question of your systems, understand any system state — even if you have never encountered it before.

With observability, you debug by systematically following the trail of crumbs back to their source, whatever that is.  Those paging alerts were a crutch, and now you don’t need them anymore.

Everyone is on call && on call doesn’t suck.

I often talk about how modern systems require software ownership.  The person who is writing the software, who has the original intent in their head, needs to shepherd that code out into production and watch real users use it.  You can’t chop that up into multiple roles, dev and ops.  You just can’t.  Software engineers working on highly available systems need to be on call for their code.Graph Unicorn 4_x4_

But the flip side of this responsibility belongs to management.  If you’re asking everyone to be on call, it is your sworn duty to make sure that on call does not suck.  People shouldn’t have to plan their lives around being on call.  People shouldn’t have to expect to be woken up on a regular basis.  Every paging alert out of hours should be as serious as a heart attack, and this means allocating real engineering resources to keeping tech debt down and noise levels low.

And the way you get there is first invest in observability, then delete all your paging alerts and start over from scratch.

It works.  It really does. 🌈

 

 

Love (and Alerting) in the Time of Cholera (and Observability)