Bring Back Ops Pride (xpost)

January 19, 2026 mipsytipsycrossposted, operations, ops, picking fights, substack1 Comment

Cross-posted from “Bring Back Ops Pride“

“Operations” is not a dirty word, a synonym for toil, or a title for people who can’t write code. May those who shit on ops get the operational outcomes they deserve.

I was planning to write something else today, but god dammit, I got nerd-sniped.

Last week I published a piece on the Honeycomb blog called “You Had One Job: Why Twenty Years of DevOps Has Failed to Do It.” Allow me to quote myself:

In retrospect, I think the entire DevOps movement was a mighty, twenty year battle to achieve one thing: a single feedback loop connecting devs with prod.

On those grounds, it failed.

Not because software engineers weren’t good at their jobs, or didn’t care enough. It failed because the technology wasn’t good enough. The tools we gave them weren’t designed for this, so using them could easily double, triple, or quadruple the time it took to do their job: writing business logic.

This isn’t true everywhere. Please keep in mind that all data tools are effectively fungible if you can assume an infinite amount of time, money, and engineering skill. You can run production off an Excel spreadsheet if you have to, and some SREs have done so. That doesn’t make it a great solution, the right use of resources, or accessible to the median engineering org.

It’s a fun piece, if I do say so myself, and you should go read it. (Much stick art!)

I posted the link on LinkedIn yesterday morning. Comments, as ever, ensued.

(The internet was a mistake, Virginia.)

“Devs should own everything”

A number of commenters said things like, “devs should own everything”, “make every team responsible for their own devops work”, and my personal favorite:

“I still think the main problem is with the ownership model – the fact that devs don’t own the full system, including infra and ops.”

(Courtesy of Alex Pulver, who has graciously allowed me to quote him here by name, adding that “he stands firmly behind this 😂”.)

As it happens, I have been aggressively advocating for the model I believe my friend is describing here, where software developers are empowered (and expected) to own their code in production, for approximately the past decade. No argument!

But “devs should own the full system, including infra and ops”?

We need to talk.

I do not think ‘ops’ means what you think it means

In software—and only in software—ops has become a dirty word. Nobody wants to claim it. Operations teams got renamed to DevOps teams, SRE, infrastructure, production engineering, or more recently, platform engineering teams. Anything but ops.

Ops means Toil! Hashtag #NoOps!

Number one, this is fucking ridiculous.

What’s wrong with operations? Ops is not a synonym for toil; it literally means “get shit done as efficiently as possible”. Every function has an operational component at scale: business ops, marketing ops, sales ops, product ops, design ops and everything else I could think of to search for, and so far as I can tell, none of them are treated with anything like the disrespect, dismissal and outright contempt that software engineering1 has chosen to heap upon its operational function.

Number two…what happened?

I can think of a number of contributing factors (APIs and cloud computing, soaring profit margins, etc), but I can also think of one easy, obvious, mustache-twirling villain, which would make a better story for you AND less work for me. (Root cause analysis wins again!! ✊)

Whose fault is it?

Google. It’s Google’s fault.

I know this, because I asked Google and it told me so.

(What? This is a free substack, not science.)

Here’s what I think happened. I think Google came out swinging, saying traditional operations could not scale and needed to become more like software engineering, and it was exactly the right message at exactly the right time, because cloud computing, APIs, SaaSes and so forth were just coming online and making it possible to manage systems more programmatically.

So far so good. But a crucial distinction got lost in translation, when we started defining this as developers (people who write code: good) vs operators (people who do things by hand: bad), which is what set us on the slippery slope to where we are today, where the entire business-critical function of operations engineering is widely seen as backwards and incompetent.

Dev vs Ops is a separation of concerns

The difference between “dev” and “ops” is not about whether or not you can write code. Dude, it’s 2026: everyone writes software.

The difference between dev and ops is a separation of concerns.

If your concern is building new features and products to attract customers and generate new revenue, then congrats: you’re a dev. (But you knew that.)

If your concern is building core services and protecting their ability to serve customers in the face of any and all threats (including, at the top of the list, your own developers): congratulations slash I’m sorry, but you are, in fact, in ops.

Both of these functional concerns are vital, as in “you literally can’t survive without them”, and complementary. You need product developers to be focused on building features and products, caring deeply about the experience of each user, and looking for ways to add value to the business. You need operations to provide a resilient, scalable, efficient base for those products to run on.

The hardest technical problems are found in ops

Ops is not “toil”. It does not mean “dummies who can’t program good”. Operations engineering is not easier or lesser than writing software to build products and features.

What’s darkly ironic is that, if anything, the opposite is true.

Product engineering is typically much simpler than infrastructure engineering—in part, of course, because one of the key functions of operations is to make it as easy as possible to build and ship products. Operations absorbs the toughest technical problems and provides a surface layer for product development that is simple, reliable, and easy to navigate.

Not because product engineers are dumb or lesser than (let’s not slip into that trap2 again!), but because cognitive bandwidth is the scarcest resource in any engineering org, and you want as much of that as possible going towards things that move the business materially forward, instead of wrestling with the messy underbelly.

The hardest technical challenges and the long, stubborn tail of intractable problems have always been on the infrastructure side. That’s why we work so hard to try not to have them—to solve them by partnerships, cloud computing, open source, etc. Anything is better than trying to build them again, starting over from scratch. We know the cost of new code in our bones.3

As I have said a thousand times: the closer you get to laying bits down on disk, the more conservative (and afraid) you should be.

The closer you get to user interaction, the more okay it is to get experimental, let AI take a shot, YOLO this puppy.

This is as it should be.

The cost and pain of developing software is approximately zero compared to the operational cost of maintaining it over time.

Domain level differences

The difference between dev and ops isn’t about writing code or not. But there are differences. In perspective, priorities, and (often) temperament.

I touched on a number of these in the article I just wrote on feedback loops, so I’m not going to repeat myself here.

The biggest difference I did not mention is that they have different relationships with resources and definitions of success.

Infrastructure is a cost center. You aren’t going to make more money if you give ten laptops to everyone in your company, and you aren’t going to make more money by over-spending on infrastructure, either. Great operations engineers and architects never forget that cost is a first class citizen of their engineering decisions.

You can, in theory, make more money by spending more on product engineering. This is what we refer to as an “investment”, although sometimes it seems to mean “engineers who forget their time costs money”.

(Sorry, that was rude.)

What about platform engineering?

“What about platform engineering?” Baby, that’s ops in dressup.

A bit less flippantly: I like my friend Abby Bangser’s quote: “platforms should encode things that are unique to your business but common to your teams”, and I like Jack Danger’s stick art, and his observation that “The only thing that naturally draws engineers to look at the middle of their system is pure blinding rage.”

What I love about the platform engineering movement is that it has brought design thinking and product development practices to the operational domain.

Yes, we should absolutely be treating our product developers like customers, and thinking critically about the interfaces we give them. Yes, there is a middle layer between infrastructure and product engineering, with patterns and footguns of its very own.

Also yes: from a functional perspective, platform engineering is still ops. (Or at least, more ops than not.)

Yes, you need an ops team. If you have hard operational problems. You should try not to have hard operational problems. (from a talk I gave back in 2015)

Does it matter what we call it?

Yeah, I kinda think it does.

All these trendy naming schemes do not change the core value of operations, which is to consolidate and efficiently serve the revenue-generating parts of the function.4 This is as true in technology as it is in sales or marketing. Running away from the term and denying your purpose muddies the water and causes confusion at the exact point where clarity is most needed.

An engineering team needs to know if they are oriented towards efficiency or investment. It changes how you hire, how you build, how you think about success and measure progress. It changes not only your appetite for risk, but what counts as a risk in the first place. You can’t optimize for both at once.

They also need to know whether they are responsible for the business logic or the platform it runs on.

Why? Because no one can do everything. Telling devs to own their code is one thing. (Great.) Asking them to own their code and the entire technological iceberg beneath it is wholly another. The more surface area you ask someone to master and attend to, the less focus you can expect from them in any given place. Do you want your revenue-generating teams generating revenue, or not?

If you can’t separate these concerns at the moment, maybe that’s something to work towards. Which is going to be hard to do, if we can’t talk about the function of operations without half the room running away and the remaining half squawking “toil!”

Naming is a form of respect

Operational rigor and excellence are not, how shall I say this…not yet something you can take for granted in the tech industry. The most striking thing about the 2025 DORA report was that the majority of companies report that AI is just adding more chaos to a system already defined by chaos. In other words, most companies are bad at ops.

To some extent, this is because the problems are hard. To a larger extent, I think it’s the cause (and result) of our wholesale abandonment of operations as a term of pride.

It’s another a fucking feedback loop. Ambitious young engineers get the message that being associated with ops is bad, so they run away from those teams. Managers and execs want to recruit great talent and make jobs sound enticing, so they adopt trendy naming schemes to make it clear this work is not ops.

If you want to do something well, historically speaking, this is not the way. The way to build excellence is to name it for what it is, build communities of practice, raise the bar, and compensate for a job well done.

Or as one prognosticator said, way back in 2016:

I think it’s time to bring back “operations” as a term of pride. As a thing that is valued, and rewarded.

“Operations” comes with baggage, no doubt. But I just don’t think that distance and denial are an effective approach for making something better, let alone trash talking and devaluing the skill sets that you need to deliver quality services.

You don’t make operational outcomes magically better by renaming the team “DevOps” or “SRE” or anything else. You make it better by naming it and claiming it for what it is, and helping everyone understand how their role relates to your operational objectives.

Wow. Truly, I couldn’t have said it better myself.

Footnotes

Not to get all Jungian on you all, but part of me has to wonder if “ops” represents the shadow self to software engineers, the parts of yourself that you hate and despise and are most insecure about (I am weak and bad at coding!! I might be automated out of a job!), and thus need to project onto some externalized other that can be safely loathed from a distance.

This might surprise you youngsters, but there was a time when systems folks were clearly the cool kids and developers were considered rather dim. Devs had to know data structures and algorithms, but sysadmins had to know everything. These things tend to come and go in cycles, so we may as well not shit on each other, eh?

As my friend Peter van Hardenburg likes to say, “The best code is no code at all. The second best code is code someone else writes and maintains for you. The worst code is the code you have to write and maintain yourself.” If it would fit on my knuckles, I would get this in knuckle tatts.

Would it be helpful to acknowledge that IT/ops serves an entirely different function than software engineering operations for production systems? Because it absolutely does.

The Future of Ops is Platform Engineering

September 30, 2022July 14, 2023 mipsytipsycareers, crossposted, culture, engineering, operations, platform engineering, sreLeave a comment

First published on 2022-09-30 at https://www.honeycomb.io/blog/future-ops-platform-engineering.

Two years ago I wrote a piece in The New Stack about the Future of Ops Careers. Towards the end, I wrote:

The reality is that jack-of-all-trades systems infrastructure jobs are slowly vanishing: the world doesn’t need thousands of people who can expertly tune postfix, SpamAssassin, and ClamAV—the world has Gmail. (…)

Building infrastructure and operational expertise used to be bundled together into a single role. But the industry is now bifurcating along an infrastructure fault line, and the overlap between infrastructure-oriented engineers and operationally-minded engineers is swiftly eroding. Engineers who love this work increasingly have a choice to make. Either you can 1) go deep on infrastructure by joining a company that does infrastructure as a service, or 2) go broad on operability by joining a company to help them do as little infrastructure as possible.

I described the second category as “operations engineering minus the infrastructure,” dedicated to evaluating and assembling a production stack of third-party platform providers, enabling software engineers to self-serve their services and own their own code in production. I said:

Your job will be to aggressively minimize the cycles your org devotes to infrastructure by finding effective ways to outsource or minimize infra labor. Your job is to NOT go deep if there is any workable alternative.
Your job will be to work cross-functionally with all the other software engineering teams, looking for ways to speed up their time to value and helping them own their own code in production.
Your job will be to move past the kludgey old models of “outsourcing” to sophisticated understandings of how and where to leverage abstractions that can radically accelerate development.

That second category I was describing now has a name. We call those teams “platform engineering.”

The fifty-year arc of software careers

In the beginning, there were people who wrote and ran software. At some point, we spun away ops skills from dev skills into two different professions, but that turned out to be a ginormous mistake, so along came DevOps to reunify them. Nowadays, ops as an independent profession is in the process of fading out. Companies are spinning down their ops teams left and right. Engineers who formerly identified as sysadmins or operations have turned into DevOps engineers, and soon there will just be “software people” again. This is the way of things.

Please note that this is NOT the same thing as saying “ops is dead,” or “ops skills are no longer valuable or needed¹.” Our systems are only getting more complex, more difficult to operate, and simultaneously more critical to life on earth, which means that operational excellence has never been more desperately needed (and if you don’t respect that, 🌈 you deserve to suffer 🌈).

The industry story of the past three to five years has been us trying to figure out how to help software engineers own their own code in production², phasing out dedicated ops teams, and aggressively outsourcing as much infrastructure as possible.

As we should. Developer cycles are the scarcest resource in your company, and you want to spend as many of those as possible on your core product: the crown jewel, the code that makes you a business. Money is cheaper than engineering cycles, and teams that are focused on their core business will always outperform teams whose focus is spread across dozens of non-revenue-generating projects. Let someone else build and run all the dependencies and adjacencies.

Before: some engineers wrote code, and some engineers ran code.

Now: all engineers write code, and all engineers run the code they write.

Platform engineering is what stands between you and darkness

When you start talking about putting software engineers on call for their own code, and generally being more involved in production, some percentage of the time you will hear back a guttural wail of despair: “You can’t expect me to know EVERYTHING about EVERYTHING!”

Quite right; we can’t. Platform engineering teams are part of the answer to this perfectly reasonable complaint. It’s not that you’re being asked to do or understand more in toto, but the distribution of labor and responsibility is shifting:

Before: some engineers wrote code, and some engineers ran code.

Now: all engineers write code, and all engineers run the code they write—but we divide the areas of responsibility by layer or function.

The emergence of a minimum viable self-serve tier

In the earliest days of a company, your first few engineers end up bootstrapping an infrastructure by reading AWS docs or blog posts, or asking a friend for recommendations to get started. They might start by setting up a managed container service, or configuring Terraform, and for a while everybody deploys and owns their own code, just as god intended.

But cognitive limits kick in pretty quickly. The maze of APIs and SDKs and components out there is simply bewildering, even for an experienced ops hand. Before long, it becomes someone’s job to make good decisions, pick a suite of compute and storage options that serve the team’s needs, and write some tooling that pulls everything into a coherent whole—which, at a minimum, lets you:

Run tests and generate new artifacts
Deploy artifacts, version them, and roll back
Instrument, monitor, and debug
Store data somewhere, manage schemas and migrations
Adjust capacity as needed
Define and commit all components (and their relationships) as code

Once these are built, it should be trivial for an engineer to come along and spin up a new service using templates and components from existing services. It should be much simpler and easier to use the blessed paths than anything else, and there should be friction if you go off the beaten path.

Congratulations! You’ve just been platformed 🎉. One of the key principles of any developer platform is that it should be easy to do the right things, and hard to do the wrong things.

The differences between platform engineering and traditional ops

Platform teams are typically staffed by engineers who are comfortable writing software. Not just scripting and automation, but writing tests and doing code reviews. Platform teams also operate much more like product development teams do, with product managers (and occasionally, designers, developer advocates, or UX researchers).

This doesn’t mean that everybody on a platform team has to have originally been a software engineer; in fact, a super common failure condition for platform teams is simply thinking all they need to do is hire software engineers to build developer tools. A strong platform team has an equally deep grounding in operations experience and software development. Individuals who are experts in both areas are fairly rare, but you can pull together a strong, well-rounded team by assembling a mix of SWEs (with some ops experience) and ops or DevOps engineers (with some software experience) and having them learn and grow from each other.

Platform teams are decidedly cloud-native; they actually mostly involve platforms built atop the cloud itself—PaaS, IaaS, everything-aaS, serverless, and so forth.

Ops/DevOps teams are oriented around managing infrastructure, often several generations of infrastructure. Their turf is everything from data centers and bare metal up through virtualization, containers, and the cloud (they aren’t so much cloud-native as cloud-enabled). They measure themselves on things like SLOs and the DORA metrics. You know they’re doing a good job if the system is up/available and users are happy.

Platform teams are oriented around providing a good experience for developers to self-serve and self-manage their code. The more swiftly and easily developers can move, the better your platform team. Operational excellence, in the platform model, is actually more the responsibility of the other engineering teams (and/or an adjacent SRE team) than that of the platform team.

Platform teams typically work higher up the stack than operations, DevOps, or SRE teams do, and they involve a great deal less infrastructure. On the contrary, platform teams are bent on paying other people to run as much shit as possible, preserving their own scarce development cycles for their core product.

Here is a somewhat tongue-in-cheek table of the similarities and differences between the archetypes.

Platform engineers vs. DevOps engineers

	Platform Engineer	Ops (or DevOps) Engineer
% of job spent writing code	> 50%	< 50%
Rest of time spent	Gathering product requirements, doing user research, architecture discussions, optimizing internal workflows, researching new tools and developer productivity ideas, reviewing other teams’ diffs for impact, performance tuning, helping other engineers own & scale their code, fixing CI/CD pipelines.	Fixing cron jobs, automating old setup docs, converting PXE/rsync to Chef/Puppet, converting Chef/Puppet to Terraform, converting VMs to containers, deploying software, debugging broken deploys, writing monitoring checks, doing retros, building out new services, pairing with software engineers to understand and debug their code, investigating weird shit, documentation, etc.
Responsible for	Enabling internal teams to self-serve their ability to run and own their code in production. Creating standard, reusable components and processes. Defining golden paths.	Infrastructure capacity planning, scaling, performance tuning, upgrading. Reliability and resiliency, SLOs and monitoring/alerting. Delivering quality experience to customers.
Builds for	Internal developer teams	Customers
Development style	Infrastructure as a product	Infrastructure as code
Works with product managers	Yes	No
Works with UX researchers or designers	Sometimes	No
Dashboards & graphs	Uses APM, observability, tracing. Cares a lot about instrumentation and OpenTelemetry.	Uses metrics, logs, dashboards; monitoring, alerting, and agent/sidecar/blackbox telemetry.
What ‘coding’ means to them	Developing new features & services, writing tests. These are (primarily) software people who do systems.	Automation, configuration, DSLs, extending and debugging existing code. These are systems people who do software.
Preferred language	Go, Rust	Python, Ruby
Time spent in Linux	Hardly any	A lot
Succeeds when	Developers can easily choose good defaults, self-serve their infra, and own their own code in production.	Infrastructure is scalable, secure, cost-effective, reliable, and customers are happy.
Native terrain	Serverless, *aaS, APIs for everything (cloud-native and above).	Instances, VMs, containers, regions, multi-cloud (everything “below,” but up to and including the cloud).
Databases	Uses hosted DBs	Runs their own, blending automation & DBA expertise
SSH	No	Yes
Shell	REPL	bash/zsh
Mantra	“Run Less Software”	“Cattle, Not Pets”

What about DevOps vs. SRE?

Countless words have been spilled on the difference between DevOps and SRE³, which I won’t rehash.

Here’s what I’ll say: DevOps, to me, feels like a relevant concept for companies that have a lot of infrastructure to wrangle. Companies that do in fact have dev teams and ops teams, or dev teams and DevOps teams (🙄), tend to have a lot of operational shit to automate, test, and run. They use config management, virtualization, and containers, often managing several generations worth of technology, possibly even down to data centers and bare metal. DevOps is for companies that have some combination of bare metal, VMs, regions, AZs, multi-cloud, networking devices, self-managed databases, etc.

DevOps is capacious. It contains multitudes. DevOps writes code, and DevOps has a fuckload of code to manage.

It is also on its way to becoming irrelevant. We are swiftly entering a post-DevOps world.

SRE, to me, feels different. I associate SRE with very large companies, where they mostly have software engineers owning their own code in production, but maybe still struggle with it a bit. SREs are often embedded within software engineering teams or product groups, and they focus a lot on, well, reliability, as the name suggests.

This means they do less infrastructure jockeying or automating (although they still do some coding). They typically have a lot to say about instrumentation, monitoring and observability, and cross-functional coordination. They run incident response and do blameless retros, and they tend to be experts at scaling.

If a company has both a DevOps team and SRE, typically I expect to see the SRE team more on the frontlines, involved with incidents, telemetry, etc., and DevOps teams more on the backburner, slinging pipes and plumbing.

Observability engineering as a case study

In the same piece I referenced earlier, I also wrote about the role of observability teams. I said they should largely no longer be running their own monitoring and graphing software in-house. Yet there is still a place for observability teams to exist: they remain a critical link between outsourced solutions and internal developer needs.

That team should write libraries, generate examples, and drive standardization; ushering in consistency, predictability, and usability. They should partner with internal teams to evaluate use cases. They should partner with your vendors as roadmap stakeholders. They might also write glue code and helper modules to connect disparate data sources and create cohesive visualizations. Basically, that team becomes an integration point between your organization and the outsourced work.

I originally wrote this about observability, but it could just as easily be used to describe platform engineering as a whole. This is the role—being the bridge between other vendors and your own core software. It’s a very high-leverage place to sit.

Ops is dead, long live ops

I’ve spent a lot of time thinking about this because we’ve had such a hard time nailing down exactly who the Honeycomb customer is. Sometimes our buyer is an ops team buying it for their SWEs, sometimes it’s SREs in the midst of an outage, sometimes it’s a VP or director of engineering, or an architect, or a CTO, or a “full stack” engineering team, or even a product manager. It is hard to form a snappy answer out of that list.

The first couple questions every new go-to-market candidate asks us are “who is your buyer?” and “how do we help them?” To which I respond with a five minute ramble where I list every above persona and each of their pain points. Hardly the concrete answer they would like to receive.

As it goes, sociotechnical trends come and go. A year ago, Christine and I were speculating that platform engineering might be on the verge of consolidating the necessary ingredients that makes up our ideal buyer:

Writing and shipping code, and needing to understand their own code
Positioned to help other teams with their instrumentation patterns and tooling
Firmly cloud-native+ and untethered to hardware or traditional infrastructure

To my delight, since that conversation, these trends have only accelerated—and I, for one, welcome our new platform engineering overlords to the observability table. ☺️

If you’d like to learn more about platform engineering, we’ll be running a Twitter space on ✨ October 20th ✨ at 12:00 p.m. PT. Come join us! I’ll be there along with two colleagues and we’ll be answering your questions and shedding more light on the topic.

¹ I do hear people saying that, and it used to make me fucking furious, but now I just smugly remind myself how much self-inflicted suffering they are in for. Disrespecting operational expertise is the shortest path to never again sleeping through the night.

² It is rather incredible how rapidly this idea has taken off. When we started talking about putting developers on call for their code in 2016, people got seriously angry with us. Before that, the only twitter mention I could find of putting devs on call was one by (of course) Adrian Cockcroft, but by 2019-2020 it had stopped being controversial and soon became common wisdom.

³ I actually wrote one of those myself: DevOps vs SRE: Delayed Coverage of the Dumbest War). LMAO. I think Liz had the final word on this back in … 2017? 2018? … when she said something like class SRE implements DevOps. And yes, DevOps is a philosophy or a methodology and not a job title, etc.

Rituals for Engineering Teams

August 15, 2022July 14, 2023 mipsytipsyculture, engineering, management, operations, rituals, startups13 Comments

Last weekend I happened to pick up a book called “Rituals For Work: 50 Ways To Create Engagement, Shared Purpose, And A Culture That Can Adapt To Change.” It’s a super quick read, more comic book than textbook, but I liked it.

It got me thinking about the many rituals I have initiated and/or participated in over the course of my career. Of course, I never thought of them as such — I thought of them as “having fun at work” 🙃 — but now I realize these rituals disproportionately contribute to my favorite moments and the most precious memories of my career.

Rituals (a definition): Actions that a person or group does repeatedly, following a similar pattern or script, in which they’ve imbued symbolism and meaning.

I think it is extremely worth reading the first 27 pages of the book — the Introduction and Part One. To briefly sum up the first couple chapters: the power of creative rituals comes from their ability to link the physical with the psychological and emotional, all with the benefit of “regulation” and intentionality. Physically going through the process of a ritual helps people feel satisfied and in control, with better emotional regulation and the ability to act in a steadier and more focused way. Rituals also powerfully increase people’s sense of belonging, giving them a stable feeling of social connection. (p. 5-6)

The thing that grabbed me here is that rituals create a sense of belonging. You show that you belong to the group by participating in the ritual. You feel like you belong to the group by participating in the ritual. This is powerful shit!

It seems especially relevant these days when so many of us are atomized and physically separated from our teammates. That ineffable sense of belonging can make all the difference between a job that you do and a role that feeds your soul. Rituals are a way to create that sense of belonging. Hot damn.

So I thought I’d write up some of the rituals for engineering teams I remember from jobs past. I would love to hear about your favorite rituals, or your experience with them (good or bad). Tell me your stories at @mipsytipsy. 🙃

Rituals at Linden Lab

Feature Fish Freeze

At Linden Lab, in the ancient era of SVN, we had something called the “Feature Fish”. It was a rubber fish that we kept in the freezer, frozen in a block of ice. We would periodically cut a branch for testing and deployment and call a feature freeze. Merging code into the branch was painful and time consuming, so If you wanted to get a feature in after the code freeze, you had to first take the fish out of the freezer and unfreeze it.

This took a while, so you would have to sit there and consider your sins as it slowly thawed. Subtext: Do you really need to break code freeze?

Stuffy the Code Reviewer

You were supposed to pair with another engineer for code review. In your commit message, you had to include the name of your reviewer or your merge would be rejected. But the template would also accept the name “Stuffy”, to confess that your only reviewer had been…Stuffy, the stuffed animal.

However if your review partner was Stuffy, you would have to narrate the full explanation of Stuffy’s code review (i.e., what questions Stuffy asked, what changes he suggested and what he thought of your code) at the next engineering meeting. Out loud.

Shrek Ears

We had a matted green felt headband with ogre ears on it, called the Shrek Ears. The first time an engineer broke production, they would put on the Ears for a day. This might sound unpleasant, like a dunce cap, but no — it was a rite of passage. It was a badge of honor! Everyone breaks production eventually, if they’re working on something meaningful.

If you were wearing the Shrek Ears, people would stop you throughout the day and excitedly ask what happened, and reminisce about the first time they broke production. It became a way for 1) new engineers to meet lots of their teammates, 2) to socialize lots of production wisdom and risk factors, and 3) to normalize the fact that yes, things break sometimes, and it’s okay — nobody is going to yell at you. ☺️

This is probably the number one ritual that everybody remembers about Linden Lab. “Congratulations on breaking production — you’re really one of us now!”

Vorpal Bunny

We had a stuffed Vorpal Bunny, duct taped to a 3″ high speaker stand, and the operations engineer on call would put the bunny on their desk so people knew who it was safe to interrupt with questions or problems.

At some point we lost the bunny (and added more offices), but it lingered on in company lore since the engineers kept on changing their IRC nick to “$name-bunny” when they went on call.

There was also a monstrous, 4-foot-long stuffed rainbow trout that was the source of endless IRC bot humor… I am just now noticing what a large number of Linden memories involve stuffed animals. Perhaps not surprising, given how many furries were on our platform ☺️

Rituals at Parse

The Tiara of Technical Debt

Whenever an engineer really took one for the team and dove headfirst into a spaghetti mess of tech debt, we would award them the “Tiara of Technical Debt” at the weekly all hands. (It was a very sparkly rhinestone wedding tiara, and every engineer looked simply gorgeous in it.)

Examples included refactoring our golang rewrite code to support injection, converting our entire jenkins fleet from AWS instances to containers, and writing a new log parser for the gnarliest logs anyone had ever seen (for the MongoDB pluggable storage engine update).

Bonfire of the Unicorns

We spent nearly 2.5 years rewriting our entire ruby/rails API codebase to golang. Then there was an extremely long tail of getting rid of everything that used the ruby unicorn HTTP server, endpoint by endpoint, site by site, service by service.

When we finally spun down the last unicorn workers, I brought in a bunch of rainbow unicorn paper sculptures and a jug of lighter fluid, and we ceremonially set fire to them in the Facebook courtyard, while many of the engineers in attendance gave their own (short but profane) eulogies.

Mission Accomplished

This one requires a bit of backstory.

For two solid years after the acquisiton, Facebook leadership kept pressuring us to move off of AWS and on to FB infra. We kept saying “no, this is a bad idea; you have a flat network, and we allow developers all over the world to upload and execute random snippets of javascript,” and “no, this isn’t cost effective, because we run large multi-terabyte MongoDB replica sets by RAIDing together multiple EBS volumes, and you only have 2.5TB FusionIO (for extremely high-perf mysql/RocksDB) and 40 TB spinning rust volumes (for Hadoop), and also it’s impossible to shrink or slice up replsets”, and so forth. But they were adamant. “You don’t understand. We’re Facebook. We can do anything.” (Literal quote)

Finally we caved and got on board. We were excited! I announced the migration and started providing biweekly updates to the infra leadership groups. Four months later, when the migration was half done, I get a ping from the same exact members of Facebook leadership:

“What are you doing?!?”
“Migrating!”
“You can’t do that, there are security issues!”
“No it’s fine, we have a fix for it.”
“There are hardware issues!”
“No it’s cool, we got it.”
“You can’t do this!!!”

ANYWAY. To make an EXTREMELY long and infuriating story short, they pulled the plug and canned the whole project. So I printed up a ten foot long “Mission Accomplished” banner (courtesy of George W Bush on the aircraft carrier), used Zuck’s credit card to buy $800 of top-shelf whiskey delivered straight to my desk (and cupcakes), and we threw an angry, ranty party until we all got it out of our systems.

Blue Hair

I honestly don’t remember what this one was about, but I have extensive photographic evidence to prove that I shaved the heads of and/or dyed the hair blue of at least seven members of engineering. I wish I could remember why! but all I remember is that it was fucking hilarious.

In Conclusion

Coincidentally (or not), I have no memories of participating in any rituals at the jobs I didn’t like, only the jobs I loved. Huh.

One thing that stands out in my mind is that all the fun rituals tend to come bottoms-up. A ritual that comes from your VP can run the risk of feeling like forced fun, in a way it doesn’t if it’s coming from your peer or even your manager. I actually had the MOST fun with this shit as a line manager, because 1) I had budget and 2) it was my job to care about teaminess.

There are other rituals that it does make sense for executives to create, but they are less about hilarious fun and more about reinforcing values. Like Amazon’s infamous door desks are basically just a ritual to remind people to be frugal.

Rituals tend to accrue mutations and layers of meaning as time goes on. Great rituals often make no sense to anybody who isn’t in the know — that’s part of the magic of belonging. 🥰

Now, go tell me about yours!

charity

Software deploys and cognitive biases

August 27, 2021July 14, 2023 mipsytipsycontinuous deployment, deploys, engineering, friday deploys, operations, sre1 Comment

There exist some wonderful teams out there who have valid, well thought through, legitimate reasons for enforcing “NO FRIDAY DEPLOYS” week in and week out, for not hooking CI/CD up to autodeploy, and for not shipping one person’s changes at a time.

And then there are the reasons most people have.

Bad decisions, and the biases they came from

“It is more dangerous to deploy than not to deploy” (prevention bias, zero-risk bias)
“It’s always riskier to do something than to not do something” (omission bias)
“Deploys are scary, so we need to slow down and be careful.”(slow motion bias)
“We’ve always done it this way; it works fine, you’re exaggerating, it doesn’t slow us down like those stories you tell.” (plan continuation bias, mere-exposure effect, time-saving bias, status-quo bias, default effect)
“Maybe that works for some teams, like baby startups, but not real software systems that people rely on” (false-uniqueness bias)
“We’ve already invested a ton of engineering hours into building a deployment framework that doesn’t ship especially fast and doesn’t ship one change at a time, but it works, and we don’t want to have to redo everything from scratch.” (surrogation, ostrich effect, irrational escalation, ikea-effect, law of the instrument)
“I get why it’s important, but the most important thing right now is for us to ship all of these features. At some point we’ll have the spare time to fix our deploys.” (hyperbolic discounting)
“We tried it once, and the whole site went down. Never again.” (non-adaptive choice switching, selective perception)
“Everybody says that no Friday deploys is the safe, sane, and caring choice.” (illusory truth effect, surrogation, continued influence effect, conservatism bias)
“Deploys are just inherently scary; there’s nothing that can be done about that. You should do them sparingly and with someone monitoring it closely.” (availability bias, dread aversion, functional fixedness)
“The best way to protect people’s personal time is to let merged code pile up between Thursday night and Monday morning, and ship all at once.” (continued-influence effect, plan-continuation bias, pseudocertainty effect, zero-risk bias)
“There is nothing anyone could say or do to convince me that the best thing I can do as a manager to protect their weekends is not blocking Friday deploys. I just feel it in my gut. I just know.” (… I got nuthin)

We’re humans. 💜 We leap to conclusions with the wetware we have doing the best it can based on heuristics that feel objectively true, but are ultimately just emotional reactions based on past lived experience. And then we retroactively enshrine those goofy gut feelings with the language of noble motive and moral values.

“I tell people not to deploy to production … because I care so deeply about my team and their ability to have a quiet weekend.”

Barf. 🙄 That’s just like saying you tell your kid not to brush his teeth at night, because you care SO DEEPLY about him and his ability to go to bed calm and happy.

Once the retcon engine in your brain gets running, it comes up with all sorts of reasons. Plausible-sounding reasons! But every single argument of the items in the list above is materially false.

Deploy myths are never going away for good; they appeal to too many of our cognitive biases. But what if there was one simple thing you could do that would invert many of these cognitive biases and cause people to grapple with the question in a new way? What if you could kickstart a recalculation?

My next post will pick up right here. I’ll tell you all about the One Simple Trick you can do to fix your deploys and set you on the virtuous path of high-performing teams.

Til then, here’s what I’ve previously written on the topic.

https://charity.wtf/2018/08/19/shipping-software-should-not-be-scary/

https://charity.wtf/2019/05/01/friday-deploy-freezes-are-exactly-like-murdering-puppies/

https://charity.wtf/2019/10/28/deploys-its-not-actually-about-fridays/

https://charity.wtf/2021/02/19/how-much-is-your-fear-costing-you/

https://charity.wtf/2020/12/31/why-are-my-tests-so-slow-a-list-of-likely-suspects-anti-patterns-and-unresolved-personal-trauma/

Footnotes

Availability bias: The tendency to overestimate the likelihood of events with greater “availability” in memory, which can be influenced by how recent the memories are or how unusual or emotionally charged they may be.

Continued influence effect: The tendency to believe previously learned misinformation even after it has been corrected. Misinformation can still influence inferences one generates after a correction has occurred.

Conservatism bias: The tendency to revise one’s belief insufficiently when presented with new evidence.

Default effect: When given a choice between several options, the tendency to favor the default one.

Dread aversion: Just as losses yield double the emotional impact of gains, dread yields double the emotional impact of savouring

False-uniqueness bias: The tendency of people to see their projects and themselves as more singular than they actually are.

Functional fixedness: Limits a person to using an object only in the way it is traditionally used

Hyperbolic discounting: Discounting is the tendency for people to have a stronger preference for more immediate payoffs relative to later payoffs. Hyperbolic discounting leads to choices that are inconsistent over time – people make choices today that their future selves would prefer not to have made, despite using the same reasoning

IKEA effect: The tendency for people to place a disproportionately high value on objects that they partially assembled themselves, such as furniture from IKEA, regardless of the quality of the end product

Illusory truth effect: A tendency to believe that a statement is true if it is easier to process, or if it has been stated multiple times, regardless of its actual veracity.

Irrational escalation: The phenomenon where people justify increased investment in a decision, based on the cumulative prior investment, despite new evidence suggesting that the decision was probably wrong. Also known as the sunk cost fallacy

Law of the instrument: An over-reliance on a familiar tool or methods, ignoring or under-valuing alternative approaches. “If all you have is a hammer, everything looks like a nail”

Mere exposure effect: The tendency to express undue liking for things merely because of familiarity with them

Negativity bias: Psychological phenomenon by which humans have a greater recall of unpleasant memories compared with positive memories

Non-adaptive choice switching: After experiencing a bad outcome with a decision problem, the tendency to avoid the choice previously made when faced with the same decision problem again, even though the choice was optimal

Omission bias: The tendency to judge harmful actions (commissions) as worse, or less moral, than equally harmful inactions (omissions).

Ostrich effect: Ignoring an obvious (negative) situation

Plan continuation bias: Failure to recognize that the original plan of action is no longer appropriate for a changing situation or for a situation that is different than anticipated

Prevention bias: When investing money to protect against risks, decision makers perceive that a dollar spent on prevention buys more security than a dollar spent on timely detection and response, even when investing in either option is equally effective

Pseudocertainty effect: The tendency to make risk-averse choices if the expected outcome is positive, but make risk-seeking choices to avoid negative outcomes

Salience bias: The tendency to focus on items that are more prominent or emotionally striking and ignore those that are unremarkable, even though this difference is often irrelevant by objective standards

Selective perception bias: The tendency for expectations to affect perception

Status-quo bias: If no special action is taken, the default action that will happen is that the code will go live. You will need an especially compelling reason to override this bias and manually stop the code from going live, as it would by default.

Slow-motion bias: We feel certain that we are more careful and less risky when we slow down. This is precisely the opposite of the real world risk factors for shipping software. Slow is dangerous for software; speed is safety. The more frequently you ship code, the smaller the diffs you ship, the less dangerous each one actually becomes. This is the most powerful and difficult to overcome of all of our biases, because there is no readily available counter-metaphor for us to use. (Riding a bike is the best I’ve come up with. 😔)

Surrogation: Losing sight of the strategic construct that a measure is intended to represent, and subsequently acting as though the measure is the construct of interest

Time-saving bias: Underestimations of the time that could be saved (or lost) when increasing (or decreasing) from a relatively low speed and overestimations of the time that could be saved (or lost) when increasing (or decreasing) from a relatively high speed.

Zero-risk bias: Preference for reducing a small risk to zero over a greater reduction in a larger risk.

Why every software engineering interview should include ops questions

August 21, 2021July 14, 2023 mipsytipsyculture, engineering, hiring, management, operations, sre, tech culture4 Comments

I’ve fallen way behind on my blog posts — my goal was to write one per month, and I haven’t published anything since MAY. Egads. So here I am dipping into the drafts archives! This one was written in April of 2016, when I was noodling over my CraftConf 2016 talk on “DevOps for Developers (see slides).”

So I got to the part in my talk where I’m talking about how to interview and hire software engineers who aren’t going to burn the fucking house down, and realized I could spend a solid hour on that question alone. That’s why I decided to turn it into a blog post instead.

Stop telling ops people to code better, start telling SWEs to ops better

Our industry has gotten very good at pressing operations engineers to get better at writing code, writing tests, and software engineering in general these past few years. Which is great! But we have not been nearly so good at pushing software engineers to level up their systems skills. Which is unfortunate, because it is just as important.

Most systems suffer from the syndrome of running too much software. Tossing more software into the heap is as likely to cause more problems as often as it solves them.

We see this play out at companies stacked with good software engineers who have built horrifying spaghetti messes of their infrastructure, and then commence paging themselves to death.

The only way to unwind this is to reset expectations, and make it clear that

you are still responsible for your code after it’s been deployed to production, and
operational excellence is everyone’s job.

Operations is the constellation of tools, practices, policies, habits, and docs around shipping value to users, and every single one of us needs to participate in order to do this swiftly and safely.

Every software engineering interviewing loop should have an ops component.

Nobody interviews candidates for SRE or ops nowadays without asking some coding questions. You don’t have to be the greatest programmer in the world, but you can’t be functionally illiterate. The reverse is less common: asking software engineers basic, stupid questions about the lifecycle of their code, instrumentation best practices, etc.

It’s common practice at lots of companies now to have a software engineer in the loop for hiring SREs to evaluate their coding abilities. It should be just as common to have an ops engineer in the loop for a SWE hire, especially for any SWE who is being considered for a key senior position. Those are the people you most rely on to be mentors and role models for junior hires. All engineers should embrace the ethos of owning their code in production, and nobody should be promoted or hired into a senior role if they don’t.

And yes, that means all engineers! Even your iOS/Android engineers and website developers should be interested in what happens to their code after they hit deploy. They should care about things like instrumentation, and what kind of data they may need later to debug their problems, and how their features may impact other infrastructure components.

You need to balance out your software engineers with engineers who don’t react to every problem by writing more code. You need engineers who write code begrudgingly, as a last resort. You’ll find these priceless gems in ops and SRE.

ops questions for software engineers

The best questions are broad and start off easy, with plenty of reasonable answers and pathways to explore. Even beginners can give a reasonable answer, while experts can go on talking for hours.

For example: give them the specs for a new feature, and ask them to talk through the infrastructure choices and dependencies to support that feature. Do they ask about things like which languages, databases, and frameworks are already supported by the team? Do they understand what kind of monitoring and observability tools to use, do they ask about local instrumentation best practices?

Or design a full deployment pipeline together. Probe what they know about generating artifacts, versioning, rollbacks, branching vs master, canarying, rolling restarts, green/blue deploys, etc. How might they design a deploy tool? Talk through the tradeoffs.

Some other good starting points:

“Tell me about the last time you caused a production outage. What happened, how did you find out, how was it resolved, and what did you learn?”
“What are some of your favorite tools for visibility, instrumentation, and debugging?
“Latency seems to have doubled over the last 6 hours. Where do you start looking, how do you start debugging?”
And this chestnut: “What happens when you type ‘google.com’ into a web browser?” You would be fucking *astonished* how many senior software engineers don’t know a thing about DNS, HTTP, SSL/TLS, cookies, TCP/IP, routing, load balancers, web servers, proxies, and on and on.

Another question I really like is: “what’s your favorite API (or database, or language) and why?” followed up by “… and what are the worst things about it?” (True love doesn’t mean blind worship.)

Remember, you’re exploring someone’s experience and depth here, not giving them a pass-fail quiz. It’s okay if they don’t know it all. You’re also evaluating them on communication skills, which is severely underrated by most people but is actually as a key technical skill.

Signals to look for

You’re not looking for perfection. You are teasing out signals for things like, how will this person perform on a team where software engineers are expected to own their code? How much do they know about the world outside the code they write themselves? Are they curious, eager, and willing to learn, or fearful, incurious and begrudging?

Do they expect networks to be reliable? Do they expect databases to respond, retries to succeed? Are they offended by the idea of being on call? Are they overly clever or do they look to simplify? (God, I hate clever software engineers 🙃.)

It’s valuable to get a feel for an engineer’s operational chops, but let’s be clear, you’re doing this for one big reason: to set expectations. By making ops questions part of the interview, you’re establishing from the start that you run an org where operations is valued, where ownership is non-optional. This is not an ivory tower where software engineers can merrily git push and go home for the day and let other people handle the fallout

It can be toxic when you have an engineer who thinks all ops work is toil and operations engineering is lesser-than. It tends to result in operations work being done very poorly. This is your best chance to let those people self-select out.

You know what, I’m actually feeling uncharacteristically optimistic right now. I’m remembering how controversial some of this stuff was when I first wrote it, five years ago in 2016. Nowadays it just sounds obvious. Like table stakes.

Hell yeah. 🤘

Notes on the Perfidy of Dashboards

August 9, 2021July 14, 2023 mipsytipsydashboards, monitoring, operations, sre9 Comments

The other day I said this on twitter —

every dashboard is a sunk cost
every dashboard is an answer to some long-forgotten question
every dashboard is an invitation to pattern-match the past instead of interrogate the present
every dashboard gives the illusion of correlation
every dashboard dampens your thinking https://t.co/OIEowa1COa

— Charity Majors (@mipsytipsy) July 19, 2021

… which stirred up some Feelings for many people. 🙃 So I would like to explain my opinions in more detail.

Static vs dynamic dashboards

First, let’s define the term. When I say “dashboard”, I mean STATIC dashboards, i.e. collections of metrics-based graphs that you cannot click on to dive deeper or break down or pivot. If your dashboard supports this sort of responsive querying and exploration, where you can click on any graph to drill down and slice and dice the data arbitrarily, then breathe easy — that’s not what I’m talking about. Those are great. (I don’t really consider them dashboards, but I have heard a few people refer to them as “dynamic dashboards”.)

Actually, I’m not even “against” static dashboards. Every company has them, including Honeycomb. They’re great for getting a high level sense of system functioning, and tracking important stats over long intervals. They are a good starting point for investigations. Every company should have a small, tractable number of these which are easily accessible and shared by everyone.

Debugging with dashboards: it’s a trap

What dashboards are NOT good at is debugging, or understanding or describing novel system states.

I can hear some of you now: “But I’ve debugged countless super-hard unknown problems using only static dashboards!” Yes, I’m sure you have. If all you have is a hammer, you CAN use it to drive screws into the wall, but that doesn’t mean it’s the best tool. And It takes an extraordinary amount of knowledge and experience to be able to piece together a narrative that translates low-level system statistics into bugs in your software and back. Most software engineers don’t have that kind of systems experience or intuition…and they shouldn’t have to.

Why are dashboards bad for debugging? Think of it this way: every dashboard is an answer to a question someone asked at some point. Your monitoring system is probably littered with dashboards, thousands and thousands of them, most of whose questions have been long forgotten and many of whose source data streams have long since gone silent.

So you come along trying to investigate something, and what do you do? You start skimming through dashboards, eyes scanning furiously, looking for visual patterns — e.g. any spikes that happened around the same time as your incident. That’s not debugging, that’s pattern-matching. That’s … eyeball racing.

if we did math like we do dashboards

Imagine you’re in a math competition, and you get handed a problem to solve. But instead of pulling out your pencil and solving the equation, step by step, you start hollering out guesses.

“27!”
“19992.41!”
“1/4325!”

That’s what flipping through dashboards feels like to me. You’re riffling through a bunch of graphs that were relevant to some long-ago situation, without context or history, without showing their work. Sometimes you’ll spot the exact scenario, and — huzzah! — the number you shout is correct! But when it comes to unknown scenarios, the odds are not in your favor.

Debugging looks and feels very different from flipping through answers. You ask a question, examine the answer, and ask another question based on the result. (“Which endpoints were erroring? Are all of the requests erroring, or only some? What did they have in common?”, etc.)

You methodically put one foot in front of the other, following the trail of bread crumbs, until the data itself leads you to the answer.

The limitations of metrics and dashboards

Unfortunately, you cannot do that with metrics-based dashboards, because you stripped away the connective tissue of the event back when you wrote the metrics out to disk.

If you happened to notice while skimming through dashboards that your 404 errors spiked at 14:03, and your /payment and /import endpoints started erroring at 14.03, and your database started returning a bunch of mysql errors shortly after 14:00, you’ll probably assume that they’re all related and leap to find more evidence that confirms it.

But you cannot actually confirm that those events are the same ones, not with your metrics dashboards. You cannot drill down from errors to endpoints to error strings; for that, you’d need a wide structured data blob per request. Those might in fact be two or three separate outages or anomalies happening at the same time, or just the tip of the iceberg of a much larger event, and your hasty assumptions might extend the outage for much longer than was necessary.

With metrics, you tend to find what you’re looking for. You have no way to correlate attributes between requests or ask “what are all of the dimensions these requests have in common?”, or to flip back and forth and look at the request as a trace. Dashboards can be fairly effective at surfacing the causes of problems you’ve seen before (raise your hand if you’ve ever been in an incident review where one of the follow up tasks was, “create a dashboard that will help us find this next time”), but they’re all but useless for novel problems, your unknown-unknowns.

Other complaints about dashboards:

They tend to have percentiles like 95th, 99th, 99.9th, 99.99th, etc. Which can cover over a multitude of sins. You really want a tool that allows you to see MAX and MIN, and heatmap distributions.

A lot of dashboards end up getting created that are overly specific to the incident you just had — naming specific hosts, etc — which just creates clutter and toil. This is how your dashboards become that graveyard of past outages.

The most useful approach to dashboards is to maintain a small set of them; cull regularly, and think of them as a list of starter queries for your investigations.

Fred Hebert has this analogy, which I like:

“I like to compare the dashboards to the big display in a hospital room: heartbeat, pressure, oxygenation, etc. Those can tell you when a thing is wrong, but the context around the patient chart (and the patient themselves) is what allows interpretation to be effective. If all we have is the display but none of the rest, we’re not getting anywhere close to an accurate picture. The risk with the dashboard is having the metrics but not seeing or knowing about the rest changing.”

In conclusion

Dashboards aren’t universally awful. The overuse of them just encourages sloppy thinking, and static ones make it impossible for you to follow the plot of an outage, or validate your hypotheses. 🤒 There’s too many of them, and not enough shared consensus. (It would help if, like, new dashboards expired within a month if nobody looked at them again.)

If what you have is “nothing”, even shitty dashboards are far better than no dashboards. But shitty dashboards have been the only game in town for far too long. We need more vendors to think about building for queryability, explorability, and the ability to follow a trail of breadcrumbs. Modern systems are going to demand more and more of this approach.

Nothing < Dashboards < a Queryable, Exploratory Interface

If everyone out there who slaps “observability” on their web page also felt the responsibility to add an observability-enabling interface to their tool, one that would let users explore and identify unknown-unknowns, we would all be in a far better place. 🙂

On Call Shouldn’t Suck: A Guide For Managers

October 3, 2020July 14, 2023 mipsytipsyculture, management, operations, paging & alerting, sre, tech culture7 Comments

Maybe I need to write a blog post called "On Call For Managers". If you're asking engineers to be on call for their code — and you should — you owe in return:

– enough time to fix what's broken
– hands to do the work
– closely track how often they are interrupted/woken
– ..etc

— Charity Majors (@mipsytipsy) September 25, 2020

There are few engineering topics that provoke as much heated commentary as oncall. Everybody has a strong opinion. So let me say straight up that there are few if any absolutes when it comes to doing this well; context is everything. What’s appropriate for a startup may not suit a larger team. Rules are made to be broken.

That said, I do have some feelings on the matter. Especially when it comes to the compact between engineering and management. Which is simply this:

It is engineering’s responsibility to be on call and own their code. It is management’s responsibility to make sure that on call does not suck. This is a handshake, it goes both ways, and if you do not hold up your end they should quit and leave you.

As for engineers who write code for 24×7 highly available services, it is a core part of their job is to support those services in production. (There are plenty of software jobs that do not involve building highly available services, for those who are offended by this.) Tossing it off to ops after tests pass is nothing but a thinly veiled form of engineering classism, and you can’t build high-performing systems by breaking up your feedback loops this way.

Someone needs to be responsible for your services in the off-hours. This cannot be an afterthought; it should play a prominent role in your hiring, team structure, and compensation decisions from the very start. These are decisions that define who you are and what you value as a team.

Some advice on how to organize your on call efforts, in no particular order.

It is easier to keep yourself from falling into an operational pit of doom than it is to claw your way out of one. Make good operational hygiene a priority from the start. Value good, clean, high-level abstractions that allow you to delegate large swaths of your infrastructure and operational burden to third parties who can do it better than you — serverless, AWS, *aaS, etc. Don’t fall into the trap of disrespecting operations engineering labor, it’s the only thing that can save you.
Invest in good release and deploy tooling. Make this part of your engineering roadmap, not something you find in the couch cushions. Get code into production within minutes after merging, and watch how many of your nightmares melt away or never happen.
Invest in good instrumentation and observability. Impress upon your engineers that their job is not done when tests pass; it is not done until they have watched users using their code in production. Promote an ownership mentality over the full software life cycle. This is how dev.to did it.
Construct your feedback loops thoughtfully. Try to alert the person who made the broken change directly. Never send an alert to someone who isn’t fully equipped and empowered to fix it.
When an engineer is on call, they are not responsible for normal project work — period. That time is sacred and devoted to fixing things, building tooling, and creating guard-rails to protect people from themselves. If nothing is on fire, the engineer can take the opportunity to fix whatever has been annoying them. Allow for plenty of agency and following one’s curiosity, wherever it may lead, and it will be a special treat.
Closely track how often your team gets alerted. Take ANY out-of-hours-alert seriously, and prioritize the work to fix it. Night time pages are heart attacks, not diabetes.
Consider joining the on call rotation yourself! If nothing else, generously pinch hit and be an eager and enthusiastic backup on the regular.
Reliability work and technical debt are not secondary to product work. Budget them into your roadmap, right alongside your features and fixes. Don’t plan so tightly that you have no flex for the unexpected. Don’t be afraid to push back on product and don’t neglect to sell it to your own bosses. People’s lives are in your hands; this is what you get paid to do.
Consider making after-hours on call fully-elective. Why not? What is keeping you from it? Fix those things. This is how Intercom did it.
Depending on your stage and available resources, consider compensating for it.This doesn’t have to be cash, it could be a Friday off the week after every on call rotation. The more established and funded a company you are, the more likely you should do this in order to surface the right incentives up the org chart.
Once you’ve dug yourself out of firefighting mode, invest in SLOs (Service Level Objectives). SLOs and observability are the mature way to get out of reactive mode and plan your engineering work based on tradeoffs and user impact.

I believe it is thoroughly possible to construct an on call rotation that is 100% opt-in, a badge of pride and accomplishment, something that brings meaning and mastery to people’s engineering roles and ties them emotionally to their users. I believe that being on call is something that you can genuinely look forward to.

But every single company is a unique complex sociotechnical snowflake. Flipping the script on whether on call is a burden or a blessing will require a unique solution, crafted to meet your specific needs and drawing on your specific history. It will require tinkering. It will take maintenance.

Above all: ✨RAISE YOUR STANDARDS✨ for what you expect from yourselves. Your greatest enemy is how easily you accept the status quo, and then make up excuses for why it is necessarily this way. You can do better. I know you can.

treat every alarm like a heart attack. _fix_ the motherfucker.

i do not care if this causes product development to screech to a halt. amortize it over a slightly longer period of time and it will more than pay for itself. https://t.co/JSck2u86ff

— Charity Majors (@mipsytipsy) October 14, 2019

There is lots and lots of prior art out there when it comes to making on call work for you, and you should research it deeply. Watch some talks, read some pieces, talk to some people. But then you’ll have to strike out on your own and try something. Cargo-culting someone else’s solution is always the wrong answer.

Any asshole can write some code; owning and tending complex systems for the long run is the hard part. How you choose to shoulder this burden will be a deep reflection of your values and who you are as a team.

And if your on call experience is mandatory and severely life-impacting, and if you don’t take this dead seriously and fix it ASAP? I hope your team will leave you, and go find a place that truly values their time and sleep.

Questionable Advice: War Rooms? Really?!?

September 2, 2020July 14, 2023 mipsytipsyadvice, culture, management, monitoring, operations, sre, tech culture2 Comments

My company has recently begun pushing for us to build and staff out what I can only describe as “command centers”. They’re picturing graphs, dashboards…people sitting around watching their monitors all day just to find out which apps or teams are having issues. With your experience in monitoring and observability, and your opinions on teams supporting their own applications…do you think this sounds like a bad idea? What are things to watch out for, or some ways this might all go sideways?

— Anonymous

Jesus motherfucking Christ on a stick. Is it 1995 where you work? That’s the only way I can try and read this plan like it makes sense.

It’s a giant waste of money and no, it won’t work. This path leads into a death spiral where alarms are going off constantly (yet somehow never actually catching the real problems), people getting burned out, and anyone competent will either a) leave or b) refuse to be on call. Sideways enough for you yet?

Snark aside, there are two foundational flaws with this plan.

1) watching graphs is pointless. You can automate that shit, remember? ✨Computers!✨ Furthermore, this whole monitoring-based approach will only ever help you find the known unknowns, the problems you already know to look for. But most of your actual problems will be unknown unknowns, the ones you don’t know about yet.

2) those people watching the graphs… When something goes wrong, what exactly can they do about it? The answer, unfortunately, is “not much”. The only people who can swiftly diagnose and fix complex systems issues are the people who build and maintain those systems, and those people are busy building and maintaining, not watching graphs.

That extra human layer is worse than useless; it is actively harmful. By insulating developers from the consequences of their actions, you are concealing from them the information they need to understand the consequences of their actions. You are interfering with the most basic of feedback loops and causing it to malfunction.

The best time to find a bug is as soon as possible after writing it, while it’s all fresh in your head. If you let it fester for days, weeks, or months, it will be exponentially more challenging to find and solve. And the best people to find those bugs are the people who wrote them

Helpful? Hope so. Good luck. And if they implement this anyway — leave. You deserve to work for a company that won’t waste your fucking time.

with love, charity.

Questionable Advice: “What’s the critical path?”

July 24, 2020July 14, 2023 mipsytipsyadvice, operations, paging & alerting, sre5 Comments

Dan Golant asked a great question today: “Any advice/reading on how to establish a team’s critical path?”

I repeated back: “establish a critical path?” and he clarified:

Yea, like, you talk about buttoning up your “critical path”, making sure it’s well-monitored etc. I think that the right first step to really improving Observability is establishing what business processes *must* happen, what our “critical paths” are. I’m trying to figure out whether there are particularly good questions to ask that can help us document what these paths are for my team/group in Eng.

“Critical path” is one of those phrases that I think I probably use a lot. Possibly because the very first real job I ever had was when I took a break from college and worked at criticalpath.net (“we handle the world’s email”) — and by “work” I mean, “lived in SF for a year when I was 18 and went to a lot of raves and did a lot of drugs with people way cooler than me”. Then I went back to college, the dotcom boom crashed, and the CP CFO and CEO actually went to jail for cooking the books, becoming the only tech execs I am aware of who actually went to jail.

Where was I.

Right, critical path. What I said to Dan is this: “What makes you money?”

Like, if you could only deploy three end-to-end checks that would perform entire operations on your site and ensure they work at all times, what would they be? what would they do? “Submit a payment” is a super common one; another is new user signups.

The idea here is to draw up a list of the things that are absolutely worth waking someone up to fix immediately, night or day, rain or shine. That list should be as compact and well-defined as possible. This allows you to be explicit about the fact that anything else can wait til morning, or some other less-demanding service level agreement.

And typically the right place to start on this list is by asking yourselves: “what makes us money?” as a proxy for the real questions, which are: “what actions allow us to survive as a business? What do our customers care the absolute most about? What makes us us?” That’s your critical path.

Someone will usually seize this opportunity to argue that absolutely any deterioration in service is worth paging someone immediately to fix it, day or night. They are wrong, but it’s good to flush these assumptions out and have this argument kindly out in the open.

(Also, this is really a question about service level objectives. So if you’re asking yourself about the critical path, you should probably consider buying Alex Hidalgo’s book on SLOs, and you may want to look into the Honeycomb SLO product, the only one in the industry that actually implements SLOs as the Google SRE book defines them (thanks Liz!) and lets you jump straight from “what are our customers experiencing?” to “WHY are they experiencing it”, without bouncing awkwardly from aggregate metrics to logs and back and just … hoping … the spikes line up according to your visual approximations.)

charity.

Deploys: It’s Not Actually About Fridays

October 28, 2019July 14, 2023 mipsytipsyculture, deploys, friday deploys, management, operations, paging & alerting, sre, tech culture2 Comments

I just read this piece, which is basically a very long subtweet about my Friday deploy threads. Go on and read it: I’ll wait.

Here’s the thing. After getting over some of the personal gibes (smug optimism? literally no one has ever accused me of being an optimist, kind sir), you may be expecting me to issue a vigorous rebuttal. But I shan’t. Because we are actually in violent agreement, almost entirely.

I have repeatedly stressed the following points:

I want to make engineers’ lives better, by giving them more uninterrupted weekends and nights of sleep. This is the goal that underpins everything I do.
Anyone who ships code should develop and exercise good engineering judgment about when to deploy, every day of the week
Every team has to make their own determination about which policies and norms are right given their circumstances and risk tolerance
A policy of “no Friday deploys” may be reasonable for now but should be seen as a smell, a sign that your deploys are risky. It is also likely to make things WORSE for you, not better, by causing you to adopt other risky practices (e.g. elongating the interval between merge and deploy, batching changes up in a single deploy)

This has been the most frustrating thing about this conversation: that a) I am not in fact the absolutist y’all are arguing against, and b) MY number one priority is engineers and their work/life balance. Which makes this particularly aggravating:

Lastly there is some strange argument that choosing not to deploy on Friday “Shouldn’t be a source of glee and pride”. That one I haven’t figured out yet, because I have always had a lot of glee and pride in being extremely (overly?) protective of the work/life balance of the engineers who either work for me, or with me. I don’t expect that to change.

Hold up. Did you catch that clever little logic switcheroo? You defined “not deploying on Friday” as being a priori synonymous with “protecting the work/life balance of engineers”. This is how I know you haven’t actually grasped my point, and are arguing against a straw man. My entire point is that the behaviors and practices associated with blocking Friday deploys are in fact hurting your engineers.

I, too, take a lot of glee and pride in being extremely, massively, yes even OVERLY protective of the work/life balance of the engineers who either work for me, or with me.

AND THAT IS WHY WE DEPLOY ON FRIDAYS.

Because it is BETTER for them. Because it is part of a deploy ecosystem which results in them being woken up less and having fewer weekends interrupted overall than if I had blocked deploys on Fridays.

It’s not about Fridays. It’s about having a healthy ecosystem and feedback loop where you trust your deploys, where deploys aren’t a big deal, and they never cause engineers to have to work outside working hours. And part of how you get there is by not artificially blocking off a big bunch of the week and not deploying during that time, because that breaks up your virtuous feedback loop and causes your deploys to be much more likely to fail in terrible ways.

The other thing that annoys me is when people say, primly, “you can’t guarantee any deploy is safe, but you can guarantee people have plans for the weekend.”

Know what else you can guarantee? That people would like to sleep through the fucking night, even on weeknights.

When I hear people say this all I hear is that they don’t care enough to invest the time to actually fix their shit so it won’t wake people up or interrupt their off time, seven days a week. Enough with the virtue signaling already.

You cannot have it both ways, where you block off a bunch of undeployable time AND you have robust, resilient, swift deploys. Somehow I keep not getting this core point across to a substantial number of very intelligent people. So let me try a different way.

Let’s try telling a story.

A tale of two startups

Here are two case studies.

Company X

Company X is a three-year-old startup. It is a large, fast-growing multi-tenant platform on a large distributed system with spiky traffic, lots of user-submitted data, and a very green database. Company X deploys the API about once per day, and does a global deploy of all services every Tuesday. Deploys often involve some firefighting and a rollback or two, and Tuesdays often involve deploying and reverting all day (sigh).

Pager volume at Company X isn’t the worst, but usually involves getting woken up a couple times a week, and there are deploy-related alerts after maybe a third of deploys, which then need to be triaged to figure out whose diff was the cause.

Company Z

Company Z is a three-year-old startup. It is a large, fast-growing multi-tenant platform on a large distributed system with spiky traffic, lots of user-submitted data, and a very green house-built distributed storage engine. Company Z automatically triggers a deploy within 30 minutes of a merge to master, for all services impacted by that merge. Developers at company Z practice observability-driven deployment, where they instrument all changes, ask “how will I know if this change doesn’t work?” during code review, and have a muscle memory habit of checking to see if their changes are working as intended or not after they merge to master.

Deploys rarely result in the pager going off at Company Z; most problems are caught visually by the engineer and reverted or fixed before any paging alert can fire. Pager volume consists of roughly one alert per week outside of working hours, and no one is woken up more than a couple times per year.

Same damn problem, better damn solutions.

If it wasn’t extremely obvious, these companies are my last two jobs, Parse (company X, from 2012-2016) and Honeycomb (company Z, from 2016-present).

They have a LOT in common. Both are services for developers, both are platforms, both are running highly elastic microservices written in golang, both get lots of spiky traffic and store lots of user-defined data in a young, homebrewed columnar storage engine. They were even built by some of the same people (I built infra for both, and they share four more of the same developers).

At Parse, deploys were run by ops engineers because of how common it was for there to be some firefighting involved. We discouraged people from deploying on Fridays, we locked deploys around holidays and big launches. At Honeycomb, none of these things are true. In fact, we literally can’t remember a time when it was hard to debug a deploy-related change.

What’s the difference between Company X and Company Z?

So: what’s the difference? Why are the two companies so dramatically different in the riskiness of their deploys, and the amount of human toil it takes to keep them up?

I’ve thought about this a lot. It comes down to three main things.

Observability
Observability-driven development
Single merge per deploy

1. Observability.

I think that I’ve been reluctant to hammer this home as much as I ought to, because I’m exquisitely sensitive about sounding like an obnoxious vendor trying to sell you things. 😛 (Which has absolutely been detrimental to my argument.)

When I say observability, I mean in the precise technical definition as I laid out in this piece: with high cardinality, arbitrarily wide structured events, etc. Metrics and other generic telemetry will not give you the ability to do the necessary things, e.g. break down by build id in combination with all your other dimensions to see the world through the lens of your instrumentation. Here, for example, are all the deploys for a particular service last Friday:

Each shaded area is the duration of an individual deploy: you can see the counters for each build id, as the new versions replace the old ones,

2. Observability-driven development.

This is cultural as well as technical. By this I mean instrumenting a couple steps ahead of yourself as you are developing and shipping code. I mean making a cultural practice of asking each other “how will you know if this is broken?” during code review. I mean always going and looking at your service through the lens of your instrumentation after every diff you ship. Like muscle memory.

3. Single merge per deploy.

The number one thing you can do to make your deploys intelligible, other than observability and instrumentation, is this: deploy one changeset at a time, as swiftly as possible after it is merged to master. NEVER glom multiple changesets into a single deploy — that’s how you get into a state where you aren’t sure which change is at fault, or who to escalate to, or if it’s an intersection of multiple changes, or if you should just start bisecting blindly to try and isolate the source of the problem. THIS is what turns deploys into long, painful marathons.

headlamps, illuminating whatever’s in front of my face: this is the image in my mind when i think about instrumenting my code

And NEVER wait hours or days to deploy after the change is merged. As a developer, you know full well how this goes. After you merge to master one of two things will happen. Either:

you promptly pull up a window to watch your changes roll out, checking on your instrumentation to see if it’s doing what you intended it to or if anything looks weird, OR
you close the project and open a new one.

When you switch to a new project, your brain starts rapidly evicting all the rich context about what you had intended to do and and overwriting it with all the new details about the new project.

Whereas if you shipped that changeset right after merging, then you can WATCH it roll out. And 80-90% of all problems can be, should be caught right here, before your users ever notice — before alerts can fire off and page you. If you have the ability to break down by build id, zoom in on any errors that happen to arise, see exactly which dimensions all the errors have in common and how they differ from the healthy requests, see exactly what the context is for any erroring requests.

Healthy feedback loops == healthy systems.

That tight, short feedback loop of build/ship/observe is the beating heart of a healthy, observable distributed system that can be run and maintained by human beings, without it sucking your life force or ruining your sleep schedule or will to live.

Most engineers have never worked on a system like this. Most engineers have no idea what a yawning chasm exists between a healthy, tractable system and where they are now. Most engineers have no idea what a difference observability can make. Most engineers are far more familiar with spending 40-50% of their week fumbling around in the dark, trying to figure out where in the system is the problem they are trying to fix, and what kind of context do they need to reproduce.

Most engineers are dealing with systems where they blindly shipped bugs with no observability, and reports about those bugs started to trickle in over the next hours, days, weeks, months, or years. Most engineers are dealing with systems that are obfuscated and obscure, systems which are tangled heaps of bugs and poorly understood behavior for years compounding upon years on end.

That’s why it doesn’t seem like such a big deal to you break up that tight, short feedback loop. That’s why it doesn’t fill you with horror to think of merging on Friday morning and deploying on Monday. That’s why it doesn’t appall you to clump together all the changes that happen to get merged between Friday and Monday and push them out in a single deploy.

It just doesn’t seem that much worse than what you normally deal with. You think this raging trash fire is, unfortunately … normal.

How realistic is this, though, really?

Maybe you’re rolling your eyes at me now. “Sure, Charity, that’s nice for you, on your brand new shiny system. Ours has years of technical debt, It’s unrealistic to hold us to the same standard.”

Yeah, I know. It is much harder to dig yourself out of a hole than it is to not create a hole in the first place. No doubt about that.

Harder, yes. But not impossible.

I have done it.

Parse in 2013 was a trash fire. It woke us up every night, we spent a lot of time stabbing around in the dark after every deploy. But after we got acquired by Facebook, after we started shipping some data sets into Scuba, after (in retrospect, I can say) we had event-level observability for our systems, we were able to start paying down that debt and fixing our deploy systems.

We started hooking up that virtuous feedback loop, step by step.

We reworked our CI/CD system so that it built a new artifact after every single merge.
We put developers at the steering wheel so they could push their own changes out.
We got better at instrumentation, and we made a habit of going to look at it during or after each deploy.
We hooked up the pager so it would alert the person who merged the last diff, if an alert was generated within an hour after that service was deployed.

We started finding bugs quicker, faster, and paying down the tech debt we had amassed from shipping code without observability/visibility for many years.

Developers got in the habit of shipping their own changes, and watching them as they rolled out, and finding/fixing their bugs immediately.

It took some time. But after a year of this, our formerly flaky, obscure, mysterious, massively multi-tenant service that was going down every day and wreaking havoc on our sleep schedules was tamed. Deploys were swift and drama-free. We stopped blocking deploys on Fridays, holidays, or any other days, because we realized our systems were more stable when we always shipped consistently and quickly.

Allow me to repeat. Our systems were more stable when we always shipped right after the changes were merged. Our systems were less stable when we carved out times to pause deployments. This was not common wisdom at the time, so it surprised me; yet I found it to be true over and over and over again.

This is literally why I started Honeycomb.

When I was leaving Facebook, I suddenly realized that this meant going back to the Dark Ages in terms of tooling. I had become so accustomed to having the Parse+scuba tooling and being able to iteratively explore and ask any question without having to predict it in advance. I couldn’t fathom giving it up.

The idea of going back to a world without observability, a world where one deployed and then stared anxiously at dashboards — it was unthinkable. It was like I was being asked to give up my five senses for production — like I was going to be blind, deaf, dumb, without taste or touch.

Look, I agree with nearly everything in the author’s piece. I could have written that piece myself five years ago.

But since then, I’ve learned that systems can be better. They MUST be better. Our systems are getting so rapidly more complex, they are outstripping our ability to understand and manage them using the past generation of tools. If we don’t change our ways, it will chew up another generation of engineering lives, sleep schedules, relationships.

Observability isn’t the whole story. But it’s certainly where it starts. If you can’t see where you’re going, you can’t go very far.

Get you some observability.

And then raise your standards for how systems should feel, and how much of your human life they should consume. Do better.

Because I couldn’t agree with that other post more: it really is all about people and their real lives.

Listen, if you can swing a four day work week, more power to you (most of us can’t). Any day you aren’t merging code to master, you have no need to deploy either. It’s not about Fridays; it’s about the swift, virtuous feedback loop.

And nobody should be shamed for what they need to do to survive, given the state of their systems today.

But things aren’t gonna get better unless you see clearly how you are contributing to your present pain. And congratulating ourselves for blocking Friday deploys is like congratulating ourselves for swatting ourselves in the face with the flyswatter. It’s a gross hack.

Maybe you had a good reason. Sure. But I’m telling you, if you truly do care about people and their work/life balance: we can do a lot better.

charity.

Love (and Alerting) in the Time of Cholera (and Observability)

September 20, 2019July 14, 2023 mipsytipsymonitoring, operations, paging & alerting, sre1 Comment

I made a vow this year to post one blog post a month, then I didn’t post anything at all from May to September. I have some catching up to do. 😑 I’ve also been meaning to transcribe some of the twitter rants that I end up linking back to into blog posts, so if there’s anything you especially want me to write about, tell me now while I’m in repentance mode.

This is one request I happened to make a note of because I can’t believe I haven’t already written it up! I’ve been saying the same thing over and over in talks and on twitter for years, but apparently never a blog post.

The question is: what is the proper role of alerting in the modern era of distributed systems? Has it changed? What are the updated best practices for alerting?

@mipsytipsy I've seen your thoughts on dashboards vs searching but haven't seen many thoughts from you on Alerting. Let me know if I've missed a blog somewhere on that! 🙂

— katy allred (@themindfulmommy) August 26, 2019

It’s a great question. I want to wax philosophically about some stuff, but first let me briefly outline the way to modernize your alerting best practices:

implement observability
implement SLOs and/or end-to-end checks that traverse key code paths and correlate to user-impacting events
create a secondary channel (tasks, ticketing system, whatever) for “things that on call should look at soon, but are not impacting users yet” which does not page anyone, but which on call is expected to look at (at least) first thing in the morning, last thing in the evening, and midday
move as many paging alerts as possible to the secondary channel, by engineering your services to auto-remediate or run in degraded mode until they can be patched up
wake people up only for SLOs and health checks that correlate to user-impacting events

Or, in an even shorter formulation: delete all your paging alerts, then page only on e2e alerts that mean users are in pain. Rely on debugging tools for debugging, and paging only when users are in pain.

To understand why I advocate deleting all your paging alerts, and when it’s safe to delete them, first we need to understand why have we accumulated so many crappy paging alerts over the years.

Monoliths, LAMP stacks, and death by pagebomb

Here, let’s crib a couple of slides from one of my talks on observability. Here are the characteristics of older monolithic LAMP-stack style systems, and best practices for running them:

The sad truth is, that when all you have is time series aggregates and traditional monitoring dashboards, you aren’t really debugging with science so much as you are relying on your gut and a handful of dashboards, using intuition and scraps of data to try and reconstruct an impossibly complex system state.

This works ok, as long as you have a relatively limited set of failure scenarios that happen over and over again. You can just pattern match from past failures to current data, and most of the time your intuition can bridge the gap correctly. Every time there’s an outage, you post mortem the incident, figure out what happened, build a dashboard “to help us find the problem immediately next time”, create a detailed runbook for how to respond to it, and (often) configure a paging alert to detect that scenario.

Over time you build up a rich library of these responses. So most of the time when you get paged you get a cluster of pages that actually serves to help you debug what’s happening. For example, at Parse, if the error graph had a particular shape I immediately knew it was a redis outage. Or, if I got paged about a high % of app servers all timing out in a short period of time, I could be almost certain the problem was due to mysql connections. And so forth.

Things fall apart; the pagebomb cannot stand

However, this model falls apart fast with distributed systems. There are just too many failures. Failure is constant, continuous, eternal. Failure stops being interesting. It has to stop being interesting, or you will die.

Instead of a limited set of recurring error conditions, you have an infinitely long list of things that almost never happen …. except that one time they do. If you invest your time into runbooks and monitoring checks, it’s wasted time if that edge case never happens again.

Frankly, any time you get paged about a distributed system, it should be a genuinely new failure that requires your full creative attention. You shouldn’t just be checking your phone, going “oh THAT again”, and flipping through a runbook. Every time you get paged it should be genuinely new and interesting.

Oh damn this talk looks baller. 😍 "Failure is important, but it is no longer interesting" — @this_hits_home… Netflix once again shining the light on where the rest of us need to get to over the next 3-5 years. 🙌🏅🎬 https://t.co/OY40Y0BTSa

— Charity Majors (@mipsytipsy) July 7, 2019

And thus you should actually have drastically fewer paging alerts than you used to.

A better way: observability and SLOs.

Instead of paging alerts for every specific failure scenario, the technically correct answer is to define your SLOs (service level objectives) and page only on those, i.e. when you are going to run out of budget ahead of schedule. But most people aren’t yet operating at this level of sophistication. (SLOs sound easy, but are unbelievably challenging to do well; many great teams have tried and failed. This is why we have built an SLO feature into Honeycomb that does the heavy lifting for you. Currently alpha testing with users.)

If you haven’t yet caught the SLO religion, the alternate answer is that “you should only page on high level end-to-end alerts, the ones which traverse the code paths that make you money and correspond to user pain”. Alert on the three golden signals: request rate, latency, and errors, and make sure to traverse every shard and/or storage type in your critical path.

That’s it. Don’t alert on the state of individual storage instances, or replication, or anything that isn’t user-visible.

(To be clear: by “alert” I mean “paging humans at any time of day or night”. You might reasonably choose to page people during normal work hours, but during sleepy hours most errors should be routed to a non-paging address. Only wake people up for actual user-visible problems.)

Here’s the thing. The reason we had all those paging alerts was because we depended on them to understand our systems.

Once you make the shift to observability, once you have rich instrumentation and the ability to swiftly zoom in from high level “there might be a problem” to identifying specifically what the errors have in common, or the source of the problem — you no longer need to lean on that scattershot bunch of pagebombs to understand your systems. You should be able to confidently ask any question of your systems, understand any system state — even if you have never encountered it before.

With observability, you debug by systematically following the trail of crumbs back to their source, whatever that is. Those paging alerts were a crutch, and now you don’t need them anymore.

Everyone is on call && on call doesn’t suck.

I often talk about how modern systems require software ownership. The person who is writing the software, who has the original intent in their head, needs to shepherd that code out into production and watch real users use it. You can’t chop that up into multiple roles, dev and ops. You just can’t. Software engineers working on highly available systems need to be on call for their code.

But the flip side of this responsibility belongs to management. If you’re asking everyone to be on call, it is your sworn duty to make sure that on call does not suck. People shouldn’t have to plan their lives around being on call. People shouldn’t have to expect to be woken up on a regular basis. Every paging alert out of hours should be as serious as a heart attack, and this means allocating real engineering resources to keeping tech debt down and noise levels low.

And the way you get there is first invest in observability, then delete all your paging alerts and start over from scratch.

It works. It really does. 🌈

Friday Deploy Freezes Are Exactly Like Murdering Puppies

May 1, 2019July 14, 2023 mipsytipsyculture, deploys, friday deploys, operations, paging & alerting, sre, tech culture20 Comments

VOICEOVER: “Previously, on twitter …”

Happy Friday

|￣￣￣￣￣￣￣￣￣￣￣|
Don't push
to production
|＿＿＿＿＿＿＿＿＿＿＿|
(___/) ||
(⌾ㅅ⌾) ||
/ 　づ

— Kelly Vaughn (@kvlly) April 12, 2019

So, that happened.

I hadn’t seen anyone say something like this in quite a while. I remember saying things like this myself as recently as, oh, 2016, but I thought the zeitgeist had moved on to continuous delivery.

Which is not to say that Friday freezes don’t happen anymore, or even that they shouldn’t; I just thought that this was no longer seen as a badge of responsibility and honor, rather a source of mild embarrassment. (Much like the fact that you still don’t automatedly restore your db backups and verify them every night. Do you.)

So I responded with an equally hyperbolic and indefensible claim:

If you're scared of pushing to production on Fridays, I recommend reassigning all your developer cycles off of feature development and onto your CI/CD process and observability tooling for as long as it takes to ✨fix that✨.

— Charity Majors (@mipsytipsy) April 15, 2019

Now obviously, OBVIOUSLY, reassigning all your developer cycles is probably a terrible idea. You don’t get 100x parallel efficiency if you put 100 developers on a single problem. So I thought it was clear that this said somewhat tongue in cheek, serious-but-not-really. I was wrong there too.

So let me explain.

There’s nothing morally “wrong” with Friday freezes. But it is a costly and cumbersome bandage for a problem that you would be better served to address directly. And if your stated goal is to protect people’s off hours, this strategy is likely to sabotage that goal and cause them to waste far more time and get woken up much more often, and it stunts your engineers’ technical development on top of that.

Fear is the mind-killer.

Fear of deploys is the ultimate technical debt. How much time does your company waste, between engineers:

waiting until it is “safe” to deploy,
batching up changes into bigger changes that are decidedly unsafe to deploy,
debugging broken deploys that had many changes batched into them,
waiting nervously to get paged after a deploy goes out,
figuring out if now is a good time to deploy or not,
cleaning up terrible deploy-related catastrophuckes

Anxiety related to deploys is the single largest source of technical debt in many, many orgs. Technical debt, lest we forget, is not the same as “bad code”. Tech debt hurts your people.

Saying “don’t push to production” is a code smell. Hearing it once a month at unpredictable intervals is concerning. Hearing it EVERY WEEK for an ENTIRE DAY OF THE WEEK should be a heartstopper alarm. If you’ve been living under this policy you may be numb to its horror, but just because you’re used to hearing it doesn’t make it any less noxious.

If you’re used to hearing it and saying it on a weekly basis, you are afraid of your deploys and you should fix that.

It’s a smell. If you can’t deploy at 6pm on a Friday it means you don’t understand or don’t trust your systems or process. That might be ok if you openly acknowledge it, but if your not talking about it, that’s dangerous

— Erik Peterson (@silvexis) April 16, 2019

If you are a software company, shipping code is your heartbeat. Shipping code should be as reliable and sturdy and fast and unremarkable as possible, because this is the drumbeat by which value gets delivered to your org.

Deploys are the heartbeat of your company.

Every time your production pipeline stops, it is a heart attack. It should not be ok to go around nonchalantly telling people to halt the lifeblood of their systems based on something as pedestrian as the day of the week.

Why are you afraid to push to prod? Usually it boils down to one or more factors:

your deploys frequently break, and require manual intervention just to get to a good state
your test coverage is not good, your monitoring checks are not good, so you rely on users to report problems back to you and this trickles in over days
recovering from deploys gone bad can regularly cause everything to grind to a halt for hours or days while you recover, so you don’t want to even embark on a deploy without 24 hours of work day ahead of you
your deploys are painfully slow, and take hours to run tests and go live.

These are pretty darn good reasons. If this is the state you are in, I totally get why you don’t want to deploy on Fridays. So what are you doing to actively fix those states? How long do you think these emergency controls will be in effect?

The answers of “nothing” and “forever” are unacceptable. These are eminently fixable problems, and the amount of drag they create on your engineering team and ability to execute are the equivalent of five-alarm fires.

Fix. That. Take some cycles off product and fix your fucking deploy pipeline.

It is difficult to combine "you shouldn't do this because it is a symptom of a systemic problem, you should instead address the systemic problem" with "yes actually you should do it for as long as the systemic problem exists, because it is adaptive".

— Senior Oops Engineer (@ReinH) April 16, 2019

If you’ve been paying attention to the DORA report or Accelerate, you know that the way you address the problem of flaky deploys is NOT by slowing down or adding roadblocks and friction, but by shipping more QUICKLY.

Science says: ship fast, ship often.

Deploy on every commit. Smaller, coherent changesets transform into debuggable, understandable deploys. If we’ve learned anything from recent research, it’s that velocity of deploys and lowered error rates are not in tension with each other, they actually reinforce each other. When one gets better, the other does too.

So by slowing down or batching up or pausing your deploys, you are materially contributing to the worsening of your own overall state.

If you block devs from merging on Fridays, then you are sacrificing a fifth of your velocity and overall output. That’s a lot of fucking output.

If you do not block merges on Fridays, and only block deploys, you are queueing up a bunch of changes to all get shipped days later, long after the engineers wrote the code and have forgotten half of the context. Any problems you encounter will be MUCH harder to debug on Monday in a muddled blob of changes than they would have been just shipping crisply, one at a time on Friday. Is it worth sacrificing your entire Monday? Monday-Tuesday? Monday-Tuesday-Wednesday?

The worst, most PTSD-inducing outages of my life have all happened after holiday code freezes. Every. Single. One.

Don't use rules. Practice good judgment, build tools to align incentives and deploy often; practice tolerating and recovering from pedestrian failures often too. https://t.co/GH7lef274z

— Charity Majors (@mipsytipsy) April 17, 2019

Good judgment matters more than rules.

I am not saying that you should make a habit of shipping a large feature at 4:55 pm on Friday and then sauntering out the door at 5. For fucks sake. Every engineer needs to learn and practice good technical judgment around deploy hygiene. LIke,

Don’t ship before you walk out the door on *any* day.

Don’t ship big, gnarly features right before the weekend, if you aren’t going to be around to watch them.
Instrument your code, and go and LOOK at the damn thing once it’s live.
Use feature flags and other tools that separate turning on code paths from deploys.

But you don’t need rules for this; in fact, rules actually inhibit the development of good judgment!

Policies (and enumerated exceptions to policies, and exceptions to exceptions) are a piss-poor substitute for judgment. Rules are blunt instruments that stunt your engineers' development and critical thinking skills.

Allow me to illustrate. https://t.co/H8KgT7bRfU

— Charity Majors (@mipsytipsy) April 17, 2019

Most deploy-related problems are readily obvious, if the person who has the context for the change in their heads goes and looks at it.

But if you aren’t looking for them, then sure — you probably won’t find out until user reports start to trickle in over the next few days.

So go and LOOK.

Stop shipping blind. Actually LOOK at what you ship.

I mean, if it takes 48 hours for a bug to show up, then maybe you better freeze deploys on Thursdays too, just to be safe! 🙄

I get why this seems obvious and tempting. The “safety” of nodeploy Friday is realized immediately, while the costs are felt later later. They’re felt when you lose Monday (and Tuesday) to debugging the big blob deplly. Or they get amortized out over time. Or you experience them as sluggish ship rates and a general culture of fear and avoidance, or learned helplessness, and the broad acceptance of fucked up situations as “normal”.

If pushing to production is a painful event, make it a *non-event* (same philosophy as chaos engineering)

— Paul Sec.jpeg .exe (@PaulWebSec) April 16, 2019

But if recovering from deploys is long and painful and hard, then you should fix that. If you don’t tend to detect reliability events until long after the event, you should fix that. If people are regularly getting paged on Saturdays and Sundays, they are probably getting paged throughout the night, too. You should fix that.

On call paging events should be extremely rare. There’s no excuse for on call being something that significantly impacts a person’s life on the regular. None.

I’m not saying that every place is perfect, or that every company can run like a tech startup. I am saying that deploy tooling is systematically underinvested in, and we abuse people far too much by paging them incessantly and running them ragged, because we don’t actually believe it can be any better.

It can. If you work towards it.

Devote some real engineering hours to your deploy pipeline, and some real creativity to your processes, and someday you too can lift the Friday ban on deploys and relieve your oncall from burnout and increase your overall velocity and productivity.

On virtue signaling

Finally, I heard from a alarming number of people who admitted that Friday deploy bans were useless or counterproductive, but they supported them anyway as a purely symbolic gesture to show that they supported work/life balance.

This makes me really sad. I’m … glad they want to support work/life balance, but surely we can come up with some other gestures that don’t work directly counter to their goals of life/work balance.

That's it. Because if you make it a virtue signal, it will NEVER GET FIXED. Blocking Friday deploy is not a mark of moral virtue; it is a physical bash script patching over technical rot.

And technical rot is bad because it HURTS PEOPLE. It is in your interest to fix it.

— Charity Majors (@mipsytipsy) April 16, 2019

Recovery: building a healthy deploy culture

Ways to begin recovering from a toxic deploy culture:

Have a deploy philosophy, make sure everybody knows what it is. Be consistent.
Build and deploy on every set of committed changes. Do not batch up multiple people’s commits into a deploy.
Train every engineer so they can run their own deploys, if they aren’t fully automated. Make every engineer responsible for their own deploys.
(Work towards fully automated deploys.)
Every deploy should be owned by the developer who made the changes that are rolling out. Page the person who committed the change that triggered the deploy, not whoever is oncall.
Set expectations around what “ownership” means. Provide observability tooling so they can break down by build id and compare the last known stable deploy with the one rolling out.
Never accept a diff if there’s no explanation for the question, “how will you know when this code breaks? how will you know if the deploy is not behaving as planned?” Instrument every commit so you can answer this question in production.
Shipping software and running tests should be fast. Super fast. Minutes, tops.
It should be muscle memory for every developer to check up on their deploy and see if it is behaving as expected, and if anything else looks “weird”.
Practice good deploy hygiene using feature flags. Decouple deploys from feature releases. Empower support and other teams to flip flags without involving engineers.

Each deploy should be owned by the developer who made the code changes. But your deploy pipeline needs to have a team that owns it too. I recommend putting your most experienced, senior developers on this problem to signal its high value.

You can find more tips for boring deploys in my piece on why shipping software should not be scary.

Good teams ship often.

Ultimately, I am not dogmatic about Friday deploys. Truly, I’m not. If that’s the only lever you have to protect your time, use it. But call it and treat it like the hack it is. It’s a gross workaround, not an ideal state.

Don’t let your people settle into the idea that it’s some kind of moral stance instead of a butt-ugly hack. Because if you do you will never, ever get rid of it.

There are plenty of good reasons to block deploys on Fridays, but it's not a good policy to cargo cult blindly. It imposes real costs and ultimately hinders you from achieving safe, boring deploys.

Good night.

— Charity Majors (@mipsytipsy) April 17, 2019

Remember: a team’s maturity and efficiency can be represented by how long it takes to get their shit into users’ hands after they write it. Ship it fast, while it’s still fresh in your developers’ heads. Ship one change set at a time, so you can swiftly debug and revert them. I promise your lives will be so much better. Every step helps. <3

charity.

Outsource Your O11y: Now Roll It Out And Keep Them Happy (part 3/3)

February 13, 2019March 7, 2023 mipsytipsyaws, monitoring, operations, outsourcing, security, vendors1 Comment

This is part three of a three-part series of guest posts:

How To Be A Champion, on how to choose a third-party vendor and champion them successfully to your security team. (George Chamales)
Get Aligned With Security, how to work with your security team to find the best possible outcome for all sides (Lilly Ryan)
Now Roll It Out And Keep Them Happy, on how to operationalize your service by rolling out the integration and maintaining it — and the relationship with your security team — over the long run (Andy Isaacson)

All this pain will someday be worth it. 🙏❤️ charity + friends

“Now Roll It Out And Keep Them Happy”

This is the third in a series of blog posts; previously we analyzed the security challenges of using a third party service, and we worked together with the security team to build empathy to deliver the project. You might want to read those first, since we are going to build on a lot of the ideas there to ship and maintain this integration.

Ready for launch

You’ve convinced the security team and other stakeholders, you’ve gotten the integration running, you’re getting promising results from dev-test or staging environments… now it’s time to move from proof-of-concept to full implementation. Depending on your situation this might be a transition from staging to production, or it might mean increasing a feature flipper flag from 5% to 100%, or it might mean increasing coverage of an integration from one API endpoint to cover your entire developer footprint.

Taking into account Murphy’s Law, we expect that some things will go wrong during the rollout. Perhaps during coverage, a developer realizes that the schema designed to handle the app’s event mechanism can’t represent a scenario, requiring a redesign or a hacky solution. Or perhaps the metrics dashboard shows elevated error rates from the API frontend, and while there’s no smoking gun, the ops oncall decides to rollback the integration Just In Case it’s causing the incident.

This gives us another chance to practice empathy — while it’s easy, wearing the champion hat, to dismiss any issues found by looking for someone to blame, ultimately this poisons trust within your organization and will hamper success. It’s more effective, in the long run (and often even in the short run), to find common ground with your peers in other disciplines and teams, and work through to solutions that satisfy everybody.

Keeping the lights on

In all likelihood as integration succeeds, the team will rapidly develop experts and expertise, as well as idiomatic ways to use the product. Let the experts surprise you; folks you might not expect can step up when given a chance. Expertise flourishes when given guidance and goals; as the team becomes comfortable with the integration, explicitly recognize a leader or point person for each vendor relationship. Having one person explicitly responsible for a relationship lets them pay attention to those vendor emails, updates, and avoid the tragedy of the “but I thought *you* were” commons. This Integration Lead is also a center of knowledge transfer for your organization — they won’t know everything or help every user come up to speed, but they can help empower the local power users in each team to ramp up their teams on the integration.

As comfort grows you will start to consider ways to change your usage, for example growing into new kinds of data. This is a good time to revisit that security checklist — does the change increase PII exposure to your vendor? Would the new data lead to additional requirements such as per-field encryption? Don’t let these security concerns block you from gaining valuable insight using the new tool, but do take the chance to talk it over with your security experts as appropriate.

Throughout this organic growth, the Integration Lead remains core to managing your changing profile of usage of the vendor they shepherd; as new categories of data are added to the integration, the Lead has responsibility to ensure that the vendor relationship and risk profile are well matched to the needs that the new usage (and presumably, business value) is placing on the relationship.

Documenting the Intergation Lead role and responsibilities is critical. The team should know when to check in, and writing it down helps it happen. When new code has a security implication, or a new use case potentially amplifies the cost of an integration, bringing the domain expert in will avoid unhappy surprises. Knowing how to find out who to bring in, and when to bring them in, will keep your team getting the right eyes on their changes.

Security threats and other challenges change over time, too. Collaborating with your security team so that they know what systems are in use helps your team take note of new information that is relevant to your business. A simple example is noting when your vendors publish a breach announcement, but more complex examples happen too — your vendor transitions cloud providers from AWS to Azure and the security team gets an alert about unexpected data flows from your production cluster; with transparency and trust such events become part of a routine process rather than an emergency.

It’s all operational

Monitoring and alerting is a fact of operations life, and this has to include vendor integrations (even when the vendor integration is a monitoring product.) All of your operations best practices are needed here — keep your alerts clean and actionable so that you don’t develop pager fatigue, and monitor performance of the integration so that you don’t get blindsided by a creeping latency monster in your APIs.

Authentication and authorization are changing as the threat landscape evolves and industry moves from SMS verification codes to U2F/WebAuthn. Does your vendor support your SSO integration? If they can’t support the same SSO that you use everywhere else and can’t add it — or worse, look confused when you mention SSO — that’s probably a sign you should consider a different vendor.

A beautiful sunset

Have a plan beforehand for what needs to be done should you stop using the service. Got any mobile apps that depend on APIs that will go away or start returning permission errors? Be sure to test these scenarios ahead of time.

What happens at contract termination to data stored on the service? Do you need to explicitly delete data when ceasing use?

Do you need to remove integrations from your systems before ending the commercial relationship, or can the technical shutdown and business shutdown run in parallel?

In all likelihood these are contingency plans that will never be needed, and they don’t need to be fully fleshed out to start, but a little bit of forethought can avoid unpleasant surprises.

Year after year

Industry best practice and common sense dictate that you should revisit the security questionnaire annually (if not more frequently). Use this chance to take stock of the last year and check in — are you getting value from the service? What has changed in your business needs and the competitive landscape?

It’s entirely possible that a new year brings new challenges, which could make your current vendor even more valuable (time to negotiate a better contract rate!) or could mean you’d do better with a competing service. Has the vendor gone through any major changes? They might have new offerings that suit your needs well, or they may have pivoted away from the features you need.

Check in with your friends on the security team as well; standards evolve, and last year’s sufficient solution might not be good enough for new requirements.

Andy thinks out loud about security, society, and the problems with computers on Twitter.

❤️ Thanks so much reading, folks. Please feel free to drop any complaints, comments, or additional tips to us in the comments, or direct them to me on twitter.

Have fun! Stay (a little bit) Paranoid!!

— charity