My (hypothetical) SRECon26 keynote (xpost)

One year ago, Fred Hebert and I delivered the closing keynote at SRECon25. Looking back on it now, I can hardly connect with how I felt then. Here’s what I’d say one year later.

Crossposted from here. 

Hey, it’s almost time for SRECon 2026! (I can’t go, but YOU really should!)

Which means it was almost a year ago that Fred Hebert and I were up on stage, delivering the closing keynote1 at SRECon25.

We argued that SREs should get involved and skill up on generative AI tools and techniques, instead of being naysayers and peanut gallerians. You can get a feel for the overall vibe from the description:

It’s easy to be cynical when there’s this much hype and easy money flying around, but generative AI is not a fad; it’s here to stay.

Which means that even operators and cynics — no, especially operators and cynics — need to get off the sidelines and engage with it. How should responsible, forward-looking SREs evaluate the truth claims being made in the market without being reflexively antagonistic?

Yep, that was our big pitch. Don’t be reflexively antagonistic. You should learn AI so that your critiques will land with credibility.

That is not the message I would give today, if I were keynoting SRECon26.

I came out of a hole, and the world had changed

I’ve been in a bit of a hole for the past few months, trying to get the second edition of “Observability Engineering” written and shipped.

Maybe the hole is why this feels so abrupt and discontinuous to me. Or maybe it’s just having such a clear artifact of my views one year ago. I don’t know.

What I do know is that one year ago, I still thought of generative AI as one more really big integration or use case we had to support, whether we liked it or not. Like AI was a slop-happy toddler gone mad in our codebase, and our sworn duty as SREs was to corral and control it, while trying not to be a total dick about it.

Today, it’s very clear to me that the center of gravity has shifted from cloud/automation workflows to AI/generation workflows, and that the agentic revolution has only just begun. That toddler is heading off to school. With a loaded gun.

When the facts change, I change my mind

I don’t know exactly when that bit flipped in my head; I only know that it did. And as soon as it did, I felt like the last person on earth to catch on. I can barely connect with my own views from eleven months ago.

Were my views unreasonably pessimistic? Was I willfully ignoring credible evidence in early 2025?

Hmm, perhaps. But Silicon Valley hype trains have not exactly covered themselves in glory in recent years. VR/AR, crypto/web3/NFTs, wearable tech, the Metaverse, 3D printing, the sharing economy…this is not an illustrious string of wins.2

Cloud computing, on the other hand: genuinely huge. So was the Internet. Sometimes the hype train brings you internets, sometimes the hype train brings you tulips.

So no, I don’t think it was obvious in early 2025 that AI-generated code would soon grow out of its slop phase. Skepticism was reasonable for a time, and then it was not. I know a lot of technologists who flipped the same bit at some point in 2025.

The keynote I would give today

If I were giving the keynote at SRECon 2026, I would ditch the begrudging stance. I would start by acknowledging that AI is radically changing the way we build software. It’s here, it’s happening, and it is coming for us all.

1 — This is happening

It is very, very hard to adjust to change that is being forced on you. So please don’t wait for it to be forced on you. Swim out to meet it. Find your way in, find something to get excited about.

As Adam Jacob recently advised,

“If you’re an engineer or an operations person, there is only one move. You have to start working in this new way as much as you can. If you can’t do it at work, do it at home. You want to be on the frontier of this change, because the career risk to being a laggard is incredibly high.” — Adam Jacob

This AI shit is not hard. The early days of any technology are the simplest, and this technology more than most. Conquer the brain weasels in your head by learning the truth of this for yourself.

2 — Know thyself

At a time of elevated uncertainty and anxiety, our natural human tendency to drift into confirmation bias and disconfirmation bias is higher than ever. Whatever proof you instinctively seek out, you are guaranteed to find.

The best advice I can give anyone is: know your nature, and lean against it.

  • If you are a reflexive naysayer or a pessimist, know that, and force yourself to find a way in to wonder, surprise and delight.
  • If you are an optimist who gets very excited and tends to assume that everything will improve: know that, and force yourself to mind real cautionary tales.

Try to keep your aperture wide, and remain open to possibilities you find uncomfortable. Curate the ocean you swim in. Puncture your bubble.

3 — Don’t panic

Don’t panic, and don’t give in to despair. The future isn’t written yet, and nobody knows what’s going to happen. I sure as hell don’t. Neither do you.

The fact that AI has radically changed the way we develop software in a very short time, and seems poised to change it much more in the next year or two, is real and undeniable.

This does not mean that everything else predicted by AI optimists will come to pass.

Extraordinary claims still require extraordinary evidence. AGI is, at present, an elaborate thought experiment, one that contradicts all the evidence we currently have about how technological breakthroughs typically yield enormous change in the early days, and then plateau.

We are all technologists now

Here’s another Adam quote I really like:

The bright side is that it’s a technology shift, not a manufacturing shift – meaning you still have to have technologists to do it.

I’ve written a number of blog posts over the years where I have advised people to go into the second half of their career thinking of themselves not as “engineers” or as “managers”, but as “technologists”.3

Every great technologist needs an arsenal of skills on top of their technical expertise. They need to understand how to navigate an organization, how to translate between the language of technology and the language of the business; how to wield influence and drive results across team, company, even industry lines.

These remain durable skills, in an era where good code can be generated practically for free.

This is the moment for pragmatists

Many people who love the art and craft of software are struggling in this moment, as the value of that craft is diminishing.

People who take a much more…functional…approach to software seem to be thriving in the present chaos. “Functional” describes most of the SREs I know, including myself.

After all, SREs have always been judged by outcomes — uptime, reliability, whether the thing kept running. An outcome orientation turns out to be excellent preparation for a world where the “how” of software is becoming less important than the what and the whether, across the board.

So maybe the advice we gave at SRECon wasn’t so bad after all. Especially this part:

Which means that even operators and cynics — no, especially operators and cynics — need to get off the sidelines and engage with it.

Who can build better guardrails for AI, than SREs and operators who have spent their entire careers building guardrails for software engineers and customers?

The industry needs us. But not begrudgingly, eyerollingly, pretending to get on board in order to slow things down from the inside. The industry needs our skills to help engineering teams go fast forever.

Don’t sit back and wait for change to reach you. Run towards the waves. It’s nice out here.

1 — Our talk was called “AIOps: Prove it! An Open Letter to Vendors Selling AI for SREs”. In retrospect, this was a terrible title. It was not an open letter to vendors at all; if anything, it was an open letter to SREs. It started out as one topic, but by the time the event rolled around, it had morphed into something entirely different. Ah well.

2 — I am not even listing the kooky religious shit like effective accelerationism, transhumanism, AI “alignment” or the Singularity, all of which has seeped into the water table around these parts.

3 — Omg, I have so many unwritten posts wriggling around in my brain right now on this topic.


First I wrote the wrong book, then I wrote the right book (xpost)

I’m not sure whether to say “thank you” or “HOW COULD YOU DO THIS TO ME”, but this one goes out to all the people who sent me advice on buying software last fall.

This is the second in a two-part episode. The first part ended on a ✨cliffhanger!!!✨ — so if you missed the first episode, catch up here:

Six long weeks of writer’s block

I was merrily cranking away at what I believed to be my last chapter when I asked the internet — YOU guys — for help the first time. “Are you an experienced software buyer? I could use some help,” went up on September 19th, 2025.

The response was overwhelming. I heard from software engineers, SREs, observability leads, CTOs, VPs, distinguished engineers, consultants, even the odd CISO. All these emails and responses and lengthy threads kept me busy for a while, but eventually I had to get back to writing. That’s when I discovered, to my unpleasant surprise, that I couldn’t seem to write anymore.

“Well,” I reasoned, “maybe I’ll just ask the internet for EVEN MORE advice” — and out popped Buffy-themed post number two, on October 13th.

Keep in mind, I thought I would be done by then. November was my stretch deadline, my “just in case, I’d better leave myself some breathing room” kind of deadline.

As November 1st came and went, my frustration began spiraling out into blind panic. What the hell is going on and why can I not finish this???

In which I finally listen to the advice I asked for

A week before Thanksgiving, I was up late tinkering with Claude. I imported all the emails and advice I had gotten from y’all, and started sorting into themes and picking out key quotes, and that is when it finally hit me: I had written the wrong thing.

No, this deserves a bigger font.

✨I wrote the wrong thing.✨

I wrote the wrong thing, for the wrong people, and none of it was going to move the needle in any meaningful way.

The chapters I had written were full of practical advice for observability engineering teams and platform engineering teams, wrestling with implementation challenges like instrumentation and cost overruns. Practical stuff.

Yes.

The internet was right (this ONE time)

My inbox, on the other hand, was overflowing with stories like these:

  • “Many times [competitive research] is faked. One person has their favorite option and then they do just enough ‘competitive analysis’ to convince the sourcing folks that due diligence was done or to nullify the CIO/CTO/whoever is accepting this on to their budget”
  • “We [the observability team] spent six months exhaustively trialing three different solutions before we made a decision. The CEO of one of the losing vendors called our CEO, and he overruled our decision without even telling us.” (Does your CEO know anything at all about engineering??) “No.”
  • “Our SRE teams have vetoed any attempt to modernize our tool stack. ($Vendor) is part of their identity, and since they would have to help roll out and support any changes, we are stuck living in 2015 apparently forever.” (What does management have to say?) “It’s been twenty years since they touched a line of code.”
  • “We’re weird in that most of the company hates technology and really hates that we have to pay for it since they don’t understand the value it brings to the company. This is intentional ignorance, we make the value props continually and well, we just haven’t succeeded yet….We’re a little obsessed with trying to get champagne quality at Boone’s prices.”
  • “When it comes to dealing with salespeople and the enterprise sales process, the best tip for engineers is to not anthropomorphize sales professionals who are driven by commission. The best ones are like robot lawn mowers dressed in furry unicorn costumes. They may seem cute and nice but they do not care about anything besides closing the next deal….All of the best SaaS companies are full of these friendly fake unicorn zombies who suck cash instead of blood.”

Nearly all of the emails I got were either describing a terminally fucked up buying process from the top down, or the long-term consequences of those fucked up decisions.

In other words: I was writing tactical advice for teams who were surviving in a strategic vacuum.

So I threw the whole thing out, and started over from scratch. 😭

Even good teams are struggling right now

As Tolstoy once wrote, “Happy teams are all alike; every fucked up team is fucked up in its own precious way.”

There is an infinity of ways to screw something up. But there is one pattern I see a critical mass of engineering orgs falling into right now, even orgs that are generally quite solid: no shared alignment, or even shared vocabulary, between engineering and its other stakeholders (directors, VPs and SVPs, the CTO, the CIO, principal and distinguished engineers) on some pretty clutch questions. Such as:

  • “What is observability?”
  • “Who needs it?”
  • “What problem are we trying to solve?”

And my favorite: “Is observability still relevant in a post-AI era? Can’t agents do that stuff now?”

Even some generally excellent CTOs[1] have been heard saying things like, “yeah, observability is definitely very important, but all our top priorities are related to AI right now.”

Which gets causality exactly backwards. Because your ability to get any returns on your investments into AI will be limited by how swiftly you can validate your changes and learn from them. Another word for this is “OBSERVABILITY”.

Enough ranting. Want a peek? I’ll share the new table of contents, and a sentence or two about a couple of my own favorite chapters.

Part 6: “Observability Governance” (v2)

The new outline is organized to speak to technical decision-makers, starting at the top and loosely descending. What do CTOs need to know? What do VPs and distinguished engineers need to know? And so on. We start off abstract, and become more concrete.

Since technical terms (e.g. high cardinality, high dimensionality, etc.) have become overloaded and undifferentiated by too much sales and marketing, we mostly avoid them. Instead, we use the language of systems and feedback loops.

Again, we are trying to help your most senior engineers and execs develop a shared understanding of “What problem are we solving?” and “What is our goal?” Technical terms can actually detract and distract from that shared understanding.

  1. An Open Letter to CTOs: Why Organizational Learning Speed is Now Your Biggest Constraint. Organizations used to be limited by the speed of delivery; now they are limited by how swiftly they can validate and understand what they delivered.
  2. Systems Thinking for Software Delivery. Observability is the signal that connects the dots to make a feedback loop; no observability, no loop. What happens to amplifying or balancing loops when that signal is lossy, laggy, or missing?
  3. The Observability Landscape Through a Systems Lens. What feedback loops do developers need, and what feedback loops does ops need? How do these map to the tools on the market?
  4. The Business Case for Observability. Is your observability a cost center or an investment? How should you quantify your ROI?
  5. Diagnosing Your Observability Investment
  6. The Organizational Shift
  7. Build vs Buy (vs Open Source)
  8. The Art and Science of Vendor Partnerships. Internal transformations run on trust and credibility; vendor partnerships run on trust and reciprocity. We’ll talk about both of these, as well as how to run a strong POC.
  9. Instrumentation for Observability Teams
  10. Where to Go From Here

Hey, I have a lot of empathy right now for leaders and execs who feel like they’re behind on everything. I feel it too. Anyone who doesn’t is lying to themselves (or their name is Simon Willison).

But the role observability plays in complex sociotechnical systems is one of those foundational concepts you need to understand. You’re not gonna get this right by accident. You’re not going to win by doing the same thing you were doing five years ago. And if you screw up your observability, you screw up everything downstream of it too.

To those of you who do understand this, and are working hard to drive change in your organizations: I see you. It is hard, often thankless work, but it is work worth doing. If I can ever be of help: reach out.

A longer book, but a better book

The last few chapters are heading into tech review on Friday, February 20th. Finally. The last 3.5 months have been some of the most panicky and stressful of my life. I… just typed several paragraphs about how terrible this has been, and deleted them, because you do not need to listen to me whine. ☺️

Like I said, I have never felt especially proud of the first edition. I am not UN proud, it’s just…eh. I feel differently this time around. I think—I hope—it can be helpful to a lot of different people who are wrestling with adapting to our new AI-native reality, from a lot of different angles.[2]

Thanks, Christine. (Another for the folder marked “NOW YOU TELL ME”)

I am incredibly grateful to my co-authors, collaborators, and our editor, Rita Fernando, without whom I never would have made it through.

But there’s one more group that deserves some credit, and it’s…you guys. I asked for help, and help I got. So many people wrote me such long, thought-provoking emails full of stories, advice and hard-earned wisdom. The better the email, the more I peppered you with followup questions, which is a great way to punish a good deed.

Blame these people

I am a tiny bit torn on whether to say “thank you” or “fuck you”, because my life would have been much nicer if I had stuck to the plan and wrapped in October.

But the following list of people were especially instrumental in forcing me to rethink my approach. It made the book much stronger, so if you catch one of them in the wild, please buy them a stiff drink. (Or buy yourself one, and throw it in their face with my sincere compliments.)

  • Abraham Ingersoll, the aforementioned “odd CISO”, who would be quoted in the book had his advice not been so consistently unprintable by the standards of respectable publications
  • Benjamin Mann of Delivery Hero, who I would work for in a heartbeat, and not just for the way he wields “NOPE” as a term of art
  • Marty Lindsay, who has spent more time explaining POCs and tech evals to me than anyone should have to. (If you need an o11y consultant, Marty should be your very first stop.)
  • Sam Dwyer, whose stories seeded my original plan to write a set of chapters for observability engineering teams. (I hope the replacement plan is useful too!)

Many others sent me terrific advice, and endured multiple rounds of questions and more questions and clarifications on said questions. A few of them:

Matthew Sanabria, Chris Cooney, Glen Mailer, Austin Culbertson, John Scancella, John Doran, Bryan Finster, Hazel Weakly, Chris Ziehr, Thomas Owens, Mike Lee, Jay Gengelbach, Will Hegedus, Natasha Litt, Alonso Suarez, Jason McMunn, Evgeny Rubtsov, George Chamales, Ken Finnegan, Cliff Snyder, Robyn Hirano, Rita Canavarro, Matt Schouten, Shalini Samudri Ananda Rao (Sam).

I am definitely forgetting some names; I will try to update the list as I remember them.

But seriously: thank you, from the bottom of my heart. I loved hearing your stories, your complaints, your arguments about how the world should improve. Your DNA is in this book; I hope it does you justice.

~charity
💜💙💚💛🧡❤️💖

 

[1] It’s ironic (and makes me uncomfortably self-conscious), but some of the worst top-down decision-making processes I have ever seen have come from companies where CEO and CTO are both former engineers. The confidence they have in their own technical acumen may be not wholly unfounded, but it is often ten or more years out of date. We gotta update those priors, my friends. Stay humble.

[2] On the other hand, as my co-founder, Christine Yen, informed me last week: “Nobody reads books anymore.”


Martin Fowler told me the second edition should be shorter (it’s twice as long) (xpost)

90% new material, a clearer mission, and a rogues gallery of contributors. The second edition of Observability Engineering is almost done.

Last week I got to meet Martin Fowler for the first time in person. This was an exciting moment for me. Martin ranks high on my personal pantheon; he is, so far as I can tell, hardly ever wrong.

Martin Fowler, me, and Nathen Harvey, at the happy hour for Gergely’s Pragmatic Summit! Not pictured: my glass of wine, Nathen’s box of fucks to give
After he let me know that he “only hugs on this side of the Rockies” (noted), the topic of writing books came up, and he drawled,

“The second edition should always be shorter. I always made my second editions shorter. Shorter books are better books.”

God dammit, Martin. Now you tell me. 🙈

In completely unrelated news, our last few chapters for “Observability Engineering, 2nd edition” go out to tech reviewers this week! This puts us on track for dead tree publication in June, although chapters will be available earlier for O’Reilly subscribers as well as behind an email gate on the Honeycomb site.

What’s different about the second edition?

Almost everything. The only chapters that carry over some material are the ones on sampling, Retriever (Honeycomb’s columnar store), and a smattering of the SLO stuff—maybe 10% all told? And we’ve added a monstrous amount of new material.

So no, it will not be shorter than the first edition. It is almost twice as long.[1] (Sorry!)

The first edition was a spaghetti mess

Books, as I understand, are like children; if you made them, you are not allowed to say you aren’t proud of them.

So fine, I won’t say it. But I think we can all privately agree that the first edition was a bit of a hot mess.

No shade on my wonderful co-authors, Liz and George, or our O’Reilly editors, or myself for that matter. We did our best, but now, with the clarity of hindsight, it’s easy to see all the ways the ground was shifting under us as we wrote.

When we started the book in 2018, Honeycomb was the only observability company, and our definition of observability—high cardinality, high dimensionality, explorability—was the only definition. By the time the book came out in 2021, everyone was rebranding their products as observability, Gartner had waded into the fray… it was a mess.

Perhaps the mature thing to do would be to have gone back and rewritten the book in light of the evolving market definition. But while I won’t speak for my co-authors, after 3.5 years, I was pretty fucking desperate to be done.

Artist’s rendering of the traditional authorial glow of pride, joy and deep satisfaction upon completing any book manuscript

I swore I would never go through that again. And when O’Reilly first approached us about writing a second edition, my first reaction was blind panic.

The second edition has a clearer mission

But once my lizard brain calmed down, I realized two things. Number one, it absolutely needed to be written; number two, I definitely wanted to help write it.

SO MUCH has changed. SO MUCH needs saying. When we met up in June to pull together a new outline, it seemed to just flow out of us.

A few of the many things that were not at all clear in 2018, but are crystal clear today:

  • Who we are writing for (software engineers)
  • What they are doing (instrumenting their code and analyzing it in production, with and without AI)
  • What observability means to analysts and the market at large (literally anything to do with software telemetry)
  • The integrations game is over, and OpenTelemetry has won
  • Most companies still don’t have real observability. And they don’t know it. 😕

I am excited and incredibly grateful for the opportunity to take a second whack at this book in the era of AI. Not how I thought I’d feel, but I will take it.

The first edition of “Observability Engineering” was translated into Japanese, Korean, and Chinese (I believe it’s Mandarin?).

We brought Austin Parker on as a fourth co-author very early, with a special emphasis on topics related to OpenTelemetry and AI.

We also invited a number of people we admire to contribute in a variety of formats… guest chapters, use cases, stories, embedded advice, and more:

  • Jeremy Morrell on how to instrument your code effectively
  • Hanson Ho and Matt Klein on observability for mobile and frontend
  • Kesha Mykhailov and Darragh Curran from Intercom on fast feedback loops and developing with LLMs
  • Dale McDiarmid on how to use Clickhouse for observability workloads
  • Rick Clark on the mechanics of driving organizational transformation in order to build and learn at the speed of AI
  • Frank Chen, a returning champion from our first edition, wrote about ontologies for your instrumentation chain
  • Phillip Carter wrote about eval pipelines and instrumenting LLMs
  • Mat Vine has a case study about moving ANZ from thresholds to SLIs/SLOs
  • Mike Kelly on managing telemetry pipelines for fun and profit
  • Hugo Santos on how to instrument your CI/CD pipelines
  • Peter Corless made our chapter on “Build vs Buy (vs Open Source)” immensely better and more well-rounded

What a fucking list, huh? 🙌

Truly, this book is a veritable rogues gallery of engineers and companies we look up to (including some of our own direct competitors 😉). The one thing all these people have in common (besides being great writers with a unique perspective, and people who are willing to return our emails) is that we share a similar vision for observability and the future of software development.

Spotted this week: Nathen Harvey, walking around, giving out fucks by the handful.

In addition to the sections written for software engineers on “Instrumentation Fundamentals” and “Analysis Workflows”, both with and without AI, we have a section on “Observability Use Cases” and another on “Technical Deep Dives”, which lets us cover even more ground.

Which brings us to the last section, the one that I personally signed up to write.

Part 6: “Observability Governance”

When we met in June, I successfully pitched Liz and George on adding one final section: “Observability Governance”. Unlike the rest of the book, these chapters would be written for the observability team, or the platform engineering team, or whoever is wrestling with problems like cost containment and tool migrations.

I sketched out a few ideas and started writing. July passed, August, September…I was cranking out one governance chapter per month, right on track, planning to wrap up well before November.

In September, halfway through my last chapter, I reached out to the internet for advice. “Are you an experienced software buyer? I could use some help.”

The response was ✨tremendous✨; my inbox swelled with interesting stories, bitter rants, lessons learned, and practical tips from engineers and executives alike.

But when I tried to finish the chapter, my engine stalled out. I could not write. I kept doggedly blocking off time on my calendar, silencing interruptions, staring at drafts, writing and rewriting, trying every angle. Four weeks passed with no progress made.

Five weeks. Six.

Cliffhanger!

Tomorrow I’ll publish the second half of this story, in which the due date for my chapters comes and goes, and I end up throwing away everything I had written and starting over from scratch. Good times!

 

[1] If we ever write a third edition, I swear on the lives of my theoretical children that it will be MUCH shorter than this one.


Bring Back Ops Pride (xpost)

Cross-posted from “Bring Back Ops Pride”

“Operations” is not a dirty word, a synonym for toil, or a title for people who can’t write code. May those who shit on ops get the operational outcomes they deserve.

I was planning to write something else today, but god dammit, I got nerd-sniped.

Last week I published a piece on the Honeycomb blog called “You Had One Job: Why Twenty Years of DevOps Has Failed to Do It.” Allow me to quote myself:

In retrospect, I think the entire DevOps movement was a mighty, twenty-year battle to achieve one thing: a single feedback loop connecting devs with prod.

On those grounds, it failed.

Not because software engineers weren’t good at their jobs, or didn’t care enough. It failed because the technology wasn’t good enough. The tools we gave them weren’t designed for this, so using them could easily double, triple, or quadruple the time it took to do their job: writing business logic.

This isn’t true everywhere. Please keep in mind that all data tools are effectively fungible if you can assume an infinite amount of time, money, and engineering skill. You can run production off an Excel spreadsheet if you have to, and some SREs have done so. That doesn’t make it a great solution, the right use of resources, or accessible to the median engineering org.

It’s a fun piece, if I do say so myself, and you should go read it. (Much stick art!)

I posted the link on LinkedIn yesterday morning. Comments, as ever, ensued.

(The internet was a mistake, Virginia.)

“Devs should own everything”

A number of commenters said things like, “devs should own everything”, “make every team responsible for their own devops work”, and my personal favorite:

“I still think the main problem is with the ownership model – the fact that devs don’t own the full system, including infra and ops.”

(Courtesy of Alex Pulver, who has graciously allowed me to quote him here by name, adding that “he stands firmly behind this 😂”.)

As it happens, I have been aggressively advocating for the model I believe my friend is describing here, where software developers are empowered (and expected) to own their code in production, for approximately the past decade. No argument!

But “devs should own the full system, including infra and ops”?

We need to talk.


I do not think ‘ops’ means what you think it means

In software—and only in software—ops has become a dirty word. Nobody wants to claim it. Operations teams got renamed to DevOps teams, SRE, infrastructure, production engineering, or more recently, platform engineering teams. Anything but ops.

Ops means Toil! Hashtag #NoOps!

Number one, this is fucking ridiculous.

What’s wrong with operations? Ops is not a synonym for toil; it literally means “get shit done as efficiently as possible”. Every function has an operational component at scale: business ops, marketing ops, sales ops, product ops, design ops, and everything else I could think of to search for. So far as I can tell, none of them are treated with anything like the disrespect, dismissal, and outright contempt that software engineering1 has chosen to heap upon its operational function.

Number two…what happened?

I can think of a number of contributing factors (APIs and cloud computing, soaring profit margins, etc), but I can also think of one easy, obvious, mustache-twirling villain, which would make a better story for you AND less work for me. (Root cause analysis wins again!! ✊)

Whose fault is it?

Google. It’s Google’s fault.

I know this, because I asked Google and it told me so.

(What? This is a free substack, not science.)

Here’s what I think happened. I think Google came out swinging, saying traditional operations could not scale and needed to become more like software engineering, and it was exactly the right message at exactly the right time, because cloud computing, APIs, SaaSes and so forth were just coming online and making it possible to manage systems more programmatically.

So far so good. But a crucial distinction got lost in translation, when we started defining this as developers (people who write code: good) vs operators (people who do things by hand: bad), which is what set us on the slippery slope to where we are today, where the entire business-critical function of operations engineering is widely seen as backwards and incompetent.

Dev vs Ops is a separation of concerns

The difference between “dev” and “ops” is not about whether or not you can write code. Dude, it’s 2026: everyone writes software.

The difference between dev and ops is a separation of concerns.

Dev and ops responsibilities.

If your concern is building new features and products to attract customers and generate new revenue, then congrats: you’re a dev. (But you knew that.)

If your concern is building core services and protecting their ability to serve customers in the face of any and all threats (including, at the top of the list, your own developers): congratulations slash I’m sorry, but you are, in fact, in ops.

Both of these functional concerns are vital, as in “you literally can’t survive without them”, and complementary. You need product developers to be focused on building features and products, caring deeply about the experience of each user, and looking for ways to add value to the business. You need operations to provide a resilient, scalable, efficient base for those products to run on.

The hardest technical problems are found in ops

Ops is not “toil”. It does not mean “dummies who can’t program good”. Operations engineering is not easier or lesser than writing software to build products and features.

What’s darkly ironic is that, if anything, the opposite is true.

Product engineering is typically much simpler than infrastructure engineering—in part, of course, because one of the key functions of operations is to make it as easy as possible to build and ship products. Operations absorbs the toughest technical problems and provides a surface layer for product development that is simple, reliable, and easy to navigate.

Not because product engineers are dumb or lesser than (let’s not slip into that trap2 again!), but because cognitive bandwidth is the scarcest resource in any engineering org, and you want as much of that as possible going towards things that move the business materially forward, instead of wrestling with the messy underbelly.

The hardest technical challenges and the long, stubborn tail of intractable problems have always been on the infrastructure side. That’s why we work so hard to try not to have them—to solve them by partnerships, cloud computing, open source, etc. Anything is better than trying to build them again, starting over from scratch. We know the cost of new code in our bones.3

As I have said a thousand times: the closer you get to laying bits down on disk, the more conservative (and afraid) you should be.

The closer you get to user interaction, the more okay it is to get experimental, let AI take a shot, YOLO this puppy.

This is as it should be.

The cost and pain of developing software is approximately zero compared to the operational cost of maintaining it over time.

Domain level differences

The difference between dev and ops isn’t about writing code or not. But there are differences. In perspective, priorities, and (often) temperament.

I touched on a number of these in the article I just wrote on feedback loops, so I’m not going to repeat myself here.

The biggest difference I did not mention is that dev and ops have different relationships with resources, and different definitions of success.

Infrastructure is a cost center. You aren’t going to make more money if you give ten laptops to everyone in your company, and you aren’t going to make more money by over-spending on infrastructure, either. Great operations engineers and architects never forget that cost is a first class citizen of their engineering decisions.

You can, in theory, make more money by spending more on product engineering. This is what we refer to as an “investment”, although sometimes it seems to mean “engineers who forget their time costs money”.

(Sorry, that was rude.)

What about platform engineering?

“What about platform engineering?” Baby, that’s ops in dressup.

A bit less flippantly: I like my friend Abby Bangser’s quote: “platforms should encode things that are unique to your business but common to your teams”, and I like Jack Danger’s stick art, and his observation that “The only thing that naturally draws engineers to look at the middle of their system is pure blinding rage.”

What I love about the platform engineering movement is that it has brought design thinking and product development practices to the operational domain.

Yes, we should absolutely be treating our product developers like customers, and thinking critically about the interfaces we give them. Yes, there is a middle layer between infrastructure and product engineering, with patterns and footguns of its very own.

Also yes: from a functional perspective, platform engineering is still ops. (Or at least, more ops than not.)

Yes, you need an ops team. If you have hard operational problems. You should try not to have hard operational problems. (from a talk I gave back in 2015)

Does it matter what we call it?

Yeah, I kinda think it does.

All these trendy naming schemes do not change the core value of operations, which is to consolidate and efficiently serve the revenue-generating parts of the business.4 This is as true in technology as it is in sales or marketing. Running away from the term and denying your purpose muddies the water and causes confusion at the exact point where clarity is most needed.

An engineering team needs to know if they are oriented towards efficiency or investment. It changes how you hire, how you build, how you think about success and measure progress. It changes not only your appetite for risk, but what counts as a risk in the first place. You can’t optimize for both at once.

They also need to know whether they are responsible for the business logic or the platform it runs on.

Why? Because no one can do everything. Telling devs to own their code is one thing. (Great.) Asking them to own their code and the entire technological iceberg beneath it is wholly another. The more surface area you ask someone to master and attend to, the less focus you can expect from them in any given place. Do you want your revenue-generating teams generating revenue, or not?

If you can’t separate these concerns at the moment, maybe that’s something to work towards. Which is going to be hard to do, if we can’t talk about the function of operations without half the room running away and the remaining half squawking “toil!”

Naming is a form of respect

Operational rigor and excellence are not, how shall I say this…not yet something you can take for granted in the tech industry. The most striking thing about the 2025 DORA report was that the majority of companies report that AI is just adding more chaos to a system already defined by chaos. In other words, most companies are bad at ops.

To some extent, this is because the problems are hard. To a larger extent, I think it’s the cause (and result) of our wholesale abandonment of operations as a term of pride.

It’s another fucking feedback loop. Ambitious young engineers get the message that being associated with ops is bad, so they run away from those teams. Managers and execs want to recruit great talent and make jobs sound enticing, so they adopt trendy naming schemes to make it clear this work is not ops.

If you want to do something well, historically speaking, this is not the way. The way to build excellence is to name it for what it is, build communities of practice, raise the bar, and compensate for a job well done.

Or as one prognosticator said, way back in 2016:

I think it’s time to bring back “operations” as a term of pride. As a thing that is valued, and rewarded.

“Operations” comes with baggage, no doubt. But I just don’t think that distance and denial are an effective approach for making something better, let alone trash talking and devaluing the skill sets that you need to deliver quality services.

You don’t make operational outcomes magically better by renaming the team “DevOps” or “SRE” or anything else. You make it better by naming it and claiming it for what it is, and helping everyone understand how their role relates to your operational objectives.

Wow. Truly, I couldn’t have said it better myself.

Footnotes

1

Not to get all Jungian on you all, but part of me has to wonder if “ops” represents the shadow self to software engineers, the parts of yourself that you hate and despise and are most insecure about (I am weak and bad at coding!! I might be automated out of a job!), and thus need to project onto some externalized other that can be safely loathed from a distance.

2

This might surprise you youngsters, but there was a time when systems folks were clearly the cool kids and developers were considered rather dim. Devs had to know data structures and algorithms, but sysadmins had to know everything. These things tend to come and go in cycles, so we may as well not shit on each other, eh?

3

As my friend Peter van Hardenberg likes to say, “The best code is no code at all. The second best code is code someone else writes and maintains for you. The worst code is the code you have to write and maintain yourself.” If it would fit on my knuckles, I would get this in knuckle tatts.

4

Would it be helpful to acknowledge that IT/ops serves an entirely different function than software engineering operations for production systems? Because it absolutely does.

Bring Back Ops Pride (xpost)

On Friday Deploys: Sometimes that Puppy Needs Murdering (xpost)

(Cross posted from its original source)

‘Tis the season that one of my favorite blog posts gets pulled out and put in rotation, much like “White Christmas” on the radio station. I’m speaking, of course, of “Friday Deploy Freezes are Exactly Like Murdering Puppies” (old link on WP).

This feels like as good a time as any to note that I am not as much of an extremist as people seem to think I am when it comes to Friday deploys, or deploy freezes in general.

(Sometimes I wonder why people think I’m such an extremist, and then I remember that I did write a post about murdering puppies. Ok, ok. Point taken.)

Take this recent thread from LinkedIn, where Michael Davis posted an endorsement of my Puppies article along with his own thoughts on holiday code freezes, followed by a number of smart, thoughtful comments on why this isn’t actually attainable for everyone. Payam Azadi talks about an “icing” and “defrosting” period where you ease into and out of deploy freezes (never heard of this, I like it!), and a few other highly knowledgeable folks chime in with their own war stories and cautionary tales.

It’s a great thread, with lots of great points. I recommend reading it. I agree with all of them!!

For the record, I do not believe that everyone should get rid of deploy freezes, on Fridays or otherwise.

If you do not have the ability to move swiftly with confidence (which in practice means “you can generally find problems in your new code before your customers do,” which generally comes down to the quality and usability of your observability tooling and your ability to explore high-cardinality dimensions in real time, something most teams do not have), then deploy freezes before a holiday or a big event, or hell, even weekends, are probably the sensible thing to do.

If you can’t do the “right” thing, you find a workaround. This is what we do, as engineers and operators.

Deploy freezes are a hack, not a virtue

Look, you know your systems better than I do. If you say you need to freeze deploys, I believe you.

Honestly, I feel like I’ve always been fairly pragmatic about this. The one thing that does get my knickers in a twist is when people adopt a holier-than-thou posture towards their Friday deploy freezes. Like they’re doing it because they Care About People and it’s the Right Thing To Do, some sort of grand moral gesture. Dude, it’s a fucking hack. Just admit it.

It’s the best you can do with the hand you’ve been dealt, and there’s no shame in that! That is ALL I’m saying. Don’t pat yourself on the back; just act a little sheepish, and I am so with you.

I think we can have nice things

I think there’s a lot of wisdom in saying “hey, it’s the holidays, this is not the time to be rushing new shit out the door absent some specific forcing function, alright?”

My favorite time of year to be at work (back when I worked in an office) was always the holidays. It was so quiet and peaceful, the place was empty, my calendar was clear, and I could switch gears and work on completely different things, out of the critical line of fire. I feel like I often peaked creatively during those last few weeks of the year.

I believe we can have the best of both worlds: a yearly period of peace and stability, with a relatively low rate of change, and we can evade the high-stakes peril of locks and freezes and terrifying January recoveries.

How? Two things.

Don’t freeze deploys. Freeze merges.

To a developer, ideally, the act of merging their changes back to main and those changes being deployed to production should feel like one singular atomic action, the faster the better, the less variance the better. You merge, it goes right out. You don’t want it to go out, you better not merge.

The worst of both worlds is when you let devs keep merging diffs, checking items off their todo lists, closing out tasks, for days or weeks. All these changes build up like a snowdrift over a pile of grenades. You aren’t going to find the grenades til you plow into the snowdrift on January 5th, and then you’ll find them with your face. Congrats!

If you want to freeze deploys, freeze merges. Let people work on other things. I assure you, there is plenty of other valuable work to be done.
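A merge freeze can be as simple as a gate in CI that refuses to pass while the window is open. Here is a minimal sketch (the function name and the freeze dates are invented for illustration, not from any real CI system):

```python
from datetime import date

def merge_frozen(today: date, start: date, end: date) -> bool:
    """True if merges to main should be blocked on `today` (inclusive window)."""
    return start <= today <= end

# Hypothetical holiday freeze window: Dec 21 through Jan 2.
start, end = date(2026, 12, 21), date(2027, 1, 2)

print(merge_frozen(date(2026, 12, 25), start, end))  # True: inside the window
print(merge_frozen(date(2026, 6, 1), start, end))    # False: outside the window
```

In practice you would wire this into whatever blocks a pull request from merging (a required status check, branch protection, etc.) rather than into the deploy pipeline itself.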

Don’t freeze deploys unless your goal is to test deploy freezes

The second thing is a corollary. Don’t actually freeze deploys, unless your SREs and on call folks are bored and sitting around together, going “wouldn’t this be a great opportunity to test for memory leaks and other systemic issues that we don’t know about due to the frequency and regularity of our deploys?”

If that’s you, godspeed! Park that deploy engine and sit on the hood, let’s see what happens!

People always remember the outages and instability that we trigger with our actions. We tend to forget about the outages and instability we trigger with our inaction. But if you’re used to deploying every day, or many times a day: first, good for you. Second, I bet you a bottle of whiskey that something’s gonna break if you go for two weeks without deploying.

I bet you the good shit. Top shelf. 🥃

This one is so easy to mitigate, too. Just run the deploy process every day or two, but don’t ship new code out.
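That mitigation can be a tiny scheduled job: run the real deploy pipeline, but pin it to whatever artifact is already live. A sketch with placeholder functions (get_live_version and deploy stand in for your own tooling; none of these names come from a real system):

```python
# A "no-op deploy" on a timer: exercise the whole deploy path without
# shipping anything new, so the machinery itself stays proven.

def get_live_version() -> str:
    # Placeholder: ask your deploy system what is currently in production.
    return "app-v2025.12.19-abc123"

def deploy(version: str) -> str:
    # Placeholder: run the normal pipeline (drain, restart, health checks).
    return f"deployed {version}"

def noop_deploy() -> str:
    # Redeploy the exact artifact that is already live.
    return deploy(get_live_version())

print(noop_deploy())  # deployed app-v2025.12.19-abc123
```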

Alright. Time for me to go fly to my sister’s house. Happy holidays everyone! May your pagers be silent and your bellies be full, and may no one in your family or friend group mention politics this year!

💜💙💚💛🧡❤️💖
charity

Me and Bubba and Miss Pinky Persnickety

P.S. The title is hyperbole! I was frustrated! I felt like people were intentionally misrepresenting my point and my beliefs, so I leaned into it. Please remember that I grew up on a farm and we ended up eating most of our animals. Possibly I am still adjusting to civilized life. Also, I have two cats and I love them very much and have not eaten either of them yet.


2025 was for AI what 2010 was for cloud (xpost)

The satellite, experimental technology has become the mainstream, foundational tech. (At least in developer tools.) (xposted from new home)

I was at my very first job, Linden Lab, when EC2 and S3 came out in 2006. We were running Second Life out of three datacenters, where we racked and stacked all the servers ourselves. At the time, we were tangling with a slightly embarrassing data problem in that there was no real way for users to delete objects (the Trash folder was just another folder), and by the time we implemented a delete function, our ability to run garbage collection couldn’t keep up with the rate of asset creation. In desperation, we spun up an experimental project to try using S3 as our asset store. Maybe we could make this Amazon’s problem and buy ourselves some time?

Why yes, we could. Other “experimental” projects sprouted up like weeds: rebuilding server images in the cloud, running tests, storing backups, load testing, dev workstations. Everybody had shit they wanted to do that exceeded our supply of datacenter resources.

By 2010, the center of gravity had shifted. Instead of “mainstream engineering” (datacenters) and “experimental” (cloud), there was “mainstream engineering” (cloud) and “legacy, shut it all down” (datacenters).

Why am I talking about the good old days? Because I have a gray beard and I like to stroke it, child. (Rude.)

And also: it was just eight months ago that Fred Hebert and I were delivering the closing keynote at SRECon. The title is “AIOps: Prove It! An Open Letter to Vendors Selling AI for SREs”, which makes it sound like we’re talking to vendors, but we’re not; we’re talking to our fellow SREs, begging them to engage with AI on the grounds that it’s not ALL hype.

We’re saying to a room of professional technological pessimists that AI needs them to engage. That their realism and attention to risk is more important than ever, but in order for their critique to be relevant and accurate and be heard, it has to be grounded in expertise and knowledge. Nobody cares about the person outside taking potshots.

This talk recently came up in conversation, and it made me realize—with a bit of a shock—how far my position has come since then.

That was just eight months ago, and AI still felt like it was somehow separable, or a satellite of tech mainstream. People would gripe about conferences stacking the lineup with AI sessions, and AI getting shoehorned into every keynote.

I get it. I too love to complain about technology, and this is certainly an industry that has seen its share of hype trains: dotcom, cloud, crypto, blockchain, IoT, web3, metaverse, and on and on. I understand why people are cynical—why some are even actively looking for reasons to believe it’s a mirage.

But for me, this year was for AI what 2010 was for the cloud: the year when AI stopped being satellite, experimental tech and started being the mainstream, foundational technology. At least in the world of developer tools.

It doesn’t mean there isn’t a bubble. Of COURSE there’s a fucking bubble. Cloud was a bubble. The internet was a bubble. Every massive new driver of innovation has come with its own frothy hype wave.

But the existence of froth doesn’t disprove the existence of value.

Maybe y’all have already gotten there, and I’m the laggard. 😉 (Hey, it’s an SRE’s job to mind the rear guard.) But I’m here now, and I’m excited. It’s an exciting time to be a builder.

Hello World (xpost from substack)

I recently posted a short note about moving from WordPress to Substack after ten years on WP. A number of people replied, commented, or DM’d me to express their dismay, along the lines of “why are you supporting Nazis?”. A few begged me to reconsider.

So I did. I paused the work I was doing on migration and setup, and I paused the post I was drafting on Substack. I read the LeaveSubstack site, and talked with its author (thank you, Sean 💜). I had a number of conversations with people I consider experts in content creation, and people I consider stakeholders (my coworkers and customers), as well as my own personal Jiminy Cricket, Liz Fong-Jones. I also slept on it.

I’ve decided to stay.

I said I would share my thinking once I made a decision, and it comes down to this: I have a job to do, and I haven’t been doing it.

I have not been doing my job 💔

I’ve gone increasingly dark on social media over the past few years, and while this has been delightful from a personal perspective, I have developed an uncomfortable conviction that I have also been abdicating a core function of my job in doing so.

The world of software is changing—fast. It’s exciting. But it is not enough to have interesting ideas and say them once, or write them down in a single book. You need to be out there mixing it up with the community every day, or at least every week. You need to be experimenting with what language works for people, what lands, what sparks a light in people’s eyes.

You (by which I mean me) also need to be listening more—reading and interacting with other people’s thoughts, volleying back and forth, polishing each other like diamonds.

How many times did we define observability or high cardinality or the sins of aggregation? Cool. How many times have we talked about the ways that AI has made the Honeycomb vision technologically realizable for the first time? Uh, less, by an order of thousands.

Write more, engage with mainstream tech

My primary goal is to get back into the mainstream of technical discussion and mix it up a lot more. Unfortunately, to the extent there is a tech mainstream, it still exists on X. I am not ruling out the possibility of returning, but I would strongly prefer not to. I’m going to see if I can do my job by being much more active on LinkedIn and Substack.

My secondary goal is to remove friction and barriers to posting. WordPress just feels so heavyweight. Like I’m trying to craft a web page, not write a quick post. Substack feels more like writing an email. I’ve been trying to make myself post more all year on WP, and it hasn’t happened. I have a lot of shit backed up to talk about, and I think this will help grease the wheels.

There are platforms that are beyond the pale, that exist solely to platform and support Nazis and violent extremists—your Gabs, your Parlers. Substack is very far from being one of those. All of these content platforms exist on some continuum of grey, and governance is hard, hard, hard in an era of mainstreaming right wing extremism.

Substack may not make all the decisions I would make, but I feel like it is a light dove grey, all things considered.

Some mitigations

I have received some tips and done some research on how to minimize the value of my writing to Substack. Here they are.

  • Substack makes money from paid subscriptions, so I don’t accept money. Ever.
  • I am told that if you use email or RSS, it benefits Substack less than if you use the app. RSS feed here.
  • I will set up an auto-poster from Substack to WordPress (at some point… probably whenever I find the time to fix the url rewriter and change domain pointer)

I hope these will allow conscientious objectors to continue to read and engage with my work, but I also understand if not.

A vegan friend of mine once used an especially vivid metaphor to indignantly tell us why no, he could NOT just pick the meat and dairy off his plate and eat the vegetables and grains left behind (they were not cooked together). He said, “If somebody shit on your plate, would you just pick the shit off and keep eating?”

So. If Substack is the shit in your social media plate, and you feel morally obligated to reject anything that has ever so much as touched the domain, I can respect that.

Everyone has to decide which battles are theirs to fight. This one is not mine.

💜💙💚💛🧡❤️💖,
charity.

Moving from WordPress to Substack

Well, shit.

I wrote my first blog post in this space on December 27th, 2015 — almost exactly a decade ago.

“Hello, world.”

I had just left Facebook, hadn’t yet formally incorporated Honeycomb, and it just felt like it was time, long past time for me to put something up and start writing.

Ten years later, it feels long past time for me to do something else. I despise WP (who doesn’t?), and there’s so much friction in getting a post out that I just don’t do it. Plus, it’s clear that to the extent that there is a vibrant ecosystem of tech writers in longform, it’s on substack. I miss tech twitter, always will. Time to give substack a try.

I’ve been working on the second edition of “Observability Engineering” for much of this year, and I have learned SO MUCH in the writing process. As soon as these rough drafts have all been turned in, I will be streaming my thoughts out via substack. They are burning a hole in my brain the longer I hold them in.

Housekeeping notes:

  • I’ve tried to export the email subscribers and import them into substack, but it’s held up for manual review. I don’t know what that means.
  • I won’t be able to export and bring along the comments you folks have left over the years. I’m sorry. 🙁
  • I am going to leave charity.wtf pointed here for the foreseeable future, even tho I am working to port over the corpus of posts. I don’t want to break anyone’s bookmarks or article links, so I’ll leave it up here until / unless I find a solution.

If you want to go subscribe, I’m at charitydotwtf.substack.com. Here’s the blurb:

Thank you all for a wonderful 10 years together. WordPress may be a piece of shit, but the community I’ve found here has been anything but. I hope to see you on the other side.

💜💙💚💛🧡❤️💖

charity

 


From Cloudwashing to O11ywashing

I was just watching a panel on observability, with a handful of industry executives and experts who shall remain nameless and hopefully duly obscured—their identities are not the point, the point is that this is a mainstream view among engineering executives and my head is exploding.

Scene: the moderator asked a fairly banal moderator-esque question about how happy and/or disappointed each exec has been with their observability investments.

One executive said that as far as traditional observability tools are concerned (“are there faults in our systems?”), that stuff “generally works well.”

However, what they really care about is observing the quality of their product from the customer’s perspective. EACH customer’s perspective.

Nines don’t matter if users aren’t happy

“Did you know,” he mused, “that there are LOTS of things that can interrupt service or damage customer experience that won’t impact your nines of availability?”

(I begin screaming helplessly into my monitor.)

“You could have a dependency hiccup,” he continued, oblivious to my distress. “There could be an issue with rendering latency in your mobile app. All kinds of things.”

(I look down and realize that I am literally wearing this shirt.)

He finishes with, “And that is why we have invested in our own custom solution to measure key workflows through startup, payment, and success.”

(I have exploded. Pieces of my head now litter this office while my headless corpse types on and on.)

It’s twenty fucking twenty five. How have we come to this point?

 

Observability is now a billion dollar market for a meaningless term

My friends, I have failed you.

It is hard not to register this as a colossal fucking failure on a personal level when a group of modern, high performing tech execs and experts can all sit around a table nodding their heads at the idea that “traditional observability” is about whether your systems are UP👆 or DOWN👇, and that the idea of observing the quality of service from each customer’s perspective remains unsolved! unexplored! a problem any modern company needs to write custom tooling from scratch to solve. 

This guy is literally describing the original definition of observability, and he doesn’t even know it. He doesn’t know it so hard that he went and built his own thing.

You guys know this, right? When he says “traditional observability tools”, he means monitoring tools. He means the whole three fucking pillars model: metrics, logging, and tracing, all separate things. As he notes, these traditional tools are entirely capable of delivering on basic operational outcomes (are we up, down, happy, sad?). They can DO this. They are VERY GOOD tools if that is your goal.

But they are not capable of solving the problem he wants to solve, because that would require combining app, business, and system telemetry in a unified way. Data that is traceable, but not just tracing. With the ability to slice and dice by any customer ID, site location, device ID, blah blah. Whatever shall we call THAT technological innovation, when someone invents it? Schmobservability, perhaps?
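What he is describing, of course, is the wide structured event: one record per unit of work, with app, business, and system fields sitting side by side so any one of them can be a dimension you slice by. A sketch, with all field names and values invented for illustration:

```python
import json
import time

# One wide event per request. App, business, and system telemetry live in
# the same record, so any field (customer ID, device, region, whatever) is
# a dimension you can group and filter by, however high its cardinality.
event = {
    "timestamp": time.time(),
    "service": "checkout",            # system
    "endpoint": "/cart/submit",
    "duration_ms": 243,
    "status_code": 200,
    "customer_id": "cus_8f3a91",      # business, high-cardinality
    "plan": "enterprise",
    "region": "eu-west-1",
    "device_id": "ios17-iphone15",    # app
    "render_latency_ms": 112,
    "payment_provider": "stripe",
}
print(json.dumps(event, sort_keys=True))
```

The point is not this particular schema; it is that one unified record can answer both “are we up?” and “is THIS customer’s checkout slow on THIS device?”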

So anyway, “traditional observability” is now part of the mainstream vernacular. Fuck. What are we going to do about it? What CAN be done about it?

From cloudwashing to o11ywashing

I learned a new term yesterday: cloudwashing. I learned this from Rick Clark, who tells a hilarious story about the time IBM got so wound up in the enthusiasm for cloud computing that they reclassified their Z series mainframe as “cloud” back in 2008. 

(Even more hilarious: asking Google about the precipitating event, and following the LLM down a decade-long wormhole of incredibly defensive posturing from the IBM marketing department and their paid foot soldiers in tech media about how this always gets held up as an example of peak cloudwashing but it was NOT AT ALL cloudwashing due to being an extension of the Z/Series Mainframe rather than a REPLACEMENT of the Z/Series Mainframe, and did you know that Mainframes are bigger business and more relevant today than ever before?)

(Sorry, but I lost a whole afternoon to this nonsense, I had to bring you along for the ride.)

Rick says the same thing is happening right now with observability. And of course it is. It’s too big of a problem, with too big a budget: an irresistible target. It’s not just the legacy behemoths anymore. Any vendor that does anything remotely connected to telemetry is busy painting on a fresh coat of o11ywashing. From a marketing perspective, it would be irresponsible not to.

How to push back on *-washing

Anyway, here are the key takeaways from my weekend research into cloudwashing.

  1. This o11ywashing problem isn’t going away. It is only going to get bigger, because the problem keeps getting bigger, because the traditional vendors aren’t solving it, because they can’t solve it.

  2. The Gartners of the world will help users sort this out someday, maybe, but only after we win. We can’t expect them to alienate multibillion dollar companies in the pursuit of technical truth, justice and the American Way. If we ever want to see “Industry Experts” pitching in to help users spot o11ywashing, as they eventually did with cloudwashing (see exhibit A), we first need to win in the market.
    Exhibit A: “How to Spot Cloudwashing”

  3. And (this is the only one that really matters) we have to do a better job of telling this story to engineering executives, not just engineers. Results and outcomes, not data structures and algorithms.

    (I don’t want to make this sound like an epiphany we JUST had…we’ve been working hard on this for a couple years now, and it’s starting to pay off. But it was a powerful confirmation.)

Talking to execs is different than talking to engineers

When Christine and I started Honeycomb, nearly ten years ago, we were innocent, doe-eyed engineers who truly believed on some level that if we just explained the technical details of cardinality and dimensionality clearly and patiently enough to the world, enough times, the consequences to the business would become obvious to everyone involved.

It has now been ten years since I was a hands-on engineer every day (say it again, like pressing on a bruise makes it hurt less), and I would say I’ve been a decently functioning exec for about the last three or four of those years. 

What I’ve learned in that time has actually given me a lot of empathy for the different stresses and pressures that execs are under. 

I wouldn’t say it’s less or more than the stresses of being an SRE on call for some of the world’s biggest databases, but it is a deeply and utterly different kind of stress, the kind of stress less expiable via fine whiskey and poor life choices. (You just wake up in the morning with a hangover, and the existential awareness of your responsibilities looming larger than ever.)

This is a systems problem, not an operational one

There is a lot of noise in the field, and executives are trying to make good decisions that satisfy all parties and constraints amidst the unprecedented stress-panic-opportunity-terror of AI changing everything. That takes storytelling skills and sales discipline on our part, in addition to technical excellence.

Companies are dumping more and more and more money into their so-called observability tools, and not getting any closer to a solution. Nor will they, so long as they keep thinking about observability in terms of operational outcomes (and buying operational tools). Observability is a systems problem. It’s the most powerful lever in your arsenal when it comes to disrupting software doom spirals and turning them into positive feedback loops. Or it should be.

As Fred Hebert might say, it’s great you’re so good at firefighting, but maybe it’s time to go read the city fire codes.

Execs don’t know what they don’t know, because we haven’t been speaking to them. But we’re starting to.

What will be the next term that gets invented and coopted in the search to solve this problem?

Where to start, with a project so big? Google’s AI says that “experts suggest looking for specific features to identify true cloud observability solutions versus cloudwashed (read: o11ywashed) ones.”

I guess this is as good a place to start as any: If your “observability” tooling doesn’t help you understand the quality of your product from the customer’s perspective, EACH customer’s perspective, it isn’t fucking observability.

It’s just monitoring dressed up in marketing dollars.

Call it o11ywashing.

From Cloudwashing to O11ywashing

How many pillars of observability can you fit on the head of a pin?

My day started off with an innocent question, from an innocent soul.

“Hey Charity, is profiling a pillar?”

I hadn’t even had my coffee yet.

“Someone was just telling me that profiling is the fourth pillar of observability now. I said I think profiling is a great tool, but I don’t know if it quite rises to the level of pillar. What do you think?”

What….do.. I think.

What I think is, there are no pillars. I think the pillars are a fucking lie, dude. I think the language of pillars does a lot of work to keep good engineers trapped inside a mental model from the 1980s, paying outrageous sums of money for tooling that can’t keep up with the chaos and complexity of modern systems.

Here is a list of things I have recently heard people refer to as the “fourth pillar of observability”:

  • Profiling
  • Tokens (as in LLMs)
  • Errors, exceptions
  • Analytics
  • Cost

Is it a pillar, is it not a pillar? Are they all pillars? How many pillars are there?? How many pillars CAN there be? Gaahhh!

This is not a new argument. Take this ranty little tweet thread of mine from way back in 2018, for starters.


Or perhaps you have heard of TEMPLE: Traces, Events, Metrics, Profiles, Logs, and Exceptions?

Or the “braid” of observability data, or “They Aren’t Pillars, They’re Lenses”, or the Lightstep version: “Three Pillars, Zero Answers” (that title is a personal favorite).

Alright, alright. Yes, this has been going on for a long time. I’m older now and I’m tireder now, so here’s how I’ll sum it up.

Pillar is a marketing term.
Signal is a technical term.

So “is profiling a pillar?” is a valid question, but it’s not a technical question. It’s a question about the marketing claims being made by a given company. Some companies are building a profiling product right now, so yes, to them, it is vitally important to establish profiling as a “pillar” of observability, because you can charge a hell of a lot more for a “pillar” than you can charge for a mere “feature”. And more power to them. But it doesn’t mean anything from a technical point of view.

On the other hand, “signal” is absolutely a technical term. The OpenTelemetry Signals documentation, which I consider canon, says that OTel currently supports Traces, Metrics, Logs, and Baggage as signal types, with Events and Profiles at the proposal/development stage. So yes, profiling is a type of signal.

The OTel docs define a telemetry signal as “a type of data transmitted remotely for monitoring and analysis”, and they define a pillar as … oh, they don’t even mention pillars? like at all??

I guess there’s your answer.

And this is probably where I should end my piece. (Why am I still typing…. 🤔)

Pillars vs signals

First of all, I want to stress that it does not bother me when engineers go around talking about pillars. Nobody needs to look at me guiltily and apologize for using the term ‘pillar’ at the bar after a conference because they think I’m mad at them. I am not the language police, it is not my job to go around enforcing correct use of technical terms. (I used to, I know, and I’m sorry! 😆)

When engineers talk about pillars of observability, they’re just talking about signals and signal types, and “pillar” is a perfectly acceptable colloquialism for “signal”.

When a vendor starts talking about pillars, though — as in the example above! — it means they are gearing up to sell you something: another type of signal, siloed off from all the other signals you send them. Your cost multiplier is about to increment again, and then they’re going to start talking about how Important it is that you buy a product for each and every one of the Pillars they happen to have.

As a refresher: there are two basic architecture models used by observability companies, the multiple pillars model and the unified storage model (aka o11y 2.0). The multiple pillars model is to store every type of signal in a different siloed storage location — metrics, logs, traces, profiling, exceptions, etc, everybody gets a database! The unified storage model is to store all signals together in ONE database, preserving context and relationships, so you can treat data like data: slice and dice, zoom in, zoom out, etc.

Most of the industry giants were built using the pillars model, but Honeycomb (and every other observability company founded post-2019) was built on the unified storage model: wide, structured log events on a columnar storage engine with high cardinality support, and so on.

Bunny-hopping from pillar to pillar

When you use each signal type as a standalone pillar, this leads to an experience I think of as “bunny products” 🐇 where the user is always hopping from pillar to pillar. You see something on your metrics dashboard that looks scary? hop-hop to your logs and try to find it there, using grep and search and matching by timestamps. If you can find the right logs, then you need to trace it, so you hop-hop-hop to your traces and repeat your search there. With profiling as a pillar, maybe you can hop over to that dataset too.🐇🐰

The amount of data duplication involved in this model is mind boggling. You are literally storing the same information in your metrics TSDB as you are in your logs and your traces, just formatted differently. (I never miss an opportunity to link to Jeremy Morrell’s masterful doc on instrumenting your code for wide events, which also happens to illustrate this nicely.) This is insanely expensive. Every request that enters your system gets stored how many times, in how many signals? Count it up; that’s your cost multiplier.
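To make “count it up” concrete, here’s a toy back-of-the-envelope calculation. Every per-signal count below is invented for illustration; plug in your own numbers:

```python
# Toy cost-multiplier arithmetic for ONE request stored under a
# multiple-pillars model. Every count here is invented for illustration.
copies_per_request = {
    "metrics": 3,      # request counter, latency histogram, error counter
    "logs": 4,         # access log line plus a few application log lines
    "trace_spans": 5,  # spans emitted along the request path
}

multiplier = sum(copies_per_request.values())
print(f"one request -> stored {multiplier} times across "
      f"{len(copies_per_request)} siloed datastores")
```

Twelve stored copies of one request, across three separately-billed datastores, before anyone has asked a single question of the data.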

Worse, much of the data that connects each “pillar” exists only in the heads of the most senior engineers, so they can guess or intuit their way around the system, but anyone who relies on actual data is screwed. Some vendors have added an ability to construct little rickety bridges post hoc between pillars, e.g. “this metric is derived from this value in this log line or trace”, but now you’re paying for each of those little bridges in addition to each place you store the data (and it goes without saying, you can only do this for things you can predict or hook up in the first place).

The multiple pillars model (formerly known as observability 1.0) relies on you believing that each signal type must be stored separately and treated differently. That’s what the pillars language is there to reinforce. Is it a Pillar or not?? It doesn’t matter because pillars don’t exist. Just know that if your vendor is calling it a Pillar, you are definitely going to have to Pay for it. 😉

Zooming in and out

But all this data is just.. data. There is no good reason to silo signals off from each other, and lots of good reasons not to. You can derive metrics from rich, structured data blobs, or append your metrics to wide, structured log events. You can add span IDs and visualize them as a trace. The unified storage model (“o11y 2.0”) says you should store your data once, and do all the signal processing in the collection or analysis stages. Like civilized folks.
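A minimal sketch of that idea, with invented field names and values: each request is stored once as a wide event, and “metrics” and “traces” fall out as queries over the same rows:

```python
import statistics

# Each request is stored ONCE as a wide, structured event.
# All field names and values are invented for illustration.
events = [
    {"trace_id": "a1", "span_id": "s1", "service": "api",
     "status": 200, "duration_ms": 12.0},
    {"trace_id": "a1", "span_id": "s2", "service": "auth",
     "status": 200, "duration_ms": 7.5},
    {"trace_id": "b2", "span_id": "s3", "service": "api",
     "status": 500, "duration_ms": 230.0},
]

# A "metric" is just an aggregate derived from the same events...
api_latencies = [e["duration_ms"] for e in events if e["service"] == "api"]
print("api median latency:", statistics.median(api_latencies))
print("error rate:", sum(e["status"] >= 500 for e in events) / len(events))

# ...and a "trace" is just the same events grouped by trace_id.
trace_a1 = [e["span_id"] for e in events if e["trace_id"] == "a1"]
print("spans in trace a1:", trace_a1)
```

Same rows, three “signals”, zero duplication. Real columnar engines do this at scale, but the shape of the query is the same.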

All along, Anya was right

From the perspective of the developer, not much changes. It just gets easier (a LOT easier), because nobody is harping on you about whether this nit of data should be a metric, a log, a trace, or all of the above, or if it’s low cardinality or high cardinality, or whether the cardinality of the data COULD someday blow up, or whether it’s a counter, a gauge, a heatmap, or some other type of metric, or when the counter is going to get reset, or whether your heatmap buckets are defined at useful intervals, or…or…

Instead, it’s just a blob of json. Structured data. If you think it might be interesting to you someday, you dump it in, and if not, you don’t. That’s all. Cognitive load drops way down.

On the backend side, we store it once, retaining all the signal type information and connective tissue.

It’s the user interface where things change most dramatically. No more bunny hopping around from pillar to pillar, guessing and copy-pasting IDs and crossing your fingers. Instead, it works more like the zoom function on PDFs or Google Maps.

You start with SLOs, maybe, or a familiar-looking metrics dashboard. But instead of hopping, you just.. zoom in. The SLOs and metrics are derived from the data you need to debug with, so you’re just like.. “Ah what’s my SLO violation about? Oh, it’s because of these events.” Want to trace one of them? Just click on it. No hopping, no guessing, no pasting IDs around, no lining up time stamps.

Zoom in, zoom out, it’s all connected. Same fucking data.

“But OpenTelemetry FORCES you to use three pillars”

There’s a misconception out there that OpenTelemetry is very pro-three pillars, and very anti o11y 2.0. This is a) not true and b) actually the opposite. Austin Parker has written a voluminous amount of material explaining that actually, under the hood, OTel treats everything like one big wide structured event log.

As Austin puts it, “OpenTelemetry, fundamentally, unifies telemetry signals through shared, distributed context.” However:

“The project doesn’t require you to do this. Each signal is usable more or less independently of the other. If you want to use OpenTelemetry data to feed a traditional ‘three pillars’ system where your data is stored in different places, with different query semantics, you can. Heck, quite a few very successful observability tools let you do that today!”

“This isn’t just ‘three pillars but with some standards on top,’ it’s a radical departure from the traditional ‘log everything and let god sort it out’ approach that’s driven observability practices over the past couple of decades.”

You can use OTel to reinforce a three pillars mindset, but you don’t have to. Most vendors have chosen to implement three pillarsy crap on top of it, which you can’t really hold OTel responsible for. One[1] might even argue that OTel is doing as much as it can to influence you in the opposite direction, while still meeting Pillaristas where they’re at.

A postscript on profiling

What will profiling mean in a unified storage world? It just means you’ll be able to zoom in to even finer and lower-level resolution, down to syscalls and kernel operations instead of function calls. Like when Google Maps got good enough that you could read license plates instead of just rooftops.

Admittedly, we don’t have profiling yet at Honeycomb. When we did some research into the profiling space, what we learned was that most of the people who think they’re in desperate need of a profiling tool are actually in need of a good tracing tool. Either they didn’t have distributed tracing or their tracing tools just weren’t cutting it, for reasons that are not germane in a Honeycomb tracing world.

We’ll get to profiling, hopefully in the near-ish future, but for the most part, if you don’t need syscall level data, you probably don’t need profiling data either. Just good traces.

Also… I did not make this site or have any say whatsoever in the building of it, but I did sign the manifesto[2] and every day that I remember it exists is a day I delight in the joy and fullness of being alive: kill3pill.com 📈

Kill Three Pillars

Hop hop, little friends,
~charity

 

[1] Austin argues this. I’m talking about Austin, in case that wasn’t clear.
[2] Thank you, John Gallagher!!


Got opinions on observability? I could use your help (once more, with feeling)

Last month I dropped a desperate little plea for help in this space, asking people to email me any good advice and/or strong opinions they happened to have on the topic of buying software.

I wasn’t really sure what to expect — desperate times, desperate measures — but holy crap, you guys delivered. To the many people who took the time to write up your experiences and expertise for me, and suffer through rounds of questions and drafts: ✨thank you✨. And thank you, too, to those of you who forwarded my queries along to experts in your network and asked for help on my behalf.

I learned a LOT about buying software and managing vendor relationships in the process of writing this. Honestly, this chapter is shaping up to be one of the things I’m most excited about for the second edition of the book.

Why I’m excited about the software buying chapter (& you should be too)

I’m imagining you reading this with a skeptical expression and an arched eyebrow. “Really, Charity…‘how to buy software’ doesn’t exactly suggest peak engineering prowess.”

Au contraire, my friends. I’ve come to believe that vendor engineering is one of the subtlest and most powerful practical applications of deep subject matter expertise, and some of the highest leverage work an engineer can do. How often do you get to make decisions that leverage the labor of hundreds or thousands of engineers per year, for fractions of pennies on the dollar? How many of the decisions you make will have an impact on every single engineer you work with and their ability to do their jobs well, as well as the experience of every single customer?

If you think I’m hyperventilating a bit, nah; this is entry level shit. In the book, I tell the story of the best engineer I ever worked with, and how I watched him alter the trajectory of multiple other companies, none of which he was working for, buying from, or formally connected to in any way — in the space of a few conversations. It upended my entire worldview about what it can look like for an engineer to wield great power.

Doing this stuff well takes both technical depth and technical breadth, in addition to systems thinking and knowledge of the business. It is one of the only ways a staff+ engineer can acquire and develop executive-level communication, strategy, and execution skills while remaining an individual contributor.

I’ve been wanting to write about this for YEARS. Anyway — ergh! — I’m rambling now. That was not what I came here to talk about, I’m just excited. Back to the point.

My second (and final) round of questions

I got so much out of your thoughtful responses that I thought I’d press my luck and put a few more questions out to the universe, before it’s too late.

These questions speak to areas where I worry that my writing may be a little weak or uninformed, or too far away from the world where people are using the “three pillars” model (aka multiple pillars or o11y 1.0) and happy about it. I don’t know many (any??) of those people, which suggests some pretty heavy selection bias.

I don’t expect anyone to answer all the questions; if one or two resonate with you, write about those and ignore the rest. If there’s something I didn’t ask that I should have asked, answer that. Something I’ve written in the past that bugged you that you hope I won’t say again? Tell me! We are almost out of time ⌛ so gimme what you got. 🙌

On migrations:

📈 Have you ever migrated from one observability vendor to another? If so, what did you learn? What was the hardest part, what took you by surprise? What do you wish you could go back in time and tell yourself at the start?

📈 If you ran (or were involved in) a large scale migration or tool change… how did you structure the process? Like, was it team by team, service by service, product by product? Did you have a playbook? What did you do to make it fun or push through organizational inertia? How long did it take?

On managing costs for the traditional three pillars:

📈 For orgs that are using Datadog, Grafana, Chronosphere, or another traditional three pillars architecture: how would you describe your approach to cutting and controlling costs? Pro tips and/or comprehensive strategy.

📈 Alternately, if there are particular blog posts with advice you have followed and can personally vouch for, would you send me a link?

📈 How do you guide your software engineers on which data to send to which place — metrics, logs, traces, errors/exceptions, profiling, etc? How do you manage cardinality? How do you work to keep the pillars in sync, or are there any particular tips and tricks you have for linking / jumping between the data sources?

📈 How many ongoing engineering cycles does it take to manage and maintain costs, once you’ve gotten them to a sustainable place?

On managing costs at massive scale:

(Especially for people who work at a large enterprise, the kind with multiple business units, but others welcome too!):

  • Do you use tiers of service for managing costs? How do you define those?
  • How do new tools get taken for a spin? (Like, sometimes there is an office of the CTO with carte blanche to try new things and evaluate them for the rest of the org)
  • How do you use telemetry pipelines?

Observability teams (quick poll):

📈 If you have an observability team, how big is it? What part of the org does it report up into? Roughly how many engineers does that team support?

📈 If you don’t have an observability team — and you have more than, say, 300 engineers — who owns observability? Platform? SRE? Other?

A grab bag:

📈 Build vs Buy: If you built your own observability tool(s)…. What were the reasons? What does it do? Would you make the same decision today?

📈 OpenTelemetry: If your team has weighed the pros and cons of adopting OTel and ultimately decided not to, for technical or philosophical reasons (i.e. not just “we’re too busy”) — what are those reasons?

📈 Instrumentation: what do you do to try and remove cognitive overhead for engineers? How much have you been able to make automatic and magical, and where has the magic failed?

📈 Consolidation: I would love to hear any thoughts on tool consolidation vs tool proliferation. Is this primarily driven by execs, or do technical users care too? Is it driven by cost concerns, usability, or something else?

edited on 2025-10-15 to add… oh crap, one last question:

📈 Open source: Are you using open source observability tools, and if so, are these your primary tools or one piece of a comprehensive tooling strategy? If the latter, could you describe that strategy for me?

Send it to me in an email

Please send me your opinions or answers in an email, to my first name at honeycomb dot io, with the subject line “Observability questions”.

If I end up cribbing from your material, is it okay for me to print your name? (As in, “thanks to the people who informed my thinking on this subject, abc xyz etc”). I will not mention your employer or where you work, don’t worry.

If you send it to me more than a week from now, I probably won’t be able to use it. Augh, I wish I had thought of this in JUNE!!! #ragrets

✨THANK YOU✨

I know this is an incredibly time consuming thing to ask of someone, and I can’t express how much I appreciate your help.

P.S. Yes, the title is absolutely a reference to the Buffy musical. Hey, I had to give you guys something fun to read along with my second bleg in less than a month (do people still say “bleg”??).

6 Musical Episodes of TV Shows That Deserve an Encore

P.P.S. Grammar quiz of the day: should my title read “opinions ABOUT observability” or “opinions ON observability” ??

GREAT QUESTION — and, as it turns out, the preposition you choose may reveal more than you realized.

“About” is used to introduce a topic or subject in a broad, vague, or approximate sense, while “on” is used to signal more detailed, specific, formal or serious subject matter (as well as physical objects). “Let’s talk about dinner” vs “she delivered a lecture on why AI is trying to kill babies.”

Or as Xander says, “To read makes our English speaking good.”

The earth is doomed,
~charity


Are you an experienced software buyer? I could use some help.

If it seems like I’ve been relatively quiet lately on social media and my blog, that’s because I have. Liz, Austin, George and I have been busy toiling away on the second edition of “Observability Engineering” ever since April or May. I personally have been trying to spend 75-80% of my time on the book since May.

Have I been successful in that attempt? No. But I’m trying. Progress is being made. Hopefully just a few more weeks of drafting and we’ll be on to edits, and into your grubby little paws by May-ish.

The world has changed A LOT since we wrote the first edition, in 2019-2022. Do you know, the phrase “observability engineering teams” doesn’t even occur in the first edition of the book? Try and search — it can’t be found! Even the phrase “observability teams” doesn’t pop up til near the end, and when it does, we are referring to those few teams that choose to build their own observability tools from scratch.

These days, observability engineering teams are everywhere. Which is why we are adding a whole new section, a sizable one, called “Observability Governance.” The governance section will have a bunch of chapters on topics like how to staff these teams, where they should fit in the org chart, how to buy good tools, how to integrate them, how to manage costs, how to make the business case up the chain to senior execs, how to manage schemas and semantic conventions at scale, and much much more.

The problem

The problem is, I’ve never really bought software. Not like this. I’ve never even worked at a truly large, software-buying enterprise tech company. So I am not well equipped to give good advice on questions like:

  • How do you shop around for options?
  • What are some signs you may need to suck it up and change vendors?
  • What does a good POC (proof of concept) look like?
  • Who are your stakeholders? What are their concerns?
  • How do you drive consensus when millions of dollars (and the work experience of thousands of engineers) are on the line? What does ‘consensus’ even mean in that context?
  • What primary considerations should you take into account when making a decision? What are secondary considerations?

I’m looking for the kind of advice that a principal engineer who has done this many times might give a staff engineer who is doing it for the first time. Or that a VP who’s done this many times might give a director who is doing it for the first time.

Can you help?

This is me wearing Leia buns and projecting a unicorn-shaped rainbow bat signal out into the sky for help. Do you have any advice for me? What guidance would you give to the readers of the second edition of this book?

Please send your advice to me in an email, addressed to my first name at honeycomb dot io, with the subject line: “Buying Software”. Include any relevant context about how large the company or engineering org is, and what your role in purchasing was.

I may respond with more questions, or reply and ask if you are able to talk synchronously. But I will not quote anything you send me without first asking your permission and getting a signed release. I will not mention ANY vendors by name, good or bad.

I am not fishing for honeycomb customers or buyers, I assume most of you haven’t tried honeycomb and don’t care about it and that is fine. This is not a Honeycomb project, this is an O’Reilly writing project. I just want to gather up some good advice on buying software and funnel it back out to good engineers.

Can you help? Your industry needs you! <3


How We Migrated the Parse API From Ruby to Golang (Resurrected)

I wrote a lot of blog posts over my time at Parse, but they all evaporated after Facebook killed the product. Most of them I didn’t care about (there were, ahem, a lot of “service reliability updates”), but I was mad about losing one specific piece, a deceptively casual retrospective of the grueling, murderous two-year rewrite of our entire API from Ruby on Rails to Golang.

I could have sworn I’d looked for it before, but someone asked me a question about migrations this morning, which spurred me to pull up the Wayback Machine again and dig in harder, and … ✨I FOUND IT!!✨

Honestly, it is entirely possible that if we had not done this rewrite, there might be no Honeycomb. In the early days of the rewrite, we would ship something in Go and the world would break, over and over and over. As I said,

Rails HTTP processing is built on a philosophy of “be liberal in what you accept”. So developers end up inadvertently sending API requests that are undocumented or even non-RFC compliant … but Rails middleware cleans them up and handles it fine.

Rails would accept any old trash, Go would not. Breakage ensues. Tests couldn’t catch what we didn’t know to look for. Eventually we lit upon a workflow where we would split incoming production traffic, run each request against a Go API server and a Ruby API server, each backed by its own set of MongoDB replicas, and diff the responses. This is when we first got turned on to how incredibly powerful Scuba was, in its ability to compare individual responses, field by field, line by line.
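A hedged sketch of that shadow-and-diff idea (not Parse’s actual tooling, and Scuba did the real analysis): send each request to both backends and compare the responses field by field:

```python
def diff_responses(resp_a: dict, resp_b: dict, prefix: str = "") -> list:
    """Return (field_path, value_a, value_b) tuples for every field where
    the two responses disagree. A toy, recursive, field-by-field diff in
    the spirit of the shadowing workflow described above."""
    mismatches = []
    for key in sorted(set(resp_a) | set(resp_b)):
        path = f"{prefix}{key}"
        a, b = resp_a.get(key), resp_b.get(key)
        if isinstance(a, dict) and isinstance(b, dict):
            # Recurse into nested objects so mismatches get a full path.
            mismatches += diff_responses(a, b, prefix=path + ".")
        elif a != b:
            mismatches.append((path, a, b))
    return mismatches

# Hypothetical responses from the old Ruby backend and the new Go backend.
ruby = {"status": 200, "body": {"objectId": "abc123", "count": 3}}
go   = {"status": 200, "body": {"objectId": "abc123", "count": 4}}
print(diff_responses(ruby, go))  # -> [('body.count', 3, 4)]
```

Field-level mismatch records like these are exactly the kind of thing you can then slice by endpoint, app, and field to find where the new implementation diverges.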

Once you’ve used a tool like that, you’re hooked; you can’t possibly go back to metrics and aggregates. The rest, as they say, is history.

The whole thing is still pretty fun to read, even if I can still smell the blood and viscera a decade later. Enjoy.


“How We Moved Our API From Ruby to Go and Saved Our Sanity”

Originally posted on blog.parse.com on June 10th, 2015.

The first lines of Parse code were written nearly four years ago. In 2011 Parse was a crazy little idea to solve the problem of building mobile apps.

Those first few lines were written in Ruby on Rails.


Ruby on Rails

Ruby let us get the first versions of Parse out the door quickly. It let a small team of engineers iterate on it and add functionality very fast. There was a deep bench of library support, gems, deploy tooling, and best practices available, so we didn’t have to reinvent very many wheels.

We used Unicorn as our HTTP server, Capistrano to deploy code, RVM to manage the environment, and a zillion open source gems to handle things like YAML parsing, oauth, JSON parsing, MongoDB, and MySQL. We also used Chef (which is Ruby-based) to manage our infrastructure, so everything played together nicely. For a while.

The first signs of trouble bubbled up in the deploy process. As our code base grew, it took longer and longer to deploy, and the “graceful” unicorn restarts really weren’t very graceful. So, we monkeypatched rolling deploy groups into Capistrano.

“Monkeypatch” quickly became a key technical term that we learned to associate with our Ruby codebase.

A year and a half in, at the end of 2012, we had 200 API servers running on m1.xlarge instance types with 24 unicorn workers per instance. This was to serve 3000 requests per second for 60,000 mobile apps. It took 20 minutes to do a full deploy or rollback, and we had to do a bunch of complicated load balancer shuffling and pre-warming to prevent the API from being impacted during a deploy.

Then, Parse really started to take off and experience hockey-stick growth.


Problems

When our API traffic and number of apps started growing faster, we started having to rapidly spin up more database machines to handle the new request traffic. That is when the “one process per request” part of the Rails model started to fall apart.

With a typical Ruby on Rails setup, you have a fixed pool of worker processes, and each worker can handle only one request at a time. So any time you have a type of request that is particularly slow, your worker pool can rapidly fill up with that type of request. This happens too fast for things like auto-scaling groups to react. It’s also wasteful because the vast majority of these workers are just waiting on another service. In the beginning, this happened pretty rarely and we could manage the problem by paging a human and doing whatever was necessary to keep the API up. But as we started growing faster and adding more databases and workers, we added more points of failure and more ways for performance to get degraded.
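Little’s law (average concurrency = arrival rate × time in system) makes this failure mode easy to see. The traffic numbers below are invented for illustration; the 24-worker pool matches the per-instance unicorn count above:

```python
def workers_busy(rps: float, mean_latency_s: float) -> float:
    # Little's law: average concurrency = arrival rate x time in system.
    return rps * mean_latency_s

pool_size = 24  # one-request-per-process workers on a single box

# Fast traffic barely touches the pool...
print(workers_busy(rps=100, mean_latency_s=0.05))  # 5.0 workers busy

# ...but a slow request type (say, one waiting on a struggling database)
# saturates the pool at a tiny fraction of the request rate.
print(workers_busy(rps=10, mean_latency_s=3.0))    # 30.0 > pool_size
```

Ten slow requests per second is enough to fill the whole pool, at which point fast requests queue behind them and everything looks down, faster than any auto-scaler can react.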

We started looking ahead to when Parse would 10x its size, and realized that the one-process-per-request model just wouldn’t scale. We had to move to an async model that was fundamentally different from the Rails way. Yeah, rewrites are hard, and yeah they always take longer than anyone ever anticipates, but we just didn’t see how we could make the Rails codebase scale while it was tied to one process per request.


What next?

We knew we needed asynchronous operations. We considered a bunch of options:

EventMachine

We already had some of our push notification service using EventMachine, but our experience with it was not great as it scaled. We had constant trouble with accidentally introducing synchronous behavior or parallelism bugs. The vast majority of Ruby gems are not asynchronous, and many are not threadsafe, so it was often hard to find a library that did some common task asynchronously.

JRuby

This might seem like the obvious solution – after all, Java has threads and can handle massive concurrency. Plus it’s Ruby already, right? This is the solution Twitter investigated before settling on Scala. But since JRuby is still basically Ruby, it still has the problem of asynchronous library support. We were concerned about needing a second rewrite later, from JRuby to Java. And literally nobody at all on our backend or ops teams wanted to deal with deploying and tuning the JVM. The groans were audible from outer space.

C++

We had a lot of experienced C++ developers on our team. We also already had some C++ in our stack, in our Cloud Code servers that ran embedded V8. However, C++ didn’t seem like a great choice. Our C++ code was harder to debug and maintain. It seemed clear that C++ development was generally less productive than more modern alternatives. It was missing a lot of library support for things we knew were important to us, like HTTP request handling. Asynchronous operation was possible but often awkward. And nobody really wanted to write a lot of C++ code.

C#

C# was a strong contender. It arguably had the best concurrency model with async and await. The real problem was that C# development on Linux always felt like a second-class citizen. Libraries that interoperate with common open source tools are often unavailable in C#, and our toolchain would have had to change a lot.

Go

Go and C# both have asynchronous operation built into the language at a low level, making it easy for large groups of people to write asynchronous code. The MongoDB Go driver is probably the best MongoDB driver in existence, and complex interaction with MongoDB is core to Parse. Goroutines were much more lightweight than threads. And frankly we were most excited about writing Go code. We thought it would be a lot easier to recruit great engineers to write Go code than any of the other solid async languages.

In the end, the choice boiled down to C# vs Go, and we chose Go.


Wherein we rewrite the world

We started out by rewriting our EventMachine push backend from Ruby to Go. We did some preliminary benchmarking with Go concurrency and found that each network connection ate up only 4KB of RAM. After rewriting the EventMachine push backend to Go, we went from 250k connections per node to 1.5 million connections per node without even touching things like kernel tuning. Plus it seemed really fun. So, Go it was.

We rewrote some other minor services and started building new services in Go. The main challenge, though, was to rewrite the core API server that handles requests to api.parse.com while seamlessly maintaining backward compatibility. We rewrote this endpoint by endpoint, using a live shadowing system to avoid impacting production, and monitored the differential metrics to make sure the behaviors matched.

During this time, Parse 10x’d the number of apps on our backend and more than 10x’d our request traffic. We also 10x’d the number of storage systems backed by Ruby. We were chasing a rapidly moving target.

The hardest part of the rewrite was dealing with all the undocumented behaviors and magical mystery bits that you get with Rails middleware. Parse exposes a REST API, and Rails HTTP processing is built on a philosophy of “be liberal in what you accept”. So developers end up inadvertently sending API requests that are undocumented or even non-RFC compliant … but Rails middleware cleans them up and handles them fine.

So we had to port a lot of delightful behavior from the Ruby API to the Go API, to make sure we kept handling the weird requests that Rails handled. Stuff like doubly encoded URLs, weird content-length requirements, bodies in HTTP requests that shouldn’t have bodies, horrible oauth misuse, horrible mis-encoded Unicode.
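As one illustrative example – hypothetical code, not our actual handler – here’s the shape of tolerating doubly encoded input:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// lenientDecode is a hypothetical sketch of one Rails-ism: tolerating
// doubly percent-encoded values. If one decode pass still leaves a
// "%XX" escape behind, decode again, matching clients that encoded
// twice by mistake.
func lenientDecode(s string) (string, error) {
	once, err := url.QueryUnescape(s)
	if err != nil {
		return "", err
	}
	if strings.Contains(once, "%") {
		// Looks like it was encoded twice; try a second pass, but fall
		// back to the single decode if the second pass is invalid.
		if twice, err := url.QueryUnescape(once); err == nil {
			return twice, nil
		}
	}
	return once, nil
}

func main() {
	for _, s := range []string{"who%20am%20i", "who%2520am%2520i"} {
		out, err := lenientDecode(s)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%q -> %q\n", s, out) // both decode to "who am i"
	}
}
```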

Our Go code is now peppered with fun, cranky comments like these:

// Note: an unset cache version is treated by ruby as "".
// Because of this, dirtying this isn't as simple as deleting it -- we need to
// actually set a new value.

// This byte sequence is what ruby expects.
// yes that's a paren after the second 180, per ruby.

// Inserting and having an op is kinda weird: We already know
// state zero. But ruby supports it, so go does too.

// single geo query, don't do anything. stupid and does not make sense
// but ruby does it. Changing this will break a lot of client tests.
// just be nice and fix it here.

// Ruby sets various defaults directly in the structure and expects them to appear in cache.
// For consistency, we'll do the same thing.

Results

Was the rewrite worth it? Hell yes it was. Our reliability improved by an order of magnitude. More importantly, our API is not getting more and more fragile as we spin up more databases and backing services. Our codebase got cleaned up and we got rid of a ton of magical gems and implicit assumptions. Co-tenancy issues improved for customers across the board. Our ops team stopped getting massively burned out from getting paged and trying to track down and manually remediate Ruby API outages multiple times a week. And needless to say, our customers were happier too.

We now almost never have reliability-impacting events that can be traced back to the API layer – a massive shift from a year ago. Now when we have timeouts or errors, they’re usually constrained to a single app – because that app is issuing a very inefficient query that causes timeouts or full table scans, or because of a database-level co-tenancy problem that we can resolve by automatically rebalancing or filtering bad actors.

The asynchronous model had many other benefits. We were able to instrument everything the API was doing with counters and metrics, because these were no longer blocking operations that interfered with communicating with other services. We could downsize our provisioned API server pool by about 90%. And we were able to remove silos of isolated Rails API servers from our stack, drastically simplifying our architecture.
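The kind of cheap inline counting this enables looks something like this sketch with Go’s stdlib expvar package (metric names are invented):

```go
package main

import (
	"expvar"
	"fmt"
)

// expvar counters are atomic, safe to bump from any goroutine on the
// request path, and exported over HTTP at /debug/vars for scraping.
// The metric names here are hypothetical.
var (
	apiRequests = expvar.NewInt("api.requests")
	apiErrors   = expvar.NewInt("api.errors.5xx")
)

// recordRequest bumps counters inline; nothing here blocks the request.
func recordRequest(status int) {
	apiRequests.Add(1)
	if status >= 500 {
		apiErrors.Add(1)
	}
}

func main() {
	recordRequest(200)
	recordRequest(503)
	fmt.Println("requests:", apiRequests.Value(), "5xx:", apiErrors.Value())
}
```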

As if that weren’t enough, the time it takes to run our full integration test suite dropped from 25 minutes to 2 minutes, and the time to do a full API server deploy with rolling restarts dropped from 30 minutes to 3 minutes. The Go API server restarts gracefully, so no load balancer juggling or prewarming is necessary.

We love Go. We’ve found it really fast to deploy, really easy to instrument, really lightweight and inexpensive in terms of resources. It’s taken a while to get here, but the journey was more than worth it.

Credits/Blames

Credits/Blames go to Shyam Jayaraman for driving the initial decision to use Go, Ittai Golde for shepherding the bulk of the API server rewrite from start to finish, Naitik Shah for writing and open sourcing a ton of libraries and infrastructure underpinning our Go code base, and the rest of the amazing Parse backend SWE team who performed the rewrite.

How We Migrated the Parse API From Ruby to Golang (Resurrected)