First I wrote the wrong book, then I wrote the right book (xpost)

I’m not sure whether to say “thank you” or “HOW COULD YOU DO THIS TO ME”, but this one goes out to all the people who sent me advice on buying software last fall.

This is the second of a two-part series. The first part ended on a ✨cliffhanger!!!✨ — so if you missed the first episode, catch up here:

Six long weeks of writer’s block

I was merrily cranking away at what I believed to be my last chapter when I asked the internet — YOU guys — for help the first time. “Are you an experienced software buyer? I could use some help,” went up on September 19th, 2025.

The response was overwhelming. I heard from software engineers, SREs, observability leads, CTOs, VPs, distinguished engineers, consultants, even the odd CISO. All these emails and responses and lengthy threads kept me busy for a while, but eventually I had to get back to writing. That’s when I discovered, to my unpleasant surprise, that I couldn’t seem to write anymore.

“Well,” I reasoned, “maybe I’ll just ask the internet for EVEN MORE advice” — and out popped Buffy-themed post number two, on October 13th.

Keep in mind, I thought I would be done by then. November was my stretch deadline, my “just in case, I’d better leave myself some breathing room” kind of deadline.

As November 1st came and went, my frustration began spiraling into blind panic. What the hell is going on, and why can I not finish this???

In which I finally listen to the advice I asked for

A week before Thanksgiving, I was up late tinkering with Claude. I imported all the emails and advice I had gotten from y’all, and started sorting into themes and picking out key quotes, and that is when it finally hit me: I had written the wrong thing.

No, this deserves a bigger font.

✨I wrote the wrong thing.✨

I wrote the wrong thing, for the wrong people, and none of it was going to move the needle in any meaningful way.

The chapters I had written were full of practical advice for observability engineering teams and platform engineering teams wrestling with implementation challenges like instrumentation and cost overruns. Practical stuff.

Yes.

The internet was right (this ONE time)

My inbox, on the other hand, was overflowing with stories like these:

  • “Many times [competitive research] is faked. One person has their favorite option and then they do just enough ‘competitive analysis’ to convince the sourcing folks that due diligence was done or to nullify the CIO/CTO/whoever is accepting this on to their budget”
  • “We [the observability team] spent six months exhaustively trialing three different solutions before we made a decision. The CEO of one of the losing vendors called our CEO, and he overruled our decision without even telling us.” (Does your CEO know anything at all about engineering??) “No.”
  • “Our SRE teams have vetoed any attempt to modernize our tool stack. ($Vendor) is part of their identity, and since they would have to help roll out and support any changes, we are stuck living in 2015 apparently forever.” (What does management have to say?) “It’s been twenty years since they touched a line of code.”
  • “We’re weird in that most of the company hates technology and really hates that we have to pay for it since they don’t understand the value it brings to the company. This is intentional ignorance, we make the value props continually and well, we just haven’t succeeded yet….We’re a little obsessed with trying to get champagne quality at Boone’s prices.”
  • “When it comes to dealing with salespeople and the enterprise sales process, the best tip for engineers is to not anthropomorphize sales professionals who are driven by commission. The best ones are like robot lawn mowers dressed in furry unicorn costumes. They may seem cute and nice but they do not care about anything besides closing the next deal….All of the best SaaS companies are full of these friendly fake unicorn zombies who suck cash instead of blood.”

Nearly all of the emails I got were either describing a terminally fucked up buying process from the top down, or the long-term consequences of those fucked up decisions.

In other words: I was writing tactical advice for teams who were surviving in a strategic vacuum.

So I threw the whole thing out, and started over from scratch. 😭

Even good teams are struggling right now

As Tolstoy once wrote, “Happy teams are all alike; every fucked up team is fucked up in its own precious way.”

There is an infinity of ways to screw something up. But there is one pattern I see a critical mass of engineering orgs falling into right now, even orgs that are generally quite solid: there is no shared alignment, or even shared vocabulary, between engineering and the other stakeholders (directors, VPs and SVPs, the CTO and CIO, principal and distinguished engineers) on some pretty clutch questions. Such as:

  • “What is observability?”
  • “Who needs it?”
  • “What problem are we trying to solve?”

And my favorite: “Is observability still relevant in a post-AI era? Can’t agents do that stuff now?”

Even some generally excellent CTOs[1] have been heard saying things like, “yeah, observability is definitely very important, but all our top priorities are related to AI right now.”

Which gets causality exactly backwards. Your ability to get any return on your investments in AI will be limited by how swiftly you can validate your changes and learn from them. Another word for this is “OBSERVABILITY”.

Enough ranting. Want a peek? I’ll share the new table of contents, and a sentence or two about a couple of my own favorite chapters.

Part 6: “Observability Governance” (v2)

The new outline is organized to speak to technical decision-makers, starting at the top and loosely descending. What do CTOs need to know? What do VPs and distinguished engineers need to know? And so on. We start off abstract and become more concrete.

Since the technical terms (high cardinality, high dimensionality, and so on) have all become overloaded and undifferentiated by too much sales and marketing, we mostly avoid them. Instead, we use the language of systems and feedback loops.

Again, we are trying to help your most senior engineers and execs develop a shared understanding of “What problem are we solving?” and “What is our goal?” Technical terms can actually distract and detract from that shared understanding.

  1. An Open Letter to CTOs: Why Organizational Learning Speed is Now Your Biggest Constraint. Organizations used to be limited by the speed of delivery; now they are limited by how swiftly they can validate and understand what they delivered.
  2. Systems Thinking for Software Delivery. Observability is the signal that connects the dots to make a feedback loop; no observability, no loop. What happens to amplifying or balancing loops when that signal is lossy, laggy, or missing?
  3. The Observability Landscape Through a Systems Lens. What feedback loops do developers need, and what feedback loops does ops need? How do these map to the tools on the market?
  4. The Business Case for Observability. Is your observability a cost center or an investment? How should you quantify your ROI?
  5. Diagnosing Your Observability Investment
  6. The Organizational Shift
  7. Build vs Buy (vs Open Source)
  8. The Art and Science of Vendor Partnerships. Internal transformations run on trust and credibility; vendor partnerships run on trust and reciprocity. We’ll talk about both of these, as well as how to run a strong POC.
  9. Instrumentation for Observability Teams
  10. Where to Go From Here

Hey, I have a lot of empathy right now for leaders and execs who feel like they’re behind on everything. I feel it too. Anyone who doesn’t is lying to themselves (or their name is Simon Willison).

But the role observability plays in complex sociotechnical systems is one of those foundational concepts you need to understand. You’re not gonna get this right by accident. You’re not going to win by doing the same thing you were doing five years ago. And if you screw up your observability, you screw up everything downstream of it too.

To those of you who do understand this, and are working hard to drive change in your organizations: I see you. It is hard, often thankless work, but it is work worth doing. If I can ever be of help: reach out.

A longer book, but a better book

The last few chapters are heading into tech review on Friday, February 20th. Finally. The last 3.5 months have been some of the most panicky and stressful of my life. I… just typed several paragraphs about how terrible this has been, and deleted them, because you do not need to listen to me whine. ☺️

Like I said, I have never felt especially proud of the first edition. I am not UN-proud, it’s just…eh. I feel differently this time around. I think—I hope—it can be helpful to a lot of different people who are wrestling with adapting to our new AI-native reality, from a lot of different angles.[2]

Thanks, Christine. (Another for the folder marked “NOW YOU TELL ME”.)

I am incredibly grateful to my co-authors, collaborators, and our editor, Rita Fernando, without whom I never would have made it through.

But there’s one more group that deserves some credit, and it’s…you guys. I asked for help, and help I got. So many people wrote me such long, thought-provoking emails full of stories, advice and hard-earned wisdom. The better the email, the more I peppered you with followup questions, which is a great way to punish a good deed.

Blame these people

I am a tiny bit torn on whether to say “thank you” or “fuck you”, because my life would have been much nicer if I had stuck to the plan and wrapped in October.

But the following people were especially instrumental in forcing me to rethink my approach. That rethink made the book much stronger, so if you catch one of them in the wild, please buy them a stiff drink. (Or buy yourself one, and throw it in their face with my sincere compliments.)

  • Abraham Ingersoll, the aforementioned “odd CISO”, who would be quoted in the book had his advice not been so consistently unprintable by the standards of respectable publications
  • Benjamin Mann of Delivery Hero, who I would work for in a heartbeat, and not just for the way he wields “NOPE” as a term of art
  • Marty Lindsay, who has spent more time explaining POCs and tech evals to me than anyone should have to. (If you need an o11y consultant, Marty should be your very first stop).
  • Sam Dwyer, whose stories seeded my original plan to write a set of chapters for observability engineering teams. (I hope the replacement plan is useful too!)

Many others sent me terrific advice, and endured multiple rounds of questions and more questions and clarifications on said questions. A few of them:

Matthew Sanabria, Chris Cooney, Glen Mailer, Austin Culbertson, John Scancella, John Doran, Bryan Finster, Hazel Weakly, Chris Ziehr, Thomas Owens, Mike Lee, Jay Gengelbach, Will Hegedus, Natasha Litt, Alonso Suarez, Jason McMunn, Evgeny Rubtsov, George Chamales, Ken Finnegan, Cliff Snyder, Robyn Hirano, Rita Canavarro, Matt Schouten, Shalini Samudri Ananda Rao (Sam).

I am definitely forgetting some names; I will try to update the list as I remember them.

But seriously: thank you, from the bottom of my heart. I loved hearing your stories, your complaints, your arguments about how the world should improve. Your DNA is in this book; I hope it does you justice.

~charity
💜💙💚💛🧡❤️💖

 

[1] It’s ironic (and makes me uncomfortably self-conscious), but some of the worst top-down decision-making processes I have ever seen have come from companies where the CEO and CTO are both former engineers. The confidence they have in their own technical acumen may not be wholly unfounded, but it is often ten or more years out of date. We gotta update those priors, my friends. Stay humble.

[2] On the other hand, as my co-founder, Christine Yen, informed me last week: “Nobody reads books anymore.”


Martin Fowler told me the second edition should be shorter (it’s twice as long) (xpost)

90% new material, a clearer mission, and a rogues gallery of contributors. The second edition of Observability Engineering is almost done.

Last week I got to meet Martin Fowler for the first time in person. This was an exciting moment for me. Martin ranks high on my personal pantheon; he is, so far as I can tell, hardly ever wrong.

Martin Fowler, me, and Nathen Harvey, at the happy hour for Gergely’s Pragmatic Summit! Not pictured: my glass of wine, Nathen’s box of fucks to give

After he let me know that he “only hugs on this side of the Rockies” (noted), the topic of writing books came up, and he drawled,

“The second edition should always be shorter. I always made my second editions shorter. Shorter books are better books.”

God dammit, Martin. Now you tell me. 🙈

In completely unrelated news, our last few chapters for “Observability Engineering, 2nd edition” go out to tech reviewers this week! This puts us on track for dead tree publication in June, although chapters will be available earlier for O’Reilly subscribers as well as behind an email gate on the Honeycomb site.

What’s different about the second edition?

Almost everything. The only chapters that carry over some material are the ones on sampling, retriever (columnar store), and a smattering of the SLO stuff—maybe 10% all told? And we’ve added a monstrous amount of new material.

So no, it will not be shorter than the first edition. It is almost twice as long.[1] (Sorry!)

The first edition was a spaghetti mess

Books, as I understand it, are like children; if you made them, you are not allowed to say you aren’t proud of them.

So fine, I won’t say it. But I think we can all privately agree that the first edition was a bit of a hot mess.

No shade on my wonderful co-authors, Liz and George, or our O’Reilly editors, or myself for that matter. We did our best, but now, with the clarity of hindsight, it’s easy to see all the ways the ground was shifting under us as we wrote.

When we started the book in 2018, Honeycomb was the only observability company, and our definition of observability—high cardinality, high dimensionality, explorability—was the only definition. By the time the book came out in 2021, everyone was rebranding their products as observability, Gartner had waded into the fray… it was a mess.

Perhaps the mature thing to do would be to have gone back and rewritten the book in light of the evolving market definition. But while I won’t speak for my co-authors, after 3.5 years, I was pretty fucking desperate to be done.

Artist’s rendering of the traditional authorial glow of pride, joy and deep satisfaction upon completing any book manuscript

I swore I would never go through that again. And when O’Reilly first approached us about writing a second edition, my first reaction was blind panic.

The second edition has a clearer mission

But once my lizard brain calmed down, I realized two things. Number one, it absolutely needed to be written; number two, I definitely wanted to help write it.

SO MUCH has changed. SO MUCH needs saying. When we met up in June to pull together a new outline, it seemed to just flow out of us.

A few of the many things that were not at all clear in 2018, but are crystal clear today:

  • Who we are writing for (software engineers)
  • What they are doing (instrumenting their code and analyzing it in production, with and without AI)
  • What observability means to analysts and the market at large (literally anything to do with software telemetry)
  • The integrations game is over, and OpenTelemetry has won (quick sketch after this list)
  • Most companies still don’t have real observability. And they don’t know it. 😕
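
To make “instrumenting their code” and “OpenTelemetry has won” a bit more concrete, here is a minimal sketch of what vendor-neutral instrumentation looks like with the OpenTelemetry Python API. (The service name, span name, and attributes below are invented for illustration; a real setup would export to your observability backend of choice rather than the console.)

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire up a tracer provider that prints spans to stdout. In production you
# would swap ConsoleSpanExporter for an OTLP exporter pointed at your backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name


def process_order(order_id: str, item_count: int) -> None:
    # One wide span per unit of work, annotated with the high-cardinality
    # attributes you will actually want to query on later.
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("cart.item_count", item_count)
        # ...business logic goes here...


process_order("ord_12345", 3)
```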

I am excited and incredibly grateful for the opportunity to take a second whack at this book in the era of AI. Not how I thought I’d feel, but I will take it.

The first edition of “Observability Engineering” was translated into Japanese, Korean, and Chinese (I believe it’s Mandarin?).

We brought Austin Parker on as a fourth co-author very early, with a special emphasis on topics related to OpenTelemetry and AI.

We also invited a number of people we admire to contribute in a variety of formats… guest chapters, use cases, stories, embedded advice, and more:

  • Jeremy Morrell on how to instrument your code effectively
  • Hanson Ho and Matt Klein on observability for mobile and frontend
  • Kesha Mykhailov and Darragh Curran from Intercom on fast feedback loops and developing with LLMs
  • Dale McDiarmid on how to use Clickhouse for observability workloads
  • Rick Clark on the mechanics of driving organizational transformation in order to build and learn at the speed of AI
  • Frank Chen, a returning champion from our first edition, wrote about ontologies for your instrumentation chain
  • Phillip Carter wrote about eval pipelines and instrumenting LLMs
  • Mat Vine has a case study about moving ANZ from thresholds to SLIs/SLOs
  • Mike Kelly on managing telemetry pipelines for fun and profit
  • Hugo Santos on how to instrument your CI/CD pipelines
  • Peter Corless made our chapter on “Build vs Buy (vs Open Source)” immensely better and more well-rounded

What a fucking list, huh? 🙌

Truly, this book is a veritable rogues gallery of engineers and companies we look up to (including some of our own direct competitors 😉). The one thing all these people have in common (besides being great writers with a unique perspective, and people who are willing to return our emails) is that we share a similar vision for observability and the future of software development.

Spotted this week: Nathen Harvey, walking around, giving out fucks by the handful.

In addition to the sections written for software engineers on “Instrumentation Fundamentals” and “Analysis Workflows”, both with and without AI, we have a section on “Observability Use Cases” and another on “Technical Deep Dives”, which lets us cover even more ground.

Which brings us to the last section, the one that I personally signed up to write.

Part 6: “Observability Governance”

When we met in June, I successfully pitched Liz and George on adding one final section: “Observability Governance”. Unlike the rest of the book, these chapters would be written for the observability team, or the platform engineering team, or whoever is wrestling with problems like cost containment and tool migrations.

I sketched out a few ideas and started writing. July passed, August, September…I was cranking out one governance chapter per month, right on track, planning to wrap up well before November.

In September, halfway through my last chapter, I reached out to the internet for advice. “Are you an experienced software buyer? I could use some help.”

The response was ✨tremendous✨; my inbox swelled with interesting stories, bitter rants, lessons learned, and practical tips from engineers and executives alike.

But when I tried to finish the chapter, my engine stalled out. I could not write. I kept doggedly blocking off time on my calendar, silencing interruptions, staring at drafts, writing and rewriting, trying every angle. Four weeks passed with no progress made.

Five weeks. Six.

Cliffhanger!

Tomorrow I’ll publish the second half of this story, in which the due date for my chapters comes and goes, and I end up throwing away everything I had written and starting over from scratch. Good times!

 

[1] If we ever write a third edition, I swear on the lives of my theoretical children that it will be MUCH shorter than this one.
