Your Data is Made Powerful By Context (so stop destroying it already) (xpost)

In logs as in life, the relationships are the most important part. AI doesn’t fix this. It makes it worse.

(cross-posted)

After twenty years of devops, most software engineers still treat observability like a fire alarm — something you check when things are already on fire.

Not a feedback loop you use to validate every change after shipping. Not the essential, irreplaceable source of truth on product quality and user experience.

This is not primarily a culture problem, or even a tooling problem. It’s a data problem. The dominant model for telemetry collection stores each type of signal in a different “pillar”, which rips the fabric of relationships apart — irreparably.

Your observability data is self-destructing at write time

The three pillars model works fine for infrastructure[1], but it is catastrophic for software engineering use cases, and will not serve for agentic validation.

But why? It’s a flywheel of compounding factors, not just one thing, but the biggest one by far is this:

✨Data is made powerful by context✨

The more context you collect, the more powerful it becomes

Your data does not become linearly more powerful as you widen the dataset, it becomes exponentially more powerful. Or if you really want to get technical, it becomes combinatorially more powerful as you add more context.

I made a little Netlify app here where you can enter how many attributes you store per log or trace, to see how powerful your dataset is.

  • 4 fields? 6 pairwise combos, 15 possible combinations.
  • 8 fields? 28 pairwise combos, 255 possible combinations.
  • 50 fields? 1.2K pairwise combos, 1.1 quadrillion (2^50) possible combinations, as seen in the screenshot below.

When you add another attribute to your structured log events, it doesn’t just give you “one more thing to query”. It gives you new combinations with every other field that already exists.
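To make the arithmetic concrete, here is a quick sketch (not the actual Netlify app, just the same math) that computes both numbers for any field count:

```python
from math import comb

def pairwise_combos(n_fields: int) -> int:
    # Ways to pick any 2 attributes out of n: C(n, 2)
    return comb(n_fields, 2)

def total_combos(n_fields: int) -> int:
    # Every non-empty subset of attributes is a distinct slice: 2^n - 1
    return 2 ** n_fields - 1

for n in (4, 8, 50):
    print(f"{n} fields: {pairwise_combos(n)} pairwise, {total_combos(n):,} total")
```

Going from 8 to 50 fields multiplies the pairwise combinations by roughly 44, but the total combinations by over four trillion: that is the combinatorial explosion at work.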

The wider your data is, the more valuable the data becomes. Click on the image to go futz around with the sliders yourself.

Note that this math is exclusively concerned with attribute keys. Once you account for values, the precision of your tooling goes higher still, especially if you handle high cardinality data.

Data is made valuable by relationships

“Data is made valuable by context” is another way of saying that the relationships between attributes are the most important part of any data set.

This should be intuitively obvious to anyone who uses data. How valuable is the string “Mike Smith”, or “21 years old”? Stripped of context, they hold no value.

By spinning your telemetry out into siloes based on signal type, the three pillars model ends up destroying the most valuable part of your data: its relational seams.

AI-SRE agents don’t seem to like three pillars data

I posted something on LinkedIn yesterday, and got a pile of interesting comments. One came from Kyle Forster, founder of an AI-SRE startup called RunWhen, who linked to an article he wrote called “Do Humans Still Read Logs?”

Humpty Dumpty traced every span, Humpty Dumpty had a great plan.

In his article, he noted that fewer than 30% of their AI SRE’s tool calls went to “traditional observability data”, i.e. metrics, logs and traces. Instead, they used the instrumentation generated by other AI tools to wrap calls and queries. His takeaway:

Good AI reasoning turns out to require far less observability data than most of us thought when it has other options.

My takeaway is slightly different. After all, the agent still needed instrumentation and telemetry in order to evaluate what was happening. That’s still observability, right?

But as Kyle tells it, the agents went searching for a richer signal than the three pillars were giving them. They went back to the source to get the raw, undigested telemetry with all its connective tissue intact. That’s how important it was to them.

Huh.

You can’t put Humpty back together again

I’ve been hearing a lot of “AI solves this”, and “now that we have MCPs, AI can do joins seamlessly across the three pillars”, and “this is a solved problem”.

Mmm. Joins across data siloes can be better than nothing, yes. But they don’t restore the relational seams. They don’t get you back to the mathy good place where every additional attribute makes every other attribute exponentially more valuable. At agentic speed, that reconstruction becomes a bottleneck and a failure surface.

Humpty Dumpty stored all the state, Humpty Dumpty forgot to replicate.

Our entire industry is trying to collectively work out the future of agentic development right now. The hardest and most interesting problems (I think) are around validation. How do we validate a change rate that is 10x, 100x, 1000x greater than before?

I don’t have all the answers, but I do know this: agents are going to need production observability with speed, flexibility, TONS of context, and some kind of ontological grounding via semantic conventions.

In short: agents are going to need precision tools. And context (and cardinality) are what feed precision.

Production is a very noisy place

Production is a noisy, rowdy place of chaos, particularly at scale. If you are trying to do anomaly detection with no a priori knowledge of what to look for, the anomaly has to be fairly large to be detected. (Or else you’re detecting hundreds of “anomalies” all the time.)

But if you do have some knowledge of intent, along with precision tooling, these anomalies can be tracked and validated even when they are exquisitely minute. Like even just a trickle of requests[2] out of tens of millions per second.

Let’s say you work for a global credit card provider. You’re rolling out a code change to partner payments, which are “only” tens of thousands of requests per second — a fraction of your total request volume of tens of millions of req/sec, but an important one.

This is a scary change, no matter how many tests you ran in staging. To test this safely in production, you decide to start by rolling the new build out to a small group of employee test users, and oh, what the hell — you make another feature flag that lets any user opt in, and flip it on for your own account.

You wait a few days. You use your card a few times. It works (thank god).

On Monday morning you pull up your observability data and select all requests containing the new build_id or commit hash, as well as all of the feature flags involved. You break down by endpoint, then start looking at latency, errors, and distribution of request codes for these requests, comparing them to the baseline.
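That workflow can be sketched in a few lines. This is a toy illustration over in-memory event dicts, not any vendor’s query language, and the field names (build_id, endpoint, duration_ms) are invented for the example:

```python
from statistics import quantiles

def p95(durations):
    # 95th percentile: the last of 20 quantile cut points
    return quantiles(durations, n=20)[-1]

def compare_by_endpoint(events, canary_build):
    """Split events into canary vs baseline by build_id,
    then compare p95 latency per endpoint."""
    canary, baseline = {}, {}
    for e in events:
        bucket = canary if e["build_id"] == canary_build else baseline
        bucket.setdefault(e["endpoint"], []).append(e["duration_ms"])
    return {
        endpoint: {"canary_p95": p95(durs), "baseline_p95": p95(baseline[endpoint])}
        for endpoint, durs in canary.items()
        if endpoint in baseline  # only compare endpoints seen in both sets
    }
```

An endpoint whose canary_p95 sits well above its baseline_p95 is exactly the “not timing out, but slower” smell this kind of comparison is built to surface.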

Hm — something doesn’t seem quite right. Your test requests aren’t timing out, but they are taking longer to complete than the baseline set. Not for all requests, but for some.

Further exploration lets you isolate the affected requests to a set with a particular query hash. Oops. How’d that n+1 query slip in undetected?

You quickly submit a fix, ship a new build_id, and roll your change out to a larger group: this time, it’s going out to 1% of all users in a particular region.

The anomalous requests may have been only a few dozen per day, spread across many hours, in a system that served literally billions of requests in that time.

Humpty Dumpty: assembled, redeployed, A patchwork of features half-built, half-destroyed. “It’s not what we planned,” said the architect, grim. “But the monster is live — and the monster is him.”

Precision tooling makes them findable. Imprecise tooling makes them unfindable.

How do you expect your agents to validate each change, if the consequences of each change cannot be found?[3]

Well, one might ask, how have we managed so far? The answer is: by using human intuition to bridge the gaps. This will not work for agents. Our wisdom must be encoded into the system, or it does not exist.

Agents need speed, flexibility, context, and precision to validate in prod

In the past, excruciatingly precise staged rollouts like these have been mostly the province of your Googles and Facebooks. Progressive deployments have historically required a lot of tooling and engineering resources.

Agentic workflows are going to make these automated validation techniques much easier and more widely used; at the exact same time, agents developing to spec are going to require a dramatically higher degree of precision and automated validation in production.

It is not just the width of your data that matters when it comes to getting great results from AI. There’s a lot more involved in optimizing data for reasoning, attribution, or anomaly detection. But capturing and preserving relationships is at the heart of all of it.

In this situation, as in so many others, AI is both the sickness and the cure[4]. Better get used to it.


1 — Infrastructure teams use the three pillars for one extremely good reason: they have to operate a lot of code they did not write and cannot change. They have to slurp up whatever metrics or logs the components emit and store them somewhere.

2 — Yes, there are some complications here that I am glossing past, ones that start with ‘s’ and rhyme with “ampling”. However, the rich data + sampling approach to the cost-usability balance is generally satisfied by dropping the least valuable data. The three pillars approach to the cost-usability problem is generally satisfied by dropping the MOST valuable data: cardinality and context.

3 — The needle-in-a-haystack search is one visceral illustration of the value of rich context and precision tooling, but there are many others. Another example: wouldn’t it be nice if your agentic task force could check up on any diffs that involve cache key or schema changes, say, once a day for the next 6-12 months? These changes famously take a long time to manifest, by which time everyone has forgotten that they happened.

4 — One sentence I have gotten a ton of mileage out of lately: “AI, much like alcohol, is both the cause of and solution to all of life’s problems.”

My (hypothetical) SRECon26 keynote (xpost)

One year ago, Fred Hebert and I delivered the closing keynote at SRECon25. Looking back on it now, I can hardly connect with how I felt then. Here’s what I’d say one year later.

Crossposted from here. 

Hey, it’s almost time for SRECon 2026! (I can’t go, but YOU really should!)

Which means it was almost a year ago that Fred Hebert and I were up on stage, delivering the closing keynote[1] at SRECon25.

We argued that SREs should get involved and skill up on generative AI tools and techniques, instead of being naysayers and peanut gallerians. You can get a feel for the overall vibe from the description:

It’s easy to be cynical when there’s this much hype and easy money flying around, but generative AI is not a fad; it’s here to stay.

Which means that even operators and cynics — no, especially operators and cynics — need to get off the sidelines and engage with it. How should responsible, forward-looking SREs evaluate the truth claims being made in the market without being reflexively antagonistic?

Yep, that was our big pitch. Don’t be reflexively antagonistic. You should learn AI so that your critiques will land with credibility.

That is not the message I would give today, if I were keynoting SRECon26.

I came out of a hole, and the world had changed

I’ve been in a bit of a hole for the past few months, trying to get the second edition of “Observability Engineering” written and shipped.

Maybe the hole is why this feels so abrupt and discontinuous to me. Or maybe it’s just having such a clear artifact of my views one year ago. I don’t know.

What I do know is that one year ago, I still thought of generative AI as one more really big integration or use case we had to support, whether we liked it or not. Like AI was a slop-happy toddler gone mad in our codebase, and our sworn duty as SREs was to corral and control it, while trying not to be a total dick about it.

Today, it’s very clear to me that the center of gravity has shifted from cloud/automation workflows to AI/generation workflows, and that the agentic revolution has only just begun. That toddler is heading off to school. With a loaded gun.

When the facts change, I change my mind

I don’t know when exactly that bit flipped in my head, I only know that it did. And as soon as it did, I felt like the last person on earth to catch on. I can barely connect with my own views from eleven months ago.

Were my views unreasonably pessimistic? Was I willfully ignoring credible evidence in early 2025?

Hmm, perhaps. But Silicon Valley hype trains have not exactly covered themselves in glory in recent years. VR/AR, crypto/web3/NFTs, wearable tech, the Metaverse, 3D printing, the sharing economy…this is not an illustrious string of wins.[2]

Cloud computing, on the other hand: genuinely huge. So was the Internet. Sometimes the hype train brings you internets, sometimes the hype train brings you tulips.

So no, I don’t think it was obvious in early 2025 that AI generated code would soon grow out of its slop phase. Skepticism was reasonable for a time, and then it was not. I know a lot of technologists who flipped the same bit at some point in 2025.

The keynote I would give today

If I was giving the keynote at SRECon 2026, I would ditch the begrudging stance. I would start by acknowledging that AI is radically changing the way we build software. It’s here, it’s happening, and it is coming for us all.

1 — This is happening

It is very, very hard to adjust to change that is being forced on you. So please don’t wait for it to be forced on you. Swim out to meet it. Find your way in, find something to get excited about.

As Adam Jacob recently advised,

“If you’re an engineer or an operations person, there is only one move. You have to start working in this new way as much as you can. If you can’t do it at work, do it at home. You want to be on the frontier of this change, because the career risk to being a laggard is incredibly high.” — Adam Jacob

This AI shit is not hard. The early days of any technology are the simplest, and this technology more than most. Conquer the brain weasels in your head by learning the truth of this for yourself.

2 — Know thyself

At a time of elevated uncertainty and anxiety, our natural human tendency to drift into confirmation bias and disconfirmation bias is higher than ever. Whatever proof you instinctively seek out, you are guaranteed to find.

The best advice I can give anyone is: know your nature, and lean against it.

  • If you are a reflexive naysayer or a pessimist, know that, and force yourself to find a way in to wonder, surprise and delight.
  • If you are an optimist who gets very excited and tends to assume that everything will improve: know that, and force yourself to mind real cautionary tales.

Try to keep your aperture wide, and remain open to possibilities you find uncomfortable. Curate the ocean you swim in. Puncture your bubble.

3 — Don’t panic

Don’t panic, and don’t give in to despair. The future isn’t written yet, and nobody knows what’s going to happen. I sure as hell don’t. Neither do you.

The fact that AI has radically changed the way we develop software in a very short time, and seems poised to change it much more in the next year or two, is real and undeniable.

This does not mean that everything else predicted by AI optimists will come to pass.

Extraordinary claims still require extraordinary evidence. AGI is, at present, an elaborate thought experiment, one that contradicts all the evidence we currently have about how technological breakthroughs typically yield enormous change in the early days, and then plateau.

We are all technologists now

Here’s another Adam quote I really like:

The bright side is that it’s a technology shift, not a manufacturing shift – meaning you still have to have technologists to do it.

I’ve written a number of blog posts over the years where I have advised people to go into the second half of their career thinking of themselves not as “engineers” or as “managers”, but as “technologists”.[3]

Every great technologist needs an arsenal of skills on top of their technical expertise. They need to understand how to navigate an organization, how to translate between the language of technology and the language of the business; how to wield influence and drive results across team, company, even industry lines.

These remain durable skills, in an era where good code can be generated practically for free.

This is the moment for pragmatists

Many people who love the art and craft of software are struggling in this moment, as the value of that craft is diminishing.

People who take a much more…functional…approach to software seem to be thriving in the present chaos. “Functional” describes most of the SREs I know, including myself.

After all, SREs have always been judged by outcomes — uptime, reliability, whether the thing kept running. An outcome orientation turns out to be excellent preparation for a world where the “how” of software is becoming less important than the what and the whether, across the board.

So maybe the advice we gave at SRECon wasn’t so bad after all. Especially this part:

Which means that even operators and cynics — no, especially operators and cynics — need to get off the sidelines and engage with it.

Who can build better guardrails for AI, than SREs and operators who have spent their entire careers building guardrails for software engineers and customers?

The industry needs us. But not begrudgingly, eyerollingly, pretending to get on board in order to slow things down from the inside. The industry needs our skills to help engineering teams go fast forever.

Don’t sit back and wait for change to reach you. Run towards the waves. It’s nice out here.

1 — Our talk was called “AIOps: Prove it! An Open Letter to Vendors Selling AI for SREs”. In retrospect, this was a terrible title. It was not an open letter to vendors at all; if anything, it was an open letter to SREs. It started out as one topic, but by the time the event rolled around, it had morphed into something entirely different. Ah well.

2 — I am not even listing the kooky religious shit like effective accelerationism, transhumanism, AI “alignment” or the Singularity, all of which has seeped into the water table around these parts.

3 — Omg, I have so many unwritten posts wriggling around in my brain right now on this topic.

First I wrote the wrong book, then I wrote the right book (xpost)

I’m not sure whether to say “thank you” or “HOW COULD YOU DO THIS TO ME”, but this one goes out to all the people who sent me advice on buying software last fall.

This is the second episode in a two-part series. The first part ended on a ✨cliffhanger!!!✨ — so if you missed the first episode, catch up here:

Six long weeks of writer’s block

I was merrily cranking away at what I believed to be my last chapter when I asked the internet — YOU guys — for help the first time. “Are you an experienced software buyer? I could use some help,” went up on September 19th, 2025.

The response was overwhelming. I heard from software engineers, SREs, observability leads, CTOs, VPs, distinguished engineers, consultants, even the odd CISO. All these emails and responses and lengthy threads kept me busy for a while, but eventually I had to get back to writing. That’s when I discovered, to my unpleasant surprise, that I couldn’t seem to write anymore.

“Well,” I reasoned, “maybe I’ll just ask the internet for EVEN MORE advice” — and out popped Buffy-themed post number two, on October 13th.

Keep in mind, I thought I would be done by then. November was my stretch deadline, my just-in-case, better-leave-myself-some-breathing-room kind of deadline.

As November 1st came and went, my frustration began spiraling out into blind panic. What the hell is going on and why can I not finish this???

In which I finally listen to the advice I asked for

A week before Thanksgiving, I was up late tinkering with Claude. I imported all the emails and advice I had gotten from y’all, and started sorting into themes and picking out key quotes, and that is when it finally hit me: I had written the wrong thing.

No, this deserves a bigger font.

✨I wrote the wrong thing.✨

I wrote the wrong thing, for the wrong people, and none of it was going to move the needle in any meaningful way.

The chapters I had written were full of practical advice for observability engineering teams and platform engineering teams, wrestling with implementation challenges like instrumentation and cost overflows. Practical stuff.

Yes.

The internet was right (this ONE time)

My inbox, on the other hand, was overflowing with stories like these:

  • “Many times [competitive research] is faked. One person has their favorite option and then they do just enough ‘competitive analysis’ to convince the sourcing folks that due diligence was done or to nullify the CIO/CTO/whoever is accepting this on to their budget”
  • “We [the observability team] spent six months exhaustively trialing three different solutions before we made a decision. The CEO of one of the losing vendors called our CEO, and he overruled our decision without even telling us.” (Does your CEO know anything at all about engineering??) “No.”
  • “Our SRE teams have vetoed any attempt to modernize our tool stack. ($Vendor) is part of their identity, and since they would have to help roll out and support any changes, we are stuck living in 2015 apparently forever.” (What does management have to say?) “It’s been twenty years since they touched a line of code.”
  • “We’re weird in that most of the company hates technology and really hates that we have to pay for it since they don’t understand the value it brings to the company. This is intentional ignorance, we make the value props continually and well, we just haven’t succeeded yet….We’re a little obsessed with trying to get champagne quality at Boone’s prices.”
  • “When it comes to dealing with salespeople and the enterprise sales process, the best tip for engineers is to not anthropomorphize sales professionals who are driven by commission. The best ones are like robot lawn mowers dressed in furry unicorn costumes. They may seem cute and nice but they do not care about anything besides closing the next deal….All of the best SaaS companies are full of these friendly fake unicorn zombies who suck cash instead of blood.”

Nearly all of the emails I got were either describing a terminally fucked up buying process from the top down, or the long term consequences of those fucked up decisions.

In other words: I was writing tactical advice for teams who were surviving in a strategic vacuum.

So I threw the whole thing out, and started over from scratch. 😭

Even good teams are struggling right now

As Tolstoy once wrote, “Happy teams are all alike; every fucked up team is fucked up in its own precious way.”

There is an infinity of ways to screw something up. But there is one pattern I see a critical mass of engineering orgs falling into right now, even orgs that are generally quite solid. That is when there is no shared alignment, or even shared vocabulary, between engineering and the other stakeholders (directors, VPs and SVPs, the CTO, the CIO, principal and distinguished engineers) on some pretty clutch questions. Such as:

  • “What is observability?”
  • “Who needs it?”
  • “What problem are we trying to solve?”

And my favorite: “Is observability still relevant in a post-AI era? Can’t agents do that stuff now?”

Even some generally excellent CTOs[1] have been heard saying things like, “yeah, observability is definitely very important, but all our top priorities are related to AI right now.”

Which gets causality exactly backwards. Because your ability to get any returns on your investments into AI will be limited by how swiftly you can validate your changes and learn from them. Another word for this is “OBSERVABILITY”.

Enough ranting. Want a peek? I’ll share the new table of contents, and a sentence or two about a couple of my own favorite chapters.

Part 6: “Observability Governance” (v2)

The new outline is organized to speak to technical decision-makers, starting at the top and loosely descending. What do CTOs need to know? What do VPs and distinguished engineers need to know? and so on. We start off abstract, and become more concrete.

Since every technical term (e.g. high cardinality, high dimensionality, etc) has become overloaded and undifferentiated by too much sales and marketing, we mostly avoid it. Instead, we use the language of systems and feedback loops.

Again, we are trying to help your most senior engineers and execs develop a shared understanding of “What problem are we solving?” and “What is our goal?” Technical terms can actually detract and distract from that shared understanding.

  1. An Open Letter to CTOs: Why Organizational Learning Speed is Now Your Biggest Constraint. Organizations used to be limited by the speed of delivery; now they are limited by how swiftly they can validate and understand what they delivered.
  2. Systems Thinking for Software Delivery. Observability is the signal that connects the dots to make a feedback loop; no observability, no loop. What happens to amplifying or balancing loops when that signal is lossy, laggy, or missing?
  3. The Observability Landscape Through a Systems Lens. What feedback loops do developers need, and what feedback loops does ops need? How do these map to the tools on the market?
  4. The Business Case for Observability. Is your observability a cost center or an investment? How should you quantify your RoI?
  5. Diagnosing Your Observability Investment
  6. The Organizational Shift
  7. Build vs Buy (vs Open Source)
  8. The Art and Science of Vendor Partnerships. Internal transformations run on trust and credibility; vendor partnerships run on trust and reciprocity. We’ll talk about both of these, as well as how to run a strong POC.
  9. Instrumentation for Observability Teams
  10. Where to Go From Here

Hey, I have a lot of empathy right now for leaders and execs who feel like they’re behind on everything. I feel it too. Anyone who doesn’t is lying to themselves (or their name is Simon Willison).

But the role observability plays in complex sociotechnical systems is one of those foundational concepts you need to understand. You’re not gonna get this right by accident. You’re not going to win by doing the same thing you were doing five years ago. And if you screw up your observability, you screw up everything downstream of it too.

To those of you who do understand this, and are working hard to drive change in your organizations: I see you. It is hard, often thankless work, but it is work worth doing. If I can ever be of help: reach out.

A longer book, but a better book

The last few chapters are heading into tech review on Friday, February 20th. Finally. The last 3.5 months have been some of the most panicky and stressful of my life. I… just typed several paragraphs about how terrible this has been, and deleted them, because you do not need to listen to me whine. ☺️

Like I said, I have never felt especially proud of the first edition. I am not UN proud, it’s just…eh. I feel differently this time around. I think—I hope—it can be helpful to a lot of different people who are wrestling with adapting to our new AI-native reality, from a lot of different angles.[2]

Thanks, Christine. (Another for the folder marked “NOW YOU TELL ME”)

I am incredibly grateful to my co-authors, collaborators, and our editor, Rita Fernando, without whom I never would have made it through.

But there’s one more group that deserves some credit, and it’s…you guys. I asked for help, and help I got. So many people wrote me such long, thought-provoking emails full of stories, advice and hard-earned wisdom. The better the email, the more I peppered you with followup questions, which is a great way to punish a good deed.

Blame these people

I am a tiny bit torn on whether to say “thank you” or “fuck you”, because my life would have been much nicer if I had stuck to the plan and wrapped in October.

But the following list of people were especially instrumental in forcing me to rethink my approach. It made the book much stronger, so if you catch one of them in the wild, please buy them a stiff drink. (Or buy yourself one, and throw it in their face with my sincere compliments.)

  • Abraham Ingersoll, the aforementioned “odd CISO”, who would be quoted in the book had his advice not been so consistently unprintable by the standards of respectable publications
  • Benjamin Mann of Delivery Hero, who I would work for in a heartbeat, and not just for the way he wields “NOPE” as a term of art
  • Marty Lindsay, who has spent more time explaining POCs and tech evals to me than anyone should have to. (If you need an o11y consultant, Marty should be your very first stop).
  • Sam Dwyer, whose stories seeded my original plan to write a set of chapters for observability engineering teams. (I hope the replacement plan is useful too!)

Many others sent me terrific advice, and endured multiple rounds of questions and more questions and clarifications on said questions. A few of them:

Matthew Sanabria, Chris Cooney, Glen Mailer, Austin Culbertson, John Scancella, John Doran, Bryan Finster, Hazel Weakly, Chris Ziehr, Thomas Owens, Mike Lee, Jay Gengelbach, Will Hegedus, Natasha Litt, Alonso Suarez, Jason McMunn, Evgeny Rubtsov, George Chamales, Ken Finnegan, Cliff Snyder, Robyn Hirano, Rita Canavarro, Matt Schouten, Shalini Samudri Ananda Rao (Sam).

I am definitely forgetting some names; I will try to update the list as I remember them.

But seriously: thank you, from the bottom of my heart. I loved hearing your stories, your complaints, your arguments about how the world should improve. Your DNA is in this book; I hope it does you justice.

~charity
💜💙💚💛🧡❤️💖

[1] It’s ironic (and makes me uncomfortably self-conscious), but some of the worst top-down decision-making processes I have ever seen have come from companies where CEO and CTO are both former engineers. The confidence they have in their own technical acumen may be not wholly unfounded, but it is often ten or more years out of date. We gotta update those priors, my friends. Stay humble.

[2] On the other hand, as my co-founder, Christine Yen, informed me last week: “Nobody reads books anymore.”

On Friday Deploys: Sometimes that Puppy Needs Murdering (xpost)

(Cross posted from its original source)

‘Tis the season that one of my favorite blog posts gets pulled out and put in rotation, much like “White Christmas” on the radio station. I’m speaking, of course, of “Friday Deploy Freezes are Exactly Like Murdering Puppies” (old link on WP).

This feels like as good a time as any to note that I am not as much of an extremist as people seem to think I am when it comes to Friday deploys, or deploy freezes in general.

(Sometimes I wonder why people think I’m such an extremist, and then I remember that I did write a post about murdering puppies. Ok, ok. Point taken.)

Take this recent thread from LinkedIn, where Michael Davis posted an endorsement of my Puppies article along with his own thoughts on holiday code freezes, followed by a number of smart, thoughtful comments on why this isn’t actually attainable for everyone. Payam Azadi talks about an “icing” and “defrosting” period where you ease into and out of deploy freezes (never heard of this, I like it!), and a few other highly knowledgeable folks chime in with their own war stories and cautionary tales.

It’s a great thread, with lots of great points. I recommend reading it. I agree with all of them!!

For the record, I do not believe that everyone should get rid of deploy freezes, on Fridays or otherwise.

If you do not have the ability to move swiftly with confidence, which in practice means "you can generally find problems in your new code before your customers do", then deploy freezes before a holiday or a big event, or hell, even weekends, are probably the sensible thing to do. That confidence generally comes down to the quality and usability of your observability tooling, and your ability to explore high-cardinality dimensions in real time (which most teams do not have).

If you can’t do the “right” thing, you find a workaround. This is what we do, as engineers and operators.

Deploy freezes are a hack, not a virtue

Look, you know your systems better than I do. If you say you need to freeze deploys, I believe you.

Honestly, I feel like I’ve always been fairly pragmatic about this. The one thing that does get my knickers in a twist is when people adopt a holier-than-thou posture towards their Friday deploy freezes. Like they’re doing it because they Care About People and it’s the Right Thing To Do and some sort of grand moral gesture. Dude, it’s a fucking hack. Just admit it.

It’s the best you can do with the hand you’ve been dealt, and there’s no shame in that! That is ALL I’m saying. Don’t pat yourself on the back, act a little sheepish, and I am so with you.

I think we can have nice things

I think there’s a lot of wisdom in saying “hey, it’s the holidays, this is not the time to be rushing new shit out the door absent some specific forcing function, alright?”

My favorite time of year to be at work (back when I worked in an office) was always the holidays. It was so quiet and peaceful, the place was empty, my calendar was clear, and I could switch gears and work on completely different things, out of the critical line of fire. I feel like I often peaked creatively during those last few weeks of the year.

I believe we can have the best of both worlds: a yearly period of peace and stability, with relatively low change rate, and we can evade the high stakes peril of locks and freezes and terrifying January recoveries.

How? Two things.

Don’t freeze deploys. Freeze merges.

To a developer, ideally, the act of merging their changes back to main and those changes being deployed to production should feel like one singular atomic action, the faster the better, the less variance the better. You merge, it goes right out. You don’t want it to go out, you better not merge.

The worst of both worlds is when you let devs keep merging diffs, checking items off their todo lists, closing out tasks, for days or weeks. All these changes build up like a snowdrift over a pile of grenades. You aren’t going to find the grenades til you plow into the snowdrift on January 5th, and then you’ll find them with your face. Congrats!

If you want to freeze deploys, freeze merges. Let people work on other things. I assure you, there is plenty of other valuable work to be done.
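One way to make a merge freeze stick is to express it as a required CI status check on main: the check fails during the freeze window, so the merge button stays grey. A minimal sketch, with made-up dates and function names (this is not any particular CI system's API):

```python
# Hypothetical required CI check for a merge freeze. Mark it "required"
# on main, and what gets frozen is merges, not deploys.
# The freeze window below is a made-up example; use your own calendar.
from datetime import date

FREEZE_START = date(2025, 12, 19)
FREEZE_END = date(2026, 1, 2)


def in_freeze(today: date) -> bool:
    """True if `today` falls inside the merge-freeze window."""
    return FREEZE_START <= today <= FREEZE_END


def check() -> int:
    # A nonzero exit status fails the check and blocks the merge.
    return 1 if in_freeze(date.today()) else 0
```

The nice side effect: the freeze is self-documenting and self-expiring, so nobody has to remember to send the "deploys are open again" email on January 2nd.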

Don’t freeze deploys unless your goal is to test deploy freezes

The second thing is a corollary. Don’t actually freeze deploys, unless your SREs and on call folks are bored and sitting around together, going “wouldn’t this be a great opportunity to test for memory leaks and other systemic issues that we don’t know about due to the frequency and regularity of our deploys?”

If that’s you, godspeed! Park that deploy engine and sit on the hood, let’s see what happens!

People always remember the outages and instability that we trigger with our actions. We tend to forget about the outages and instability we trigger with our inaction. But if you’re used to deploying every day, or many times a day: first, good for you. Second, I bet you a bottle of whiskey that something’s gonna break if you go for two weeks without deploying.

I bet you the good shit. Top shelf. 🥃

This one is so easy to mitigate, too. Just run the deploy process every day or two, but don’t ship new code out.
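In pseudocode terms, that mitigation looks something like the following sketch: a scheduled job that redeploys whatever version is already live, so every deploy-time failure mode still gets exercised. The helper names (`get_live_version`, `deploy`) are stand-ins for your own tooling, not a real API:

```python
# Hypothetical scheduled job: rerun the deploy pipeline with the artifact
# that is already in production, so nothing new ships but the pipeline
# itself keeps getting tested during a freeze.


def get_live_version(env: str) -> str:
    # Stand-in: in real life, ask your deploy tracker or orchestrator.
    return "release-2025-12-18"


def deploy(env: str, version: str) -> str:
    # Stand-in: invoke your actual deploy pipeline here.
    return f"deployed {version} to {env}"


def exercise_pipeline(env: str = "production") -> str:
    # Same version in, same version out: no new code, but the deploy
    # machinery, credentials, and hosts all get a fresh workout.
    return deploy(env, get_live_version(env))
```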

Alright. Time for me to go fly to my sister’s house. Happy holidays everyone! May your pagers be silent and your bellies be full, and may no one in your family or friend group mention politics this year!

💜💙💚💛🧡❤️💖
charity

Me and Bubba and Miss Pinky Persnickety

P.S. The title is hyperbole! I was frustrated! I felt like people were intentionally misrepresenting my point and my beliefs, so I leaned into it. Please remember that I grew up on a farm and we ended up eating most of our animals. Possibly I am still adjusting to civilized life. Also, I have two cats and I love them very much and have not eaten either of them yet.

A few other things I’ve written on the topic:


2025 was for AI what 2010 was for cloud (xpost)

The satellite, experimental technology has become the mainstream, foundational tech. (At least in developer tools.) (xposted from new home)

I was at my very first job, Linden Lab, when EC2 and S3 came out in 2006. We were running Second Life out of three datacenters, where we racked and stacked all the servers ourselves. At the time, we were tangling with a slightly embarrassing data problem in that there was no real way for users to delete objects (the Trash folder was just another folder), and by the time we implemented a delete function, our ability to run garbage collection couldn’t keep up with the rate of asset creation. In desperation, we spun up an experimental project to try using S3 as our asset store. Maybe we could make this Amazon’s problem and buy ourselves some time?

Why yes, we could. Other “experimental” projects sprouted up like weeds: rebuilding server images in the cloud, running tests, storing backups, load testing, dev workstations. Everybody had shit they wanted to do that exceeded our supply of datacenter resources.

By 2010, the center of gravity had shifted. Instead of “mainstream engineering” (datacenters) and “experimental” (cloud), there was “mainstream engineering” (cloud) and “legacy, shut it all down” (datacenters).

Why am I talking about the good old days? Because I have a gray beard and I like to stroke it, child. (Rude.)

And also: it was just eight months ago that Fred Hebert and I were delivering the closing keynote at SRECon. The title is “AIOps: Prove It! An Open Letter to Vendors Selling AI for SREs”, which makes it sound like we’re talking to vendors, but we’re not; we’re talking to our fellow SREs, begging them to engage with AI on the grounds that it’s not ALL hype.

We’re saying to a room of professional technological pessimists that AI needs them to engage. That their realism and attention to risk is more important than ever, but in order for their critique to be relevant and accurate and be heard, it has to be grounded in expertise and knowledge. Nobody cares about the person outside taking potshots.

This talk recently came up in conversation, and it made me realize—with a bit of a shock—how far my position has come since then.

That was just eight months ago, and AI still felt like it was somehow separable, or a satellite of tech mainstream. People would gripe about conferences stacking the lineup with AI sessions, and AI getting shoehorned into every keynote.

I get it. I too love to complain about technology, and this is certainly an industry that has seen its share of hype trains: dotcom, cloud, crypto, blockchain, IoT, web3, metaverse, and on and on. I understand why people are cynical—why some are even actively looking for reasons to believe it’s a mirage.

But for me, this year was for AI what 2010 was for the cloud: the year when AI stopped being satellite, experimental tech and started being the mainstream, foundational technology. At least in the world of developer tools.

It doesn’t mean there isn’t a bubble. Of COURSE there’s a fucking bubble. Cloud was a bubble. The internet was a bubble. Every massive new driver of innovation has come with its own frothy hype wave.

But the existence of froth doesn’t disprove the existence of value.

Maybe y’all have already gotten there, and I’m the laggard. 😉 (Hey, it’s an SRE’s job to mind the rear guard.) But I’m here now, and I’m excited. It’s an exciting time to be a builder.


Hello World (xpost from substack)

I recently posted a short note about moving from WordPress to Substack after ten years on WP. A number of people replied, commented, or DM’d me to express their dismay, along the lines of “why are you supporting Nazis?”. A few begged me to reconsider.

So I did. I paused the work I was doing on migration and setup, and I paused the post I was drafting on Substack. I read the LeaveSubstack site, and talked with its author (thank you, Sean 💜). I had a number of conversations with people I consider experts in content creation, and people I consider stakeholders (my coworkers and customers), as well as my own personal Jiminy Cricket, Liz Fong-Jones. I also slept on it.

I’ve decided to stay.

I said I would share my thinking once I made a decision, and it comes down to this: I have a job to do, and I haven’t been doing it.

I have not been doing my job 💔

I’ve gone increasingly dark on social media over the past few years, and while this has been delightful from a personal perspective, I have developed an uncomfortable conviction that I have also been abdicating a core function of my job in doing so.

The world of software is changing—fast. It’s exciting. But it is not enough to have interesting ideas and say them once, or write them down in a single book. You need to be out there mixing it up with the community every day, or at least every week. You need to be experimenting with what language works for people, what lands, what sparks a light in people’s eyes.

You (by which I mean me) also need to be listening more— reading and interacting with other people’s thoughts, volleying back and forth, polishing each other like diamonds.

How many times did we define observability or high cardinality or the sins of aggregation? Cool. How many times have we talked about the ways that AI has made the honeycomb vision technologically realizable for the first time? Uh, less, by an order of thousands.

Write more, engage with mainstream tech

My primary goal is to get back into the mainstream of technical discussion and mix it up a lot more. Unfortunately, to the extent there is a tech mainstream, it still exists on X. I am not ruling out the possibility of returning, but I would strongly prefer not to. I’m going to see if I can do my job by being much more active on LinkedIn and Substack.

My secondary goal is to remove friction and barriers to posting. WordPress just feels so heavyweight. Like I’m trying to craft a web page, not write a quick post. Substack feels more like writing an email. I’ve been trying to make myself post more all year on WP, and it hasn’t happened. I have a lot of shit backed up to talk about, and I think this will help grease the wheels.

There are platforms that are beyond the pale, that exist solely to platform and support Nazis and violent extremists—your Gabs, your Parlers. Substack is very far from being one of those. All of these content platforms exist on some continuum of grey, and governance is hard, hard, hard in an era of mainstreaming right wing extremism.

Substack may not make all the decisions I would make, but I feel like it is a light dove grey, all things considered.

Some mitigations

I have received some tips and done some research on how to minimize the value of my writing to Substack. Here they are.

  • Substack makes money from paid subscriptions, so I don’t accept money. Ever.
  • I am told that if you use email or RSS, it benefits Substack less than if you use the app. RSS feed here.
  • I will set up an auto-poster from Substack to WordPress (at some point… probably whenever I find the time to fix the url rewriter and change domain pointer)

I hope these will allow conscientious objectors to continue to read and engage with my work, but I also understand if not.

A vegan friend of mine once used an especially vivid metaphor to indignantly tell us why no, he could NOT just pick the meat and dairy off his plate and eat the vegetables and grains left behind (they were not cooked together). He said, “If somebody shit on your plate, would you just pick the shit off and keep eating?”

So. If Substack is the shit in your social media plate, and you feel morally obligated to reject anything that has ever so much as touched the domain, I can respect that.

Everyone has to decide which battles are theirs to fight. This one is not mine.

💜💙💚💛🧡❤️💖,
charity.


In Praise of “Normal” Engineers

This article was originally commissioned by Luca Rossi (paywalled) for refactoring.fm, on February 11th, 2025. Luca edited a version of it that emphasized the importance of building "10x engineering teams". It was later picked up by IEEE Spectrum (!!!), who scrapped most of the teams content and published a different, shorter piece on March 13th.

This is my personal edit. It is not exactly identical to either of the versions that have been publicly released to date. It contains a lot of the source material for the talk I gave last week at #LDX3 in London, “In Praise of ‘Normal’ Engineers” (slides), and a couple weeks ago at CraftConf. 


Most of us have encountered a few engineers who seem practically magician-like, a class apart from the rest of us in their ability to reason about complex mental models, leap to non-obvious yet elegant solutions, or emit waves of high quality code at unreal velocity.

I have run into any number of these incredible beings over the course of my career. I think this is what explains the curious durability of the “10x engineer” meme. It may be based on flimsy, shoddy research, and the claims people have made to defend it have often been risible (e.g. “10x engineers have dark backgrounds, are rarely seen doing UI work, are poor mentors and interviewers”), or blatantly double down on stereotypes (“we look for young dudes in hoodies that remind us of Mark Zuckerberg”). But damn if it doesn’t resonate with experience. It just feels true.

The problem is not the idea that there are engineers who are 10x as productive as other engineers. I don’t have a problem with this statement; in fact, that much seems self-evidently true. The problems I do have are twofold.

Measuring productivity is fraught and imperfect

First: how are you measuring productivity? I have a problem with the implication that there is One True Metric of productivity that you can standardize and sort people by. Consider, for a moment, the sheer combinatorial magnitude of skills and experiences at play:

  • Are you working on microprocessors, IoT, database internals, web services, user experience, mobile apps, consulting, embedded systems, cryptography, animation, training models for gen AI… what?
  • Are you using golang, python, COBOL, lisp, perl, React, or brainfuck? What version, which libraries, which frameworks, what data models? What other software and build dependencies must you have mastered?
  • What adjacent skills, market segments, or product subject matter expertise are you drawing upon…design, security, compliance, data visualization, marketing, finance, etc?
  • What stage of development? What scale of usage? What matters most — giving good advice in a consultative capacity, prototyping rapidly to find product-market fit, or writing code that is maintainable and performant over many years of amortized maintenance? Or are you writing for the Mars Rover, or shrinkwrapped software you can never change?

Also: people and their skills and abilities are not static. At one point, I was a pretty good DBRE (I even co-wrote the book on it). Maybe I was even a 10x DB engineer then, but certainly not now. I haven’t debugged a query plan in years.

“10x engineer” makes it sound like 10x productivity is an immutable characteristic of a person. But someone who is a 10x engineer in a particular skill set is still going to have infinitely more areas where they are normal or average (or less). I know a lot of world class engineers, but I’ve never met anyone who is 10x better than everyone else across the board, in every situation.

Engineers don’t own software, teams own software

Second, and even more importantly: So what? It doesn’t matter. Individual engineers don’t own software, teams own software. The smallest unit of software ownership and delivery is the engineering team. It doesn’t matter how fast an individual engineer can write software, what matters is how fast the team can collectively write, test, review, ship, maintain, refactor, extend, architect, and revise the software that they own.

Everyone uses the same software delivery pipeline. If it takes the slowest engineer at your company five hours to ship a single line of code, it’s going to take the fastest engineer at your company five hours to ship a single line of code. The time spent writing code is typically dwarfed by the time spent on every other part of the software development lifecycle.

If you have services or software components that are owned by a single engineer, that person is a single point of failure.

I’m not saying this should never happen. It’s quite normal at startups to have individuals owning software, because the biggest existential risk that you face is not moving fast enough, not finding product market fit, and going out of business. But as you start to grow up as a company, as users start to demand more from you, and you start planning for the survival of the company to extend years into the future…ownership needs to get handed over to a team. Individual engineers get sick, go on vacation, and leave the company, and the business has got to be resilient to that.

If teams own software, then the key job of any engineering leader is to craft high-performing engineering teams. If you must 10x something, 10x this. Build 10x engineering teams.

The best engineering orgs are the ones where normal engineers can do great work

When people talk about world-class engineering orgs, they often have in mind teams that are top-heavy with staff and principal engineers, or recruiting heavily from the ranks of ex-FAANG employees or top universities.

But I would argue that a truly great engineering org is one where you don’t HAVE to be one of the “best” or most pedigreed engineers in the world to get shit done and have a lot of impact on the business.

I think it’s actually the other way around. A truly great engineering organization is one where perfectly normal, workaday software engineers, with decent software engineering skills and an ordinary amount of expertise, can consistently move fast, ship code, respond to users, understand the systems they’ve built, and move the business forward a little bit more, day by day, week by week.

Any asshole can build an org where the most experienced, brilliant engineers in the world can build product and make progress. That is not hard. And putting all the spotlight on individual ability has a way of letting your leaders off the hook for doing their jobs. It is a HUGE competitive advantage if you can build sociotechnical systems where less experienced engineers can convert their effort and energy into product and business momentum.

A truly great engineering org also happens to be one that mints world-class software engineers. But we’re getting ahead of ourselves, here.

Let’s talk about “normal” for a moment

A lot of technical people got really attached to our identities as smart kids. The software industry tends to reflect and reinforce this preoccupation at every turn, from Netflix’s “we look for the top 10% of global talent” to Amazon’s talk about “bar-raising” or Coinbase’s recent claim to “hire the top .1%”. (Seriously, guys? Ok, well, Honeycomb is going to hire only the top .00001%!)

In this essay, I would like to challenge us to set that baggage to the side and think about ourselves as normal people.

It can be humbling to think of ourselves as normal people, but most of us are in fact pretty normal people (albeit with many years of highly specialized practice and experience), and there is nothing wrong with that. Even those of us who are certified geniuses on certain criteria are likely quite normal in other ways — kinesthetic, emotional, spatial, musical, linguistic, etc.

Software engineering both selects for and develops certain types of intelligence, particularly around abstract reasoning, but nobody is born a great software engineer. Great engineers are made, not born. I just don’t think there’s a lot more we can get out of thinking of ourselves as a special class of people, compared to the value we can derive from thinking of ourselves collectively as relatively normal people who have practiced a fairly niche craft for a very long time.

Build sociotechnical systems with “normal people” in mind

When it comes to hiring talent and building teams, yes, absolutely, we should focus on identifying the ways people are exceptional and talented and strong. But when it comes to building sociotechnical systems for software delivery, we should focus on all the ways people are normal.

Normal people have cognitive biases — confirmation bias, recency bias, hindsight bias. We work hard, we care, and we do our best; but we also forget things, get impatient, and zone out. Our eyes are inexorably drawn to the color red (unless we are colorblind). We develop habits and ways of doing things, and resist changing them. When we see the same text block repeatedly, we stop reading it.

We are embodied beings who can get overwhelmed and fatigued. If an alert wakes us up at 3 am, we are much more likely to make mistakes while responding to that alert than if we tried to do the same thing at 3pm. Our emotional state can affect the quality of our work. Our relationships impact our ability to get shit done.

When your systems are designed to be used by normal engineers, all that excess brilliance they have can get poured into the product, instead of being wasted on navigating the system itself.

How do you turn normal engineers into 10x engineering teams?

None of this should be terribly surprising; it’s all well-known wisdom. In order to build the kind of sociotechnical systems for software delivery that enable normal engineers to move fast, learn continuously, and deliver great results as a team, you should:

Shrink the interval between when you write the code and when the code goes live.

Make it as short as possible; the shorter the better. I’ve written and given talks about this many, many times. The shorter the interval, the lower the cognitive carrying costs. The faster you can iterate, the better. The more of your brain can go into the product instead of the process of building it.

One of the most powerful things you can do is have a short, fast enough deploy cycle that you can ship one commit per deploy. I’ve referred to this as the “software engineering death spiral” … when the deploy cycle takes so long that you end up batching together a bunch of engineers’ diffs in every build. The slower it gets, the more you batch up, and the harder it becomes to figure out what happened or roll back. The longer it takes, the more people you need, the higher the coordination costs, and the more slowly everyone moves.

Deploy time is the feedback loop at the heart of the development process. It is almost impossible to overstate the centrality of keeping this short and tight.

Make it easy and fast to roll back or recover from mistakes.

Developers should be able to deploy their own code, figure out if it’s working as intended or not, and if not, roll forward or back swiftly and easily. No muss, no fuss, no thinking involved.

Make it easy to do the right thing and hard to do the wrong thing.

Wrap designers and design thinking into all the touch points your engineers have with production systems. Use your platform engineering team to think about how to empower people to swiftly make changes and self-serve, but also remember that a lot of times people will be engaging with production late at night or when they’re very stressed, tired, and possibly freaking out. Build guard rails. The fastest way to ship a single line of code should also be the easiest way to ship a single line of code.

Invest in instrumentation and observability.

You’ll never know — not really — what the code you wrote does just by reading it. The only way to be sure is by instrumenting your code and watching real users run it in production. Good, friendly sociotechnical systems invest heavily in tools for sense-making.

Being able to visualize your work is what makes engineering abstractions accessible to actual engineers. You shouldn’t have to be a world-class engineer just to debug your own damn code.

Devote engineering cycles to internal tooling and enablement.

If fast, safe deploys with guard rails, instrumentation, and highly parallelized test suites are “everybody’s job”, they will end up being nobody’s job. Engineering productivity isn’t something you can outsource. Managing the interfaces between your software vendors and your own teams is both a science and an art. Making it look easy and intuitive is really hard. It needs an owner.

Build an inclusive culture.

Growth is the norm, growth is the baseline. People do their best work when they feel a sense of belonging. An inclusive culture is one where everyone feels safe to ask questions, explore, and make mistakes; where everyone is held to the same high standard, and given the support and encouragement they need to achieve their goals.

Diverse teams are resilient teams.

Yeah, a team of super-senior engineers who all share a similar background can move incredibly fast, but a monoculture is fragile. Someone gets sick, someone gets pregnant, you start to grow and you need to integrate people from other backgrounds and the whole team can get derailed — fast.

When your teams are used to operating with a mix of genders, racial backgrounds, identities, age ranges, family statuses, geographical locations, skill sets, etc — when this is just table stakes, standard operating procedure — you’re better equipped to roll with it when life happens.

Assemble engineering teams from a range of levels.

The best engineering teams aren’t top-heavy with staff engineers and principal engineers. The best engineering teams are ones where nobody is running on autopilot, banging out a login page for the 300th time; everyone is working on something that challenges them and pushes their boundaries. Everyone is learning, everyone is teaching, everyone is pushing their own boundaries and growing. All the time.

By the way — all of that work you put into making your systems resilient, well-designed, and humane is the same work you would need to do to help onboard new engineers, develop junior talent, or let engineers move between teams.

It gets used and reused. Over and over and over again.

The only meaningful measure of productivity is impact to the business

The only thing that actually matters when it comes to engineering productivity is whether or not you are moving the business materially forward.

Which means…we can’t do this in a vacuum. The most important question is whether or not we are working on the right thing, which is a problem engineering can’t answer without help from product, design, and the rest of the business.

Software engineering isn’t about writing lots of lines of code, it’s about solving business problems using technology.

Senior and intermediate engineers are actually the workhorses of the industry. They move the business forward, step by step, day by day. They get to put their heads down and crank instead of constantly looking around the org and solving coordination problems. If you have to be a staff+ engineer to move the product forward, something is seriously wrong.

Great engineering orgs mint world-class engineers

A great engineering org is one where you don’t HAVE to be one of the best engineers in the world to have a lot of impact. But — rather ironically — great engineering orgs mint world class engineers like nobody’s business.

The best engineering orgs are not the ones with the smartest, most experienced people in the world, they’re the ones where normal software engineers can consistently make progress, deliver value to users, and move the business forward, day after day.

Places where engineers can get shit done and have a lot of impact are a magnet for top performers. Nothing makes engineers happier than building things, solving problems, making progress.

If you’re lucky enough to have world-class engineers in your org, good for you! Your role as a leader is to leverage their brilliance for the good of your customers and your other engineers, without coming to depend on their brilliance. After all, these people don’t belong to you. They may walk out the door at any moment, and that has to be okay.

These people can be phenomenal assets, assuming they can be team players and keep their egos in check. Which is probably why so many tech companies seem to obsess over identifying and hiring them, especially in Silicon Valley.

But companies categorically overindex on finding these people after they’ve already been minted, which ends up reinforcing and replicating all the prejudices and inequities of the world at large. Talent may be evenly distributed across populations, but opportunity is not.

Don’t hire the “best” people. Hire the right people.

We (by which I mean the entire human race) place too much emphasis on individual agency and characteristics, and not enough on the systems that shape us and inform our behaviors.

I feel like a whole slew of issues (candidates self-selecting out of the interview process, diversity of applicants, etc) would be improved simply by shifting the focus of engineering hiring and interviewing away from this inordinate emphasis on hiring the BEST PEOPLE and realigning around the more reasonable and accurate RIGHT PEOPLE.

It’s a competitive advantage to build an environment where people can be hired for their unique strengths, not their lack of weaknesses; where the emphasis is on composing teams rather than hiring the BEST people; where inclusivity is a given both for ethical reasons and because it raises the bar for performance for everyone. Inclusive culture is what actual meritocracy depends on.

This is the kind of place that engineering talent (and good humans) are drawn to like a moth to a flame. It feels good to ship. It feels good to move the business forward. It feels good to sharpen your skills and improve your craft. It’s the kind of place that people go when they want to become world class engineers. And it’s the kind of place where world class engineers want to stick around, to train up the next generation.

<3, charity

 


There Is Only One Key Difference Between Observability 1.0 and 2.0

Originally posted on the Honeycomb blog on November 19th, 2024

We’ve been talking about observability 2.0 a lot lately; what it means for telemetry and instrumentation, its practices and sociotechnical implications, and the dramatically different shape of its cost model. With all of these details swimming about, I’m afraid we’re already starting to lose sight of what matters.

The distinction between observability 1.0 and observability 2.0 is not a laundry list, it’s not marketing speak, and it’s not that complicated or hard to understand. The distinction is a technical one, and it’s actually quite simple:

  1. Observability 1.0 has three pillars and many sources of truth, scattered across disparate tools and formats.
  2. Observability 2.0 has one source of truth: arbitrarily-wide structured log events, from which you can derive all the other data types.

That’s it. That’s what defines each generation, respectively. Everything else is a consequence that flows from this distinction.

Multiple “pillars” are an observability 1.0 phenomenon

We’ve all heard the slogan, “metrics, logs, and traces are the three pillars of observability.” Right?

Well, that’s half true; it’s true of observability 1.0 tools. You might even say that pillars define the observability 1.0 generation. For every request that enters your system, you write logs, increment counters, and maybe trace spans; then you store telemetry in many places. You probably use some subset (or superset) of tools including APM, RUM, unstructured logs, structured logs, infra metrics, tracing tools, profiling tools, product analytics, marketing analytics, dashboards, SLO tools, and more. Under the hood, these are stored in a grab bag of formats: unstructured logs (strings), structured logs, time-series databases, columnar databases, and other proprietary storage systems.

Observability 1.0 tools force you to make a ton of decisions at write time about how you and your team would use the data in the future. They silo off different types of data and different kinds of questions into entirely different tools, as many different tools as you have use cases.

Many pillars, many tools.

An observability 2.0 tool does not have pillars.

Your observability 2.0 tool has one unified source of truth

Your observability 2.0 tool stores the telemetry for each request in one place, in one format: arbitrarily-wide structured log events.

These log events are not fired off willy-nilly as the request executes. They are specifically composed to describe all of the context accompanying a unit of work. Some common patterns include canonical logs, organized around each hop of the request; traces and spans, organized around application logic; or traces emitted as pulses for long-running jobs, queues, CI/CD pipelines, etc.
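The canonical-log pattern can be sketched in a few lines: open one wide event per unit of work, append context as the request executes, and emit it exactly once at the end. This is a hand-rolled illustration, not any particular SDK; the field names and the `emit` callback are hypothetical:

```python
import json
import time

def handle_request(request, emit):
    """Process one request, accumulating context into a single wide event."""
    event = {
        "timestamp": time.time(),
        "service": "checkout",
        "http.method": request["method"],
        "http.route": request["route"],
        "user.id": request["user_id"],          # high-cardinality fields are welcome
        "feature_flags": request.get("flags", []),
    }
    start = time.monotonic()
    try:
        # ...the actual work of the request happens here...
        event["cart.items"] = len(request.get("cart", []))
        event["status_code"] = 200
    except Exception as exc:
        event["status_code"] = 500
        event["error"] = repr(exc)
        raise
    finally:
        event["duration_ms"] = (time.monotonic() - start) * 1000
        emit(json.dumps(event))  # one wide event per unit of work, emitted once

handle_request(
    {"method": "POST", "route": "/checkout", "user_id": "u42", "cart": [1, 2]},
    print,
)
```

The point is that every field lands on the same event, so the relationships between them survive to query time.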

Structuring your data in this way preserves as much context and connective tissue as possible about the work being done. Once your data is gathered up this way, you can:

  • Derive metrics from your log events
  • Visualize them over time, as a trace
  • Zoom into individual requests, zoom out to long-term trends
  • Derive SLOs and aggregates
  • Collect system, application, product, and business telemetry together
  • Slice and dice and explore your data in an open-ended way
  • Swiftly compute outliers and identify correlations
  • Capture and preserve as much high-cardinality data as you want

The beauty of observability 2.0 is that it lets you collect your telemetry and store it—once—in a way that preserves all that rich context and relational data, and make decisions at read time about how you want to query and use the data. Store it once, and use it for everything.
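As a toy illustration of deciding at read time (the event shape here is made up, and a real backend would run this as a columnar query over billions of events, not a Python list):

```python
import math

# Wide events, stored once; aggregates are computed at query time.
events = [
    {"route": "/checkout", "status_code": 200, "duration_ms": 38.0},
    {"route": "/checkout", "status_code": 500, "duration_ms": 912.5},
    {"route": "/login",    "status_code": 200, "duration_ms": 12.1},
    {"route": "/checkout", "status_code": 200, "duration_ms": 45.2},
]

def error_rate(events, route):
    hits = [e for e in events if e["route"] == route]
    return sum(e["status_code"] >= 500 for e in hits) / len(hits)

def p95_latency(events, route):
    durs = sorted(e["duration_ms"] for e in events if e["route"] == route)
    return durs[math.ceil(0.95 * len(durs)) - 1]  # nearest-rank percentile

print(error_rate(events, "/checkout"))   # 1 of 3 checkout requests errored
print(p95_latency(events, "/checkout"))  # the slow, failed request dominates
```

Nothing about "error rate" or "p95" was decided when the events were written; both are just queries over the same stored data.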

Everything else is a consequence of this differentiator

Yeah, there’s a lot more to observability 2.0 than whether your data is stored in one place or many. Of course there is. But everything else is unlocked and enabled by this one core difference.

Here are some of the other aspects of observability 2.0, many of which have gotten picked up and discussed elsewhere in recent weeks:

  • Observability 1.0 is how you operate your code; observability 2.0 is about how you develop your code
  • Observability 1.0 has historically been infra-centric, and often makes do with logs and metrics that software already emits, or that can be extracted with third-party tools
  • Observability 2.0 is oriented around your application code, the software at the core of your business
  • Observability 1.0 is traditionally focused on MTTR, MTTD, errors, crashes, and downtime
  • Observability 2.0 includes those things, but it’s about holistically understanding your software and your users—not just when things are broken
  • To control observability 1.0 costs, you typically focus on limiting the cardinality of your data, reducing your log levels, and reducing the cost multiplier by eliminating tools.
  • To control observability 2.0 costs, you typically reach for tail-based or head-based sampling
  • Observability 2.0 complements and supercharges the effectiveness of other modern development best practices like feature flags, progressive deployments, and chaos engineering.

The reason observability 2.0 is so much more effective at enabling and accelerating the entire software development lifecycle is that the single source of truth and wide, dense, cardinality-rich data allow you to do things you can’t in an observability 1.0 world: slice and dice on arbitrary high-cardinality dimensions like build_id, feature flags, user_id, etc. to see precisely what is happening as people use your code in production.
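Here is a toy in-memory sketch of what slicing on a high-cardinality dimension buys you: grouping error rate by build_id to spot a bad deploy. The event shape and field values are hypothetical:

```python
from collections import defaultdict

events = [
    {"build_id": "a1f3", "user_id": "u1", "status_code": 200},
    {"build_id": "a1f3", "user_id": "u2", "status_code": 200},
    {"build_id": "b7c9", "user_id": "u3", "status_code": 500},
    {"build_id": "b7c9", "user_id": "u4", "status_code": 500},
    {"build_id": "b7c9", "user_id": "u5", "status_code": 200},
]

def group_error_rate(events, dimension):
    """Error rate broken down by any dimension -- chosen at read time."""
    totals, errors = defaultdict(int), defaultdict(int)
    for e in events:
        key = e[dimension]
        totals[key] += 1
        errors[key] += e["status_code"] >= 500
    return {k: errors[k] / totals[k] for k in totals}

print(group_error_rate(events, "build_id"))  # build b7c9 stands out
```

Because `dimension` is just a parameter, the same query works for user_id, feature flags, or any field you bothered to capture — no pre-aggregation required.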

In the same way that whether a database is a document store, a relational database, or a columnar database has an enormous impact on the kinds of workloads it can do, what it excels at and which teams end up using it, the difference between observability 1.0 and 2.0 is a technical distinction that has enduring consequences for how people use it.

These are not hard boundaries; data is data, telemetry is telemetry, and there will always be a certain amount of overlap. You can adopt some of these observability 2.0-ish behaviors (like feature flags) using 1.0 tools, to some extent—and you should try!—but the best you can do with metrics-backed tools will always be percentile aggregates and random exemplars. You need precision tools to unlock the full potential of observability 2.0.

Observability 1.0 is a dinner knife; 2.0 is a scalpel.

Why now? What changed?

If observability 2.0 is so much better, faster, cheaper, simpler, and more powerful, then why has it taken this long to emerge on the landscape?

Observability 2.0-shaped tools (high cardinality, high dimensionality, explorable interfaces, etc.) have actually been de rigueur on the business side of the house for years. You can’t run a business without them! It was close to 20 years ago that columnar stores like Vertica came on the scene for data warehouses. But those tools weren’t built for software engineers, and they were prohibitively expensive at production scale.

FAANG companies have also been using tools like this internally for a very long time. Facebook’s Scuba was famously the inspiration for Honeycomb—however, Scuba ran on giant RAM disks as recently as 2015, which means it was quite an expensive service to run. The falling cost of storage, bandwidth, and compute has made these technologies viable as commodity SaaS platforms, at the same time as the skyrocketing complexity of systems due to microservices and decoupled architecture patterns has made them mandatory.

Three big reasons the rise of observability 2.0 is inevitable

Number one: our systems are exploding in complexity along with power and capabilities. The idea that developing your code and operating your code are two different practices that can be done by two different people is no longer tenable. You can’t operate your code as a black box, you have to instrument it. You also can’t predict how things are going to behave or break, and one of the defining characteristics of observability 1.0 was that you had to make those predictions up front, at write time.

Number two: the cost model of observability 1.0 is brutally unsustainable. Instead of paying to store your data once, you pay to store it again and again and again, in as many different pillars or formats or tools as you have use cases. The post-ZIRP era has cast a harsh focus on a lot of teams’ observability bills—not only the outrageous costs, but also the reality that as costs go up, the value you get out of them is going down.

Yet the cost multiplier angle is in some ways the easiest to fix: you bite the bullet and sacrifice some of your tools. Cardinality is even more costly, and harder to mitigate. You go to bed Friday night with a $150k Datadog bill and wake up Monday morning with a million dollar bill, without changing a single line of code. Many observability engineering teams spend an outright majority of their time just trying to manage the cardinality threshold—enough detail to understand their systems and solve users’ problems, not so much detail that they go bankrupt.
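To make the weekend bill spike concrete: in a metrics backend, every distinct combination of tag values becomes its own time series, so cost scales with the *product* of tag cardinalities. A back-of-the-envelope sketch (tag names and counts are made up):

```python
from math import prod

def series_count(tag_cardinalities):
    # each unique combination of tag values is a separate time series
    return prod(tag_cardinalities.values())

before = {"endpoint": 50, "status": 5, "region": 4}
after = dict(before, user_id=100_000)  # one high-cardinality tag slips in

print(f"{series_count(before):,}")  # 1,000 series
print(f"{series_count(after):,}")   # 100,000,000 series
```

One well-intentioned `user_id` tag multiplies the series count by five orders of magnitude, without a single other line of code changing — which is exactly the Friday-to-Monday failure mode described above.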

And that is the most expensive part of all: engineering cycles. The cost of the time engineers spend laboring below the value line—trying to understand their code, their telemetry, their user behaviors—is astronomical. Poor observability is the dark matter of engineering teams. It’s why everything we do feels so incredibly, grindingly slow, for no apparent reason. Good observability empowers teams to ship swiftly, consistently, and with confidence.

Number three: a critical mass of developers have seen what observability 2.0 can do. Once you’ve tried developing with observability 2.0, you can’t go back. That was what drove Christine and me to start Honeycomb, after we experienced this at Facebook. It’s hard to describe the difference in words, but once you’ve built software with fast feedback loops and real-time, interactive visibility into what your code is doing, you simply won’t go back.


It’s not just Honeycomb; observability 2.0 tools are going mainstream

We’re starting to see a wave of early startups building tools based on these principles. You’re seeing places like Shopify build tools in-house using something like ClickHouse as a backing store. DuckDB is now available in the open-source realm. I expect to see a blossoming of composable solutions in the next year or two, in the vein of ELK stacks for o11y 2.0.

Jeremy Morrell recently published the comprehensive guide to observability 2.0 instrumentation, and it includes a vendor-neutral overview of your options in the space.

There are still valid reasons to go with a 1.0 vendor. Those tools are more mature, fully featured, and most importantly, they have a more familiar look and feel to engineers who have been working with metrics and logs their whole career. But engineers who have tried observability 2.0 are rarely willing to go back.

Beware observability 2.0 marketing claims

You do have to be a little bit wary here. There are lots of observability 1.0 vendors who talk about having a “unified observability platform” or having all your data in one place. But what they actually mean is that you can pay for all your tools in one unified bill, or present all the different data sources in one unified visualization.

The best of these vendors have built a bunch of elaborate bridges between their different tools and storage systems, so you can predefine connection points between e.g. a particular metric and your logging tool or your tracing tool. This is a massive improvement over having no connection points between datasets, no doubt. But a unified presentation layer is not the same thing as a unified data source.

So if you’re trying to clear a path through all the sales collateral and marketing technobabble, you only need to ask one question: how many times is your data going to be stored?

Is there one source of truth, or many?


Generative AI is not going to build your engineering team for you

Originally posted on the Stack Overflow blog on June 10th, 2024

When I was 19 years old, I dropped out of college and moved to San Francisco. I had a job offer in hand to be a Unix sysadmin for Taos Consulting. However, before my first day of work I was lured away to a startup in the city, where I worked as a software engineer on mail subsystems.

I never questioned whether or not I could find work. Jobs were plentiful, and more importantly, hiring standards were very low. If you knew how to sling HTML or find your way around a command line, chances were you could find someone to pay you.

Was I some kind of genius, born with my hands on a computer keyboard? Assuredly not. I was homeschooled in the backwoods of Idaho. I didn’t touch a computer until I was sixteen and in college. I escaped to university on a classical performance piano scholarship, which I later traded in for a peripatetic series of nontechnical majors: classical Latin and Greek, musical theory, philosophy. Everything I knew about computers I learned on the job, doing sysadmin work for the university and CS departments.

In retrospect, I was so lucky to enter the industry when I did. It makes me blanch to think of what would have happened if I had come along a few years later. Every one of the ladders my friends and I took into the industry has long since vanished.

The software industry is growing up

To some extent, this is just what happens as an industry matures. The early days of any field are something of a Wild West, where the stakes are low, regulation nonexistent, and standards nascent. If you look at the early history of other industries—medicine, cinema, radio—the similarities are striking.

There is a magical moment with any young technology where the boundaries between roles are porous and opportunity can be seized by anyone who is motivated, curious, and willing to work their asses off.

It never lasts. It can’t; it shouldn’t. The amount of prerequisite knowledge and experience you must have before you can enter the industry swells precipitously. The stakes rise, the magnitude of the mission increases, the cost of mistakes soars. We develop certifications, trainings, standards, legal rites. We wrangle over whether or not software engineers are really engineers.

Software is an apprenticeship industry

Nowadays, you wouldn’t want a teenaged dropout like me to roll out of junior year and onto your pager rotation. The prerequisite knowledge you need to enter the industry has grown, the pace is faster, and the stakes are much higher, so you can no longer learn literally everything on the job, as I once did.

However, it’s not like you can learn everything you need to know at college either. A CS degree typically prepares you better for a life of computing research than life as a workaday software engineer. A more practical path into the industry may be a good coding bootcamp, with its emphasis on problem solving and learning a modern toolkit. In either case, you don’t so much learn “how to do the job” as you do “learn enough of the basics to understand and use the tools you need to use to learn the job.”

Software is an apprenticeship industry. You can’t learn to be a software engineer by reading books. You can only learn by doing…and doing, and doing, and doing some more. No matter what your education consists of, most learning happens on the job—period. And it never ends! Learning and teaching are lifelong practices; they have to be, because the industry changes so fast.

It takes a solid seven-plus years to forge a competent software engineer. (Or as most job ladders would call it, a “senior software engineer”.) That’s many years of writing, reviewing, and deploying code every day, on a team alongside more experienced engineers. That’s just how long it seems to take.

What does it mean to be a “senior engineer”?

Here is where I often get some very indignant pushback to my timelines, e.g.:

“Seven years?! Pfft, it took me two years!”

“I was promoted to Senior Software Engineer in less than five years!”

Good for you. True, there is nothing magic about seven years. But it takes time and experience to mature into an experienced engineer, the kind who can anchor a team. More than that, it takes practice.

I think we have come to use “Senior Software Engineer” as shorthand for engineers who can ship code and be a net positive in terms of productivity, and I think that’s a huge mistake. It implies that less senior engineers must be a net negative in terms of productivity, which is untrue. And it elides the real nature of the work of software engineering, of which writing code is only a small part.

To me, being a senior engineer is not primarily a function of your ability to write code. It has far more to do with your ability to understand, maintain, explain, and manage a large body of software in production over time, as well as the ability to translate business needs into technical implementation. So much of the work is around crafting and curating these large, complex sociotechnical systems, and code is just one representation of these systems.

What does it mean to be a senior engineer? It means you have learned how to learn, first and foremost, and how to teach; how to hold these models in your head and reason about them, and how to maintain, extend, and operate these systems over time. It means you have good judgment, and instincts you can trust.

Which brings us to the matter of AI.

We need to stop cannibalizing our own future

It is really, really tough to get your first role as an engineer. I didn’t realize how hard it was until I watched my little sister (new grad, terrific grades, some hands on experience, fiendishly hard worker) struggle for nearly two years to land a real job in her field. That was a few years ago; anecdotally, it seems to have gotten even harder since then.

This past year, I have read a steady drip of articles about entry-level jobs in various industries being replaced by AI. Some of which absolutely have merit. Any job that consists of drudgery such as converting a document from one format to another, reading and summarizing a bunch of text, or replacing one set of icons with another, seems pretty obviously vulnerable. This doesn’t feel all that revolutionary to me, it’s just extending the existing boom in automation to cover textual material as well as mathy stuff.

Recently, however, a number of execs and so-called “thought leaders” in tech seem to have genuinely convinced themselves that generative AI is on the verge of replacing all the work done by junior engineers. I have read so many articles about how junior engineering work is being automated out of existence, or that the need for junior engineers is shriveling up. It has officially driven me bonkers.

All of this bespeaks a deep misunderstanding about what engineers actually do. By not hiring and training up junior engineers, we are cannibalizing our own future. We need to stop doing that.

Writing code is the easy part

People act like writing code is the hard part of software. It is not. It never has been, it never will be. Writing code is the easiest part of software engineering, and it’s getting easier by the day. The hard parts are what you do with that code—operating it, understanding it, extending it, and governing it over its entire lifecycle.

A junior engineer begins by learning how to write and debug lines, functions, and snippets of code. As you practice and progress towards being a senior engineer, you learn to compose systems out of software, and guide systems through waves of change and transformation.

Sociotechnical systems consist of software, tools, and people; understanding them requires familiarity with the interplay between software, users, production, infrastructure, and continuous changes over time. These systems are fantastically complex and subject to chaos, nondeterminism and emergent behaviors. If anyone claims to understand the system they are developing and operating, the system is either exceptionally small or (more likely) they don’t know enough to know what they don’t know. Code is easy, in other words, but systems are hard.

The present wave of generative AI tools has done a lot to help us generate lots of code, very fast. The easy parts are becoming even easier, at a truly remarkable pace. But it has not done a thing to aid in the work of managing, understanding, or operating that code. If anything, it has only made the hard jobs harder.

It’s easy to generate code, and hard to generate good code

If you read a lot of breathless think pieces, you may have a mental image of software engineers merrily crafting prompts for ChatGPT, or using Copilot to generate reams of code, then committing whatever emerges to GitHub and walking away. That does not resemble our reality.

The right way to think about tools like Copilot is more like a really fancy autocomplete or copy-paste function, or maybe like the unholy love child of Stack Overflow search results plus Google’s “I’m Feeling Lucky”. You roll the dice, every time.

These tools are at their best when there’s already a parallel in the file, and you want to just copy-paste the thing with slight modifications. Or when you’re writing tests and you have a giant block of fairly repetitive YAML, and it repeats the pattern while inserting the right column and field names, like an automatic template.

However, you cannot trust generated code. I can’t emphasize this enough. AI-generated code always looks quite plausible, but even when it kind of “works”, it’s rarely congruent with your wants and needs. It will happily generate code that doesn’t parse or compile. It will make up variables, method names, function calls; it will hallucinate fields that don’t exist. Generated code will not follow your coding practices or conventions. It is not going to refactor or come up with intelligent abstractions for you. The more important, difficult or meaningful a piece of code is, the less likely you are to generate a usable artifact using AI.

You may save time by not having to type the code in from scratch, but you will need to step through the output line by line, revising as you go, before you can commit your code, let alone ship it to production. In many cases this will take as much or more time as it would take to simply write the code—especially these days, now that autocomplete has gotten so clever and sophisticated. It can be a LOT of work to bring AI-generated code into compliance and coherence with the rest of your codebase. It isn’t always worth the effort, quite frankly.

Generating code that can compile, execute, and pass a test suite isn’t especially hard; the hard part is crafting a code base that many people, teams, and successive generations of teams can navigate, mutate, and reason about for years to come.

How working engineers really use generative AI

So that’s the TLDR: you can generate a lot of code, really fast, but you can’t trust what comes out. At all. However, there are some use cases where generative AI consistently shines.

For example, it’s often easier to ask ChatGPT to generate example code for unfamiliar APIs than to read the API docs—the corpus was trained on repositories where the APIs are being used for real-life workloads, after all.

Generative AI is also pretty good at producing code that is annoying or tedious to write, yet tightly scoped and easy to explain. The more predictable a scenario is, the better these tools are at writing the code for you. If what you need is effectively copy-paste with a template—any time you could generate the code you want using sed/awk or vi macros—generative AI is quite good at this.

It’s also very good at writing little functions for you to do things in unfamiliar languages or scenarios. If you have a snippet of Python code and you want the same thing in Java, but you don’t know Java, generative AI has got your back.

Again, remember, the odds are 50/50 that the result is completely made up. You always have to assume the results are incorrect until you can verify them by hand. But these tools can absolutely accelerate your work in countless ways.

Generative AI is a little bit like a junior engineer

One of the engineers I work with, Kent Quirk, describes generative AI as “an excitable junior engineer who types really fast”. I love that quote—it leaves an indelible mental image.

Generative AI is like a junior engineer in that you can’t roll their code off into production. You are responsible for it—legally, ethically, and practically. You still have to take the time to understand it, test it, instrument it, retrofit it stylistically and thematically to fit the rest of your code base, and ensure your teammates can understand and maintain it as well.

The analogy is a decent one, actually, but only if your code is disposable and self-contained, i.e. not meant to be integrated into a larger body of work, or to survive and be read or modified by others.

And hey—there are corners of the industry like this, where most of the code is write-only, throwaway code. There are agencies that spin out dozens of disposable apps per year, each written for a particular launch or marketing event and then left to wither on the vine. But that is not most software. Disposable code is rare; code that needs to work over the long term is the norm. Even when we think a piece of code will be disposable, we are often (urf) wrong.

But generative AI is not a member of your team

In that particular sense—generating code that you know is untrustworthy—GenAI is a bit like a junior engineer. But in every other way, the analogy fails. Because adding a person who writes code to your team is nothing like autogenerating code. That code could have come from anywhere—Stack Overflow, Copilot, whatever. You don’t know, and it doesn’t really matter. There’s no feedback loop, no person on the other end trying iteratively to learn and improve, and no impact to your team vibes or culture.

To state the supremely obvious: giving code review feedback to a junior engineer is not like editing generated code. Your effort is worth more when it is invested into someone else’s apprenticeship. It’s an opportunity to pass on the lessons you’ve learned in your own career. Even just the act of framing your feedback to explain and convey your message forces you to think through the problem in a more rigorous way, and has a way of helping you understand the material more deeply.

And adding a junior engineer to your team will immediately change team dynamics. It creates an environment where asking questions is normalized and encouraged, where teaching as well as learning is a constant. We’ll talk more about team dynamics in a moment.

The time you invest into helping a junior engineer level up can pay off remarkably quickly. Time flies. ☺️ When it comes to hiring, we tend to valorize senior engineers almost as much as we underestimate junior engineers. Neither stereotype is helpful.

We underestimate the cost of hiring seniors, and overestimate the cost of hiring juniors

People seem to think that once you hire a senior engineer, you can drop them onto a team and they will be immediately productive, while hiring a junior engineer will be a tax on team performance forever. Neither is true. Honestly, most of the work that most teams have to do is not that difficult, once it’s been broken down into its constituent parts. There’s plenty of room for lower level engineers to execute and flourish.

The grossly simplified perspective of your accountant goes something like this. “Why should we pay $100k for a junior engineer to slow things down, when we could pay $200k for a senior engineer to speed things up?” It makes no sense!

But you know and I know—every engineer who is paying attention should know—that’s not how engineering works. This is an apprenticeship industry, and productivity is defined by the output and carrying capacity of each team, not each person.

There are lots of ways a person can contribute to the overall velocity of a team, just like there are lots of ways a person can sap the energy out of a team or add friction and drag to everyone around them. These do not always correlate with the person’s level (at least not in the direction people tend to assume), and writing code is only one way.

Furthermore, every engineer you hire requires ramp time and investment before they can contribute. Hiring and training new engineers is a costly endeavor, no matter what level they are. It will take any senior engineer time to build up their mental model of the system, familiarize themselves with the tools and technology, and ramp up to speed. How long? It depends on how clean and organized the codebase is, past experience with your tools and technologies, how good you are at onboarding new engineers, and more, but likely around 6-9 months. They probably won’t reach cruising altitude for about a year.

Yes, the ramp will be longer for a junior engineer, and yes, it will require more investment from the team. But it’s not indefinite. Your junior engineer should be a net positive within roughly the same time frame, six months to a year, and they develop far more rapidly than more senior contributors. (Don’t forget, their contributions may vastly exceed the code they personally write.)

You do not have to be a senior engineer to add value

In terms of writing and shipping features, some of the most productive engineers I’ve ever known have been intermediate engineers. Not yet bogged down with all the meetings and curating and mentoring and advising and architecture, their calendars not yet pockmarked with interruptions, they can just build stuff. You see them put their headphones on first thing in the morning, write code all day, and cruise out the door in the evening having made incredible progress.

Intermediate engineers sit in this lovely, temporary state where they have gotten good enough at programming to be very productive, but they are still learning how to build and care for systems. All they do is write code, reams and reams of code.

And they’re energized…engaged. They’re having fun! They aren’t bored with writing a web form or a login page for the 1000th time. Everything is new, interesting, and exciting, which typically means they will do a better job, especially under the light direction of someone more experienced. Having intermediate engineers on a team is amazing. The only way you get them is by hiring junior engineers.

Having junior and intermediate engineers on a team is a shockingly good inoculation against overengineering and premature complexity. They don’t yet know enough about a problem to imagine all the infinite edge cases that need to be solved for. They help keep things simple, which is one of the hardest things to do.

The long term arguments for hiring junior engineers

If you ask, nearly everybody will wholeheartedly agree that hiring junior engineers is a good thing…and someone else should do it. This is because the long-term arguments for hiring junior engineers are compelling and fairly well understood.

  1. We need more senior engineers as an industry
  2. Somebody has to train them
  3. Junior engineers are cheaper
  4. They may add some much-needed diversity
  5. They are often very loyal to companies who invest in training them, and will stick around for years instead of job hopping
  6. Did we already mention that somebody needs to do it?

But long-term thinking is not a thing that companies, or capitalism in general, are typically great at. Framed this way, it makes it sound like you hire junior engineers as a selfless act of public service, at great cost to yourself. Companies are much more likely to want to externalize costs like those, which is how we got to where we are now.

The short term arguments for hiring junior engineers

However, there are at least as many arguments to be made for hiring junior engineers in the short term—selfish, hard-nosed, profitable reasons for why it benefits the team and the company to do so. You just have to shift your perspective slightly, from individuals to teams, to bring them into focus.

Let’s start here: hiring engineers is not a process of “picking the best person for the job”. Hiring engineers is about composing teams. The smallest unit of software ownership is not the individual, it’s the team. Only teams can own, build, and maintain a corpus of software. It is inherently a collaborative, cooperative activity.

If hiring engineers were about picking the “best people”, it would make sense to hire the most senior, experienced individual you can get for the money you have, because we are using “senior” and “experienced” as a proxy for “productivity”. (Questionable, but let’s not nitpick.) But the productivity of each individual is not what we should be optimizing for. The productivity of the team is all that matters.

And the best teams are always the ones with a diversity of strengths, perspectives, and levels of expertise. A monoculture can be spectacularly successful in the short term—it may even outperform a diverse team. But they do not scale well, and they do not adapt to unfamiliar challenges gracefully. The longer you wait to diversify, the harder it will be.

We need to hire junior engineers, and not just once, but consistently. We need to keep feeding the funnel from the bottom up. Junior engineers only stay junior for a couple of years before becoming intermediate engineers, and intermediate engineers turn into senior engineers. Super-senior engineers are not actually the best people to mentor junior engineers; the most effective mentor is usually someone just one level ahead, who vividly remembers what it was like in your shoes.

A healthy, high-performing team has a range of levels

A healthy team is an ecosystem. You wouldn’t staff a product engineering team with six DB experts and one mobile developer. Nor should you staff it with six staff+ engineers and one junior developer. A good team is composed of a range of skills and levels.

Have you ever been on a team packed exclusively with staff or principal engineers? It is not fun. That is not a high-functioning team. There is only so much high-level architecture and planning work to go around, there are only so many big decisions that need to be made. These engineers spend most of their time doing work that feels boring and repetitive, so they tend to over-engineer solutions and/or cut corners—sometimes at the same time. They compete for the “fun” stuff and find reasons to pick technical fights with each other. They chronically under-document and under-invest in the work that makes systems simple and tractable.

Teams that only have intermediate engineers (or beginners, or seniors, or whatever) will have different pathologies, but similar problems with contention and blind spots. The work itself has a wide range in complexity and difficulty—from simple, tightly scoped functions to tough, high-stakes architecture decisions. It makes sense for the people doing the work to occupy a similar range.

The best teams are ones where no one is bored, because every single person is working on something that challenges them and pushes their boundaries. The only way you can get this is by having a range of skill levels on the team.

The bottleneck we face is hiring, not training

The bottleneck we face now is not our ability to train up new junior engineers and give them skills. Nor is it about juniors learning to hustle harder; I see a lot of solid, well-meaning advice on this topic, but it’s not going to solve the problem. The bottleneck is giving them their first jobs. The bottleneck consists of companies who see them as a cost to externalize, not an investment in their—the company’s—future.

After their first job, an engineer can usually find work. But getting that first job, from what I can see, is murder. It is all but impossible—if you didn’t graduate from a top college, and you aren’t entering the feeder system of Big Tech, then it’s a roll of the dice, a question of luck or who has the best connections. It was rough before the chimera of “Generative AI can replace junior engineers” rose up from the swamp. And now…oof.

Where would you be, if you hadn’t gotten into tech when you did?

I know where I would be, and it is not here.

The internet loves to make fun of Boomers, the generation that famously coasted to college, home ownership, and retirement, then pulled the ladder up after them while mocking younger people as snowflakes. “Ok, Boomer” may be here to stay, but can we try to keep “Ok, Staff Engineer” from becoming a thing?

Nobody thinks we need fewer senior engineers

Lots of people seem to think we don’t need junior engineers, but nobody is arguing that we need fewer senior engineers, or will need fewer senior engineers in the foreseeable future.

I think it’s safe to assume that anything deterministic and automatable will eventually be automated. Software engineering is no different—we are ground zero! Of course we’re always looking for ways to automate and improve efficiency, as we should be.

But large software systems are unpredictable and nondeterministic, with emergent behaviors. The mere existence of users injects chaos into the system. Components can be automated, but complexity can only be managed.

Even if systems could be fully automated and managed by AI, the fact that we cannot understand how AI makes decisions is a huge, possibly insurmountable problem. Running your business on a system that humans can’t debug or understand seems like a risk so existential that no security, legal or finance team would ever sign off on it. Maybe some version of this future will come to pass, but it’s hard to see it from here. I would not bet my career or my company on it happening.

In the meantime, we still need more senior engineers. The only way to grow them is by fixing the funnel.

Should every company hire junior engineers?

No. You need to be able to set them up for success. Some factors that disqualify you from hiring junior engineers:

  • You have less than two years of runway
  • Your team is constantly in firefighting mode, or you have no slack in your system
  • You have no experienced managers, or you have bad managers, or no managers at all
  • You have no product roadmap
  • Nobody on your team has any interest in being their mentor or point person

The only thing worse than never hiring any junior engineers is hiring them into an awful experience where they can’t learn anything. (I wouldn’t set the bar quite as high as Cindy does in this article; while I understand where she’s coming from, it is so much easier to land your second job than your first job that I think most junior engineers would frankly choose a crappy first job over none at all.)

Being a fully distributed company isn’t a complete dealbreaker, but it does make things even harder. I would counsel junior engineers to seek out office jobs if at all possible. You learn so much faster when you can soak up casual conversations and technical chatter, and you lose that working from home. If you are a remote employer, know that you will need to work harder to compensate for this. I suggest connecting with others who have done this successfully (they exist!) for advice.

I also advise companies not to start by hiring a single junior engineer. If you’re going to hire one, hire two or three. Give them a cohort of peers, so it’s a little less intimidating and isolating.

Nobody is coming to fix our problems for us

I have come to believe that the only way this will ever change is if engineers and engineering managers across our industry take up this fight and make it personal.

Most of the places I know that do have a program for hiring and training entry level engineers, have it only because an engineer decided to fight for it. Engineers—sometimes engineering managers—were the ones who made the case and pushed for resources, then designed the program, interviewed and hired the junior engineers, and set them up with mentors. This is not an exotic project, it is well within the capabilities of most motivated, experienced engineers (and good for your career as well).

Finance isn’t going to lobby for this. Execs aren’t likely to step in. The more a person’s role inclines them to treat engineers like fungible resources, the less likely they are to understand why this matters.

AI is not coming to solve all our problems and write all our code for us—and even if it were, it wouldn’t matter. Writing code is but a sliver of what professional software engineers do, and arguably the easiest part. Only we have the context and the credibility to drive the changes we know form the bedrock of great teams and engineering excellence.

Great teams are how great engineers get made. Nobody knows this better than engineers and EMs. It’s time for us to make the case, and make it happen.

The Cost Crisis in Observability Tooling

Originally posted on the Honeycomb blog on January 24th, 2024

The cost of services is on everybody’s mind right now, with interest rates rising, economic growth slowing, and organizational budgets increasingly feeling the pinch. But I hear a special edge in people’s voices when it comes to their observability bill, and I don’t think it’s just about the cost of goods sold. I think it’s because people are beginning to correctly intuit that the value they get out of their tooling has become radically decoupled from the price they are paying.

In the happiest cases, the price you pay for your tools is “merely” rising at a rate several times faster than the value you get out of them. But that’s actually the best case scenario. For an alarming number of people, the value they get actually decreases as their bill goes up.

Observability 1.0 and the cost multiplier effect

Are you familiar with this chestnut?

“Observability has three pillars: metrics, logs, and traces.”

This isn’t exactly true, but it’s definitely true of a particular generation of tools—one might even say definitionally true of a particular generation of tools. Let’s call it “observability 1.0.”

From an evolutionary perspective, you can see how we got here. Everybody has logs… so we spin up a service for log aggregation. But logs are expensive and everybody wants dashboards… so we buy a metrics tool. Software engineers want to instrument their applications… so we buy an APM tool. We start unbundling the monolith into microservices, and pretty soon we can’t understand anything without traces… so we buy a tracing tool. The front-end engineers point out that they need sessions and browser data… so we buy a RUM tool. On and on it goes.

Logs, metrics, traces, APM, RUM. You’re now paying to store telemetry five different ways, in five different places, for every single request. And a 5x multiplier is on the modest side of the spectrum, given how many companies pay for multiple overlapping tools in the same category. You may also be collecting:

  • Profiling data
  • Product analytics
  • Business intelligence data
  • Database monitoring/query profiling tools
  • Mobile app telemetry
  • Behavioral analytics
  • Crash reporting
  • Language-specific profiling data
  • Stack traces
  • CloudWatch or hosting provider metrics
  • …and so on.

So, how many times are you paying to store data about your user requests? What’s your multiplier? (If you have one consolidated vendor bill, this may require looking at your itemized bill.)

There are many types of tools, each gathering slightly different data for a slightly different use case, but under the hood there are really only three basic data types: metrics, unstructured logs, and structured logs. Each of these has its own distinctive trade-offs when it comes to how much it costs and how much value you can get out of it.

Metrics

Metrics are the great-granddaddy of telemetry formats; tiny, fast, and cheap. A “metric” consists of a single number, often with tags appended. All of the context of the request gets discarded at write time; each individual metric is emitted separately. This means you can never correlate one metric with another from the same request, or select all the metrics for a given request ID, user, or app ID, or ask arbitrary new questions about your metrics data.
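
To make that write-time destruction concrete, here is a minimal sketch in Python (the `emit_metric` helper, tag names, and request handler are all invented for illustration, not any vendor’s API): each metric leaves the process as an isolated number plus tags, with no request ID to join on later.

```python
import time

metrics_buffer = []  # stand-in for a metrics backend

def emit_metric(name, value, tags=None):
    """Ship a single number with optional tags -- and nothing else."""
    metrics_buffer.append({"name": name, "value": value, "tags": tags or {}})

def handle_request(request_id, user_id, path):
    start = time.monotonic()
    status = 503 if path == "/export" else 200  # pretend outcome

    # Each metric is emitted separately at write time. The request_id and
    # user_id are deliberately NOT attached (cardinality limits forbid it),
    # so these data points can never be re-joined per request afterward.
    emit_metric("api_requests", 1, tags={"path": path})
    emit_metric("errors", 1 if status >= 500 else 0, tags={"status": str(status)})
    emit_metric("latency_ms", (time.monotonic() - start) * 1000)

handle_request("req-1", "user-42", "/export")

# No stored metric carries the request ID: the join key is gone forever.
assert all("request_id" not in m["tags"] for m in metrics_buffer)
```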

Metrics-based tools include vendors like Datadog and open-source projects like Prometheus. RUM tools are built on top of metrics to understand browser user sessions; APM tools are built on top of metrics to understand application performance.

When you set up a metrics tool, it generally comes prepopulated with a bunch of basic metrics, but the useful ones are typically the custom metrics you emit from your application.

Your metrics bill is usually dominated by the cost of these custom metrics. At minimum, your bill goes up linearly with the number of custom metrics you create. Which is unfortunate, because to restrain your bill from unbounded growth, you have to regularly audit your metrics, do your best to guess which ones are going to be useful in the future, and prune any you think you can afford to go without. Even in the hands of experts, these tools require significant oversight.

Linear cost growth is the goal, but it’s rarely achieved. The cost of each metric varies wildly depending on how you construct it, what the values are, how often it gets hit, etc. I’ve seen a single custom metric cost $30k per month. You probably have dozens of custom metrics per service, and it’s almost impossible to tell how much each of them costs you. Metrics bills tend to be incredibly opaque (possibly by design).

Nobody can understand their software or their systems with a metrics tool alone, because the metric is extremely limited in what it can do. No context, no cardinality, no strings… only basic static dashboards. For richer data, we must turn to logs.

Unstructured logs

You can understand much more about your code with logs than you can with metrics. Logs are typically emitted multiple times throughout the execution of the request, with one or a small number of nouns per log line, plus the request ID. Unstructured logs are still the default, although this is slowly changing.

The cost of unstructured logs is driven by a few things:

  • Write amplification. If you want to capture lots of rich context about the request, you need to emit a lot of log lines. If you are printing out just 10 log lines per request, per service, and you have half a dozen services, that’s 60 log events for every request.
  • Noisiness. It’s extremely easy to accidentally blow up your log footprint yet add no value—e.g., by putting a print statement inside a loop instead of outside the loop. Here, the usefulness of the data goes down as the bill shoots up.
  • Constraints on physical resources. Due to the write amplification of log lines per request, it’s often physically impossible to log everything you want to log for all requests or all users—it would saturate your NIC or disk. Therefore, people tend to use blunt instruments like these to blindly slash the log volume:
    • Log levels
    • Consistent hashes
    • Dumb sample rates

When you emit multiple log lines per request, you end up duplicating a lot of raw data; sometimes over half the bits are consumed by request ID, process ID, timestamp. This can be quite meaningful from a cost perspective.

All of these factors can be annoying. But the worst thing about unstructured logs is that the only way to query them is full-text search. The more data you have, the slower that search becomes, and there’s not much you can do about it.

Searching your logs over any meaningful length of time can take minutes or even hours, which means experimenting and looking around for unknown-unknowns is prohibitively time-consuming. You have to know what to look for in order to find it. Once again, as your logging bill goes up, the value goes down.

Structured logs

Structured logs are gaining adoption across the industry, especially as OpenTelemetry picks up steam. The nice thing about structured logs is that you can actually do things with the data other than slow, dumb string searches. If you’ve structured your data properly, you can perform calculations! Compute percentiles! Generate heatmaps!

Tools built on structured logs are so clearly the future. But just taking your existing logs and adding structure isn’t quite good enough. If all you do is stuff your existing log lines into key/value pairs, the problems of amplification, noisiness, and physical constraints remain unchanged—you can just search more efficiently and do some math with your data.

There are a number of things you can and should do to your structured logs in order to use them more effectively and efficiently. In order of achievability:

  • Instrument your code using the principles of canonical logs, which collects all the vital characteristics of a request into one wide, dense event. It is difficult to overstate the value of doing this, for reasons of usefulness and usability as well as cost control.
  • Add trace IDs and span IDs so you can trace your code using the same events instead of having to use an entirely separate tool.
  • Feed your data into a columnar storage engine so you don’t have to predefine a schema or indexes to decide which dimensions future you can search or compute based on.
  • Use a storage engine that supports high cardinality, with an explorable interface.
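
The first two items can be sketched in a few lines of Python (the `CanonicalEvent` class and field names are invented here for illustration, not from any particular library): accumulate context onto one wide event throughout the request, then emit it exactly once at the end.

```python
import json
import time
import uuid

class CanonicalEvent:
    """One wide, dense event per request, emitted once at the end."""

    def __init__(self, service):
        self._start = time.monotonic()
        self.fields = {
            "service": service,
            # Trace and span IDs let the same event double as a trace span.
            "trace_id": uuid.uuid4().hex,
            "span_id": uuid.uuid4().hex[:16],
        }

    def add(self, **kwargs):
        # "Infinite custom metrics": just keep adding key/value pairs.
        self.fields.update(kwargs)

    def emit(self):
        self.fields["duration_ms"] = round((time.monotonic() - self._start) * 1000, 2)
        return json.dumps(self.fields)  # in production: write to stdout or an exporter

# Usage: one event accumulates everything the request learns along the way.
ev = CanonicalEvent("checkout")
ev.add(request_path="/cart", user_id="user-42", app_id="acme", cart_items=3)
ev.add(db_calls=4, cache_hit=True, status_code=200)
line = ev.emit()
```

Because every field lands on the same record, any field can later be correlated with any other—which is precisely what the three-pillars model destroys.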

If you go far enough down this path of enriching your structured events, instrumenting your code with the right data, and displaying it in real time, you will reach an entirely different set of capabilities, with a cost model so distinct it can only be described as “observability 2.0.” More on that in a second.

Ballooning costs are baked into observability 1.0

To recap: high costs are baked into the observability 1.0 model. Every pillar has a price.

You have to collect and store your data—and pay to store it—again and again and again, for every single use case. Depending on how many tools you use, your observability bill may be growing at a rate 3x faster than your traffic is growing, or 5x, or 10x, or even more.

It gets worse. As your costs go up, the value you get out of your tools goes down.

  • Your logs get slower and slower to search.
  • You have to know what you’re searching for in order to find it.
  • You have to use blunt-force sampling techniques to keep log volume from blowing up.
  • Any time you want to be able to ask a new question, you first have to commit new code and deploy it.
  • You have to guess which custom metrics you’ll need and which fields to index in advance.
  • As volume goes up, your ability to find a needle in the haystack—any unknown-unknowns—goes down commensurately.

And nothing connects any of these tools. You cannot correlate a spike in your metrics dashboard with the same requests in your logs, nor can you trace one of the errors. It’s impossible. If your APM and metrics tools report different error rates, you have no way of resolving this confusion. The only thing connecting any of these tools is the intuition and straight-up guesses made by your most senior engineers. Which means that the cognitive costs are immense, and your bus factor risks are very real. The most important connective data in your system—connecting metrics with logs, and logs with traces—exists only in the heads of a few people.

At the same time, the engineering overhead required to manage all these tools (and their bills) rises inexorably. With metrics, an engineer needs to spend time auditing your metrics, tracking people down to fix poorly constructed metrics, and reaping those that are too expensive or don’t get used. With logs, an engineer needs to spend time monitoring the log volume, watching for spammy or duplicate log lines, pruning or consolidating them, choosing and maintaining indexes.

But the time spent wrangling observability 1.0 data types isn’t even the costliest part. The most expensive part is the unseen cost inflicted on your engineering organization as development slows down and tech debt piles up, due to low visibility and thus low confidence.

Is there an alternative? Yes.

The cost model of observability 2.0 is very different

Observability 2.0 has no three pillars; it has a single source of truth. Observability 2.0 tools are built on top of arbitrarily-wide structured log events, also known as spans. From these wide, context-rich structured log events you can derive the other data types (metrics, logs, or traces).

Since there is only one data source, you can correlate and cross-correlate to your heart’s content. You can switch fluidly back and forth between slicing and dicing, breaking down or grouping by events, and viewing them as a trace waterfall. You don’t have to worry about cardinality or key space limitations.

You also effectively get infinite custom metrics, since you can append as many as you want to the same events. Not only does your cost not go up linearly as you add more custom metrics, your telemetry just gets richer and more valuable the more key-value pairs you add! Nor are you limited to numbers; you can add any and all types of data, including valuable high-cardinality fields like “App Id” or “Full Name.”

Observability 2.0 has its own amplification factor to consider. As you instrument your code with more spans per request, the number of events you have to send (and pay for) goes up. However, you have some very powerful tools for dealing with this: you can perform dynamic head-based sampling or even tail-based sampling, where you decide whether or not to keep the event after it’s finished, allowing you to capture 100% of slow requests and other outliers.
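
A minimal sketch of a tail-based sampling rule (the thresholds and the `keep` function are invented for illustration): because the decision happens after the request finishes, you can keep 100% of the slow and failed requests while sampling the boring ones.

```python
import random

SLOW_MS = 500        # assumed latency threshold for "slow"
BASELINE_RATE = 0.01  # keep 1% of healthy, fast requests

def keep(event, rng=random.random):
    """Tail-based sampling: decide after the event is complete."""
    if event["status_code"] >= 500:      # keep every error
        return True
    if event["duration_ms"] > SLOW_MS:   # keep every slow outlier
        return True
    return rng() < BASELINE_RATE         # sample the healthy remainder

assert keep({"status_code": 503, "duration_ms": 20})   # errors always kept
assert keep({"status_code": 200, "duration_ms": 900})  # outliers always kept
```

The design point is that sampling decisions made with the full event in hand can be arbitrarily smart, whereas log levels and dumb sample rates must decide blind.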

Engineering time is your most precious resource

But the biggest difference between observability 1.0 and 2.0 won’t show up on any invoice. The difference shows up in your engineering team’s ability to move quickly, with confidence.

Modern software engineering is all about hooking up fast feedback loops. And observability 2.0 tooling is what unlocks the kind of fine-grained, exploratory experience you need in order to accelerate those feedback loops.

Where observability 1.0 is about MTTR, MTTD, reliability, and operating software, observability 2.0 is what underpins the entire software development lifecycle, setting the bar for how swiftly you can build and ship software, find problems, and iterate on them. Observability 2.0 is about being in conversation with your code, understanding each user’s experience, and building the right things.

Observability 2.0 isn’t exactly cheap either, although it is often less expensive. But the key difference between o11y 1.0 and o11y 2.0 has never been that either is cheap; it’s that with observability 2.0, when your bill goes up, the value you derive from your telemetry goes up too. You pay more money, you get more out of your tools. As you should.

Interested in learning more? We’ve written at length about the technical prerequisites for observability with a single source of truth (“observability 2.0” as we’ve called it here). Honeycomb was built to this spec; ServiceNow (formerly Lightstep) and Baselime are other vendors that qualify.

CORRECTION: The original version of this document said that “nothing connects any of these tools.” If you are using a single unified vendor for your metrics, logging, APM, RUM, and tracing tools, this is not strictly true. Vendors like New Relic or Datadog now let you define certain links between your traces and metrics, which allows you to correlate between data types in a few limited, predefined ways. This is better than nothing! But it’s very different from the kind of fluid, open-ended correlation capabilities that we describe with o11y 2.0. With o11y 2.0, you can slice and dice, break down, and group by your complex data sets, then grab a trace that matches any specific set of criteria at any level of granularity. With o11y 1.0, you can define a metric up front, then grab a random exemplar of that metric, and that’s it. All the limitations of metrics still apply; you can’t correlate any metric with any other metric from that request, app, user, etc, and you certainly can’t trace arbitrary criteria. But you’re right, it’s not nothing. 😸

LLMs Demand Observability-Driven Development

Originally posted on the Honeycomb blog on September 20th, 2023

Our industry is in the early days of an explosion in software using LLMs, as well as (separately, but relatedly) a revolution in how engineers write and run code, thanks to generative AI.

Many software engineers are encountering LLMs for the very first time, while many ML engineers are being exposed directly to production systems for the very first time. Both types of engineers are finding themselves plunged into a disorienting new world—one where a particular flavor of production problem they may have encountered occasionally in their careers is now front and center.

Namely, that LLMs are black boxes that produce nondeterministic outputs and cannot be debugged or tested using traditional software engineering techniques. Hooking these black boxes up to production introduces reliability and predictability problems that can be terrifying. It’s important to understand this, and why.

100% debuggable? Maybe not

Software is traditionally assumed to be testable, debuggable, and reproducible, depending on the flexibility and maturity of your tooling and the complexity of your code. The original genius of computing was one of constraint; that by radically constraining language and mathematics to a defined set, we could create algorithms that would run over and over and always return the same result. In theory, all software is debuggable. However, there are lots of things that can chip away at that beauteous goal and make your software mathematically less than 100% debuggable, like:

  • Adding concurrency and parallelism.
  • Certain types of bugs.
  • Stacking multiple layers of abstractions (e.g., containers).
  • Randomness.
  • Using JavaScript (HA HA).

There is a much longer list of things that make software less than 100% debuggable in practice. Some of these things are related to cost/benefit tradeoffs, but most are about weak telemetry, instrumentation, and tooling.

If you have only instrumented your software with metrics, for example, you have no way of verifying that a spike in api_requests and an identical spike in 503 errors are for the same events (i.e., you are getting a lot of api_requests returning 503) or for a disjoint set of events (the spike in api_requests is causing general congestion causing a spike in 503s across ALL events). It is mathematically impossible; all you can do is guess. But if you have a log line that emits both the request_path and the error_code, and a tool that lets you break down and group by arbitrary dimensions, this would be extremely easy to answer. Or if you emit a lot of events or wide log lines but cannot trace them, or determine what order things executed in, there will be lots of other questions you won’t be able to answer.
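
With wide events carrying both fields, that question collapses into a one-line group-by. A toy sketch (the event records are invented): break down the events by (`request_path`, `error_code`) and see whether the spikes coincide.

```python
from collections import Counter

# Hypothetical wide events, each carrying both request_path and error_code.
events = [
    {"request_path": "/api/export", "error_code": 503},
    {"request_path": "/api/export", "error_code": 503},
    {"request_path": "/api/export", "error_code": 200},
    {"request_path": "/api/login",  "error_code": 200},
]

# Group by (request_path, error_code): are the 503s the same requests as
# the api_requests spike, or spread across all endpoints?
breakdown = Counter((e["request_path"], e["error_code"]) for e in events)

assert breakdown[("/api/export", 503)] == 2  # the spike and the errors coincide
```

With metrics alone, the per-event pairing that makes this query possible was discarded at write time.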

There is another category of software errors that are logically possible to debug, but prohibitively expensive in practice. Any time you see a report from a big company that tracked down some obscure error in a kernel or an ethernet device, you’re looking at one of the rare entities with 1) enough traffic for these one in a billion errors to be meaningful, and 2) enough raw engineering power to dedicate to something most of us just have to live with.

But software is typically understandable because we have given it structure and constraints.

IF (); THEN (); ELSE () is testable and reproducible. Natural languages, on the other hand, are infinitely more expressive than programming languages, query languages, or even a UI that users interact with. The most common and repeated patterns may be fairly predictable, but the long tail your users will create is very long, and they expect meaningful results there, as well. For complex reasons that we won’t get into here, LLMs tend to have a lot of randomness in the long tail of possible results.

So with software, if you ask the exact same question, you will always get the exact same answer. With LLMs, you might not.

LLMs are their own beast

Unit testing involves asserting predictable outputs for defined inputs, but this obviously cannot be done with LLMs. Instead, ML teams typically build evaluation systems to evaluate the effectiveness of the model or prompt. However, to get an effective evaluation system bootstrapped in the first place, you need quality data based on real use of an ML model. With software, you typically start with tests and graduate to production. With ML, you have to start with production to generate your tests.
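
As a toy illustration of that bootstrapping order (the records, field names, and scoring threshold are all invented): production traffic supplies the graded interactions that become your evaluation set, and the pass rate is the number you watch as you iterate.

```python
# Hypothetical production records: real inputs plus user-graded outputs.
production_log = [
    {"input": "summarize the Q3 report", "output": "Q3 revenue grew...", "user_score": 1.0},
    {"input": "translate this to French", "output": "??", "user_score": 0.0},
    {"input": "summarize the Q2 report", "output": "Q2 revenue fell...", "user_score": 1.0},
]

def build_eval_set(log):
    """Turn graded production interactions into evaluation cases."""
    return [r for r in log if r.get("user_score") is not None]

def pass_rate(scores, threshold=0.5):
    """Fraction of cases clearing the quality bar -- the metric you track
    as you tweak prompts or swap models."""
    return sum(s >= threshold for s in scores) / len(scores)

eval_set = build_eval_set(production_log)
baseline = pass_rate([r["user_score"] for r in eval_set])  # 2 of 3 cases pass
```

Note the inversion: the eval set is derived from production, not written up front—the opposite of a traditional test suite.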

Even bootstrapping with early access programs or limited user testing can be problematic. It might be ok for launching a brand new feature, but it’s not good enough for a real production use case.

Early access programs and user testing often fail to capture the full range of user behavior and potential edge cases that arise in real-world usage across a wide range of users. All these programs do is delay the inevitable failures you’ll encounter once an uncontrolled, unprompted group of end users starts doing things you never expected them to do.

Instead of relying on an elaborate test harness to give you confidence in your software a priori, it’s a better idea to embrace a “ship to learn” mentality and release features earlier, then systematically learn from what is shipped and wrap that back into your evaluation system. And once you have a working evaluation set, you also need to figure out how quickly the result set is changing.

Phillip gives this list of things to be aware of when building with LLMs:

  • Failure will happen—it’s a question of when, not if.
  • Users will do things you can’t possibly predict.
  • You will ship a “bug fix” that breaks something else.
  • You can’t really write unit tests for this (nor practice TDD).
  • Latency is often unpredictable.
  • Early access programs won’t help you.

Sound at all familiar? 😂

Observability-driven development is necessary with LLMs

Over the past decade or so, teams have increasingly come to grips with the reality that the only way to write good software at scale is by looping in production via observability—not by test-driven development, but observability-driven development. This means shipping sooner, observing the results, and wrapping your observations back into the development process.

Modern applications are dramatically more complex than they were a decade ago. As systems get increasingly more complex, and nondeterministic outputs and emergent properties become the norm, the only way to understand them is by instrumenting the code and observing it in production. LLMs are simply on the far end of a spectrum that has become ever more unpredictable and unknowable.

Observability—both as a practice and a set of tools—tames that complexity and allows you to understand and improve your applications. We have written a lot about what differentiates observability from monitoring and logging, but the most important bits are 1) the ability to gather and store telemetry as very wide events, ordered in time as traces, and 2) the ability to break down and group by any arbitrary, high-cardinality dimension. This allows you to explore your data and group by frequency, input, or result.

In the past, we used to warn developers that their software usage patterns were likely to be unpredictable and change over time; now we inform you that if you use LLMs, your data set is going to be unpredictable, and it will absolutely change over time, and you must have a way of gathering, aggregating, and exploring that data without locking it into predefined data structures.

With good observability data, you can use that same data to feed back into your evaluation system and iterate on it in production. The first step is to use this data to evaluate the representativity of your production data set, which you can derive from the quantity and diversity of use cases.

You can make a surprising number of improvements to an LLM-based product without even touching any prompt engineering, simply by examining user interactions, scoring the quality of the response, and acting on the correctable errors (mainly data model mismatches and parsing/validation checks). You can fix or handle these manually in the code, which also gives you a bunch of test cases proving that your corrections actually work! These tests will not verify that a particular input always yields a correct final output, but they will verify that a correctable LLM output can indeed be corrected.
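
A sketch of that correction loop (the failure modes, function name, and schema here are hypothetical, chosen only to illustrate the pattern): repair the correctable output failures in code, and keep each observed failure as a regression test that the repair works.

```python
import json

def correct_llm_output(raw: str) -> dict:
    """Repair common, correctable LLM output failures before parsing.

    This does not make the answer right -- it makes a *correctable*
    answer parseable, which is a different, testable guarantee.
    """
    text = raw.strip()
    if text.startswith("```"):  # strip markdown code fences the model added
        text = text.strip("`")
        text = text.removeprefix("json").strip()
    obj = json.loads(text)
    # A hypothetical data-model mismatch we can fix deterministically:
    # the model returned "a, b" where the schema expects ["a", "b"].
    if isinstance(obj.get("tags"), str):
        obj["tags"] = [t.strip() for t in obj["tags"].split(",")]
    return obj

# Each observed failure mode becomes a test case proving the repair works.
fixed = correct_llm_output('```json\n{"title": "Q3 report", "tags": "finance, quarterly"}\n```')
assert fixed["tags"] == ["finance", "quarterly"]
```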

You can go a long way in the realm of pure software, without reaching for prompt engineering. But ultimately, the only way to improve LLM-based software is by adjusting the prompt, scoring the quality of the responses (or relying on scores provided by end users), and readjusting accordingly. In other words, improving software that uses LLMs can only be done through observability and experimentation. Tweak the inputs, evaluate the outputs, and every now and again, check your dataset for representativity drift.

Software engineers who are used to boolean/discrete math and TDD now need to concern themselves with data quality, representativity, and probabilistic systems. ML engineers need to spend more time learning how to develop products and concern themselves with user interactions and business use cases. Everyone needs to think more holistically about business goals and product use cases. There’s no such thing as an LLM that gives good answers that don’t serve the business reason it exists, after all.

So, what do you need to get started with LLMs?

Do you need to hire a bunch of ML experts in order to start shipping LLM software? Not necessarily. You cannot (there aren’t enough of them), you should not (this is something everyone needs to learn), and you don’t want to (these are changes that will make software engineers categorically more effective at their jobs). Obviously you will need ML expertise if your goal is to build something complex or ambitious, but entry-level LLM usage is well within the purview of most software engineers. It is definitely easier for software engineers to dabble in using LLMs than it is for ML engineers to dabble in writing production applications.

But learning to write and maintain software in the era of LLMs is going to transform your engineers and your engineering organizations. And not a minute too soon.

The hardest part of software has always been running it, maintaining it, and understanding it—in other words, operating it. But this reality has been obscured for many years by the difficulty and complexity of writing software. We can’t help but notice the upfront cost of writing software, while the cost of operating it gets amortized over many years, people, and teams, which is why we have historically paid and valued software engineers who write code more than those who own and operate it. When people talk about the 10x engineer, everyone automatically assumes it means someone who churns out 10x as many lines of code, not someone who can operate 10x as much software.

But generative AI is about to turn all of these assumptions upside down. All of a sudden, writing software is as easy as sneezing. Anyone can use ChatGPT or other tools to generate reams of code in seconds. But understanding it, owning it, operating it, extending and maintaining it… all of these are more challenging than ever, because in the past, the way most of us learned to understand software was by writing it.

What can we possibly do to make sure our code makes sense and works, and is extendable and maintainable (and our code base is consistent and comprehensible) when we didn’t go through the process of writing it? Well, we are in the early days of figuring that out, too. 🙃

If you’re an engineer who cares about your craft: do code reviews. Follow coding standards and conventions. Write (or generate) tests for it. But ultimately, the only way you can know for sure whether or not it works is to ship it to production and watch what happens.

This has always been true, by the way. It’s just more true now.

If you’re an engineer adjusting to the brave new era: take some of that time you used to spend writing lines of code and reinvest it back into understanding, shipping under controlled circumstances, and observing. This means instrumenting your code with intention, and inspecting its output. This means shipping as soon as possible into the production environment. This means using feature flags to decouple deploys from releases and gradually roll new functionality out in a controlled fashion. Invest in these—and other—guardrails to make the process of shipping software more safe, fine-grained, and controlled.

Most of all, it means developing the habit of looking at your code in production, through the lens of your telemetry, and asking yourself: does this do what I expected it to do? Does anything else look weird?

Or maybe I should say “looking at your systems” instead of “looking at your code,” since people might confuse the latter with an admonition to “read the code.” The days when you could predict how your system would behave simply by reading lines of code are long, long gone. Software behaves in unpredictable, emergent ways, and the important part is observing your code as it’s running in production, while users are using it. Code in a buffer can tell you very little.

This future is a breath of fresh air

This, for once, is not a future I am afraid of. It’s a future I cannot wait to see manifest. For years now, I’ve been giving talks on modern best practices for software engineering—developers owning their code in production, testing in production, observability-driven development, continuous delivery in a tight feedback loop, separating deploys from releases using feature flags. No one really disputes that life is better, code is better, and customers are happier when teams adopt these practices. Yet, only 11% of teams can deploy their code in less than a day, according to the DORA report. Only a tiny fraction of teams are operating in the way everybody agrees we all should!

Why? The answers often boil down to organizational roadblocks, absurd security/compliance policies, or lack of buy-in/prioritization. Saddest of all are the ones who say something like, “our team just isn’t that good” or “our people just aren’t that smart” or “that only works for world-class teams like the Googles of the world.” Completely false. Do you know what’s hard? Trying to build, run, and maintain software on a two-month delivery cycle. Running with a tight feedback loop is so much easier.

Just do the thing

So how do teams get over this hump and prove to themselves that they can have nice things? In my experience, only one thing works: when someone joins the team who has seen it work before, has confidence in the team’s abilities, and is empowered to start making progress against those metrics (which they tend to try to do, because people who have tried writing code the modern way become extremely unwilling to go back to the bad old ways).

And why is this relevant?

I hypothesize that over the course of the next decade, developing with LLMs will stop being anything special, and will simply be one skill set of many, alongside mobile development, web development, etc. I bet most engineers will be writing code that interacts with an LLM. I bet it will become not quite as common as databases, but up there. And while they’re doing that, they will have to learn how to develop using short feedback loops, testing in production, observability-driven development, etc. And once they’ve tried it, they too may become extremely unwilling to go back.

In other words, LLMs might ultimately be the Trojan Horse that drags software engineering teams into the modern era of development best practices. (We can hope.)

In short, LLMs demand we modify our behavior and tooling in ways that will benefit even ordinary, deterministic software development. Ultimately, these changes are a gift to us all, and the sooner we embrace them the better off we will be.

LLMs Demand Observability-Driven Development

Ritual Brilliance: How a pair of Shrek ears shaped Linden Lab culture by making failure funny — and safe

[Originally posted on the now-defunct “Roadmap: A Magazine About Work” website, on May 30th, 2023. A pretty, nicely-formatted PDF version of this article can be downloaded here. Thanks to Molly McArdle for editing!]

If you talk to former Lindens about the company’s culture—and be careful, because we will do so at length—you will eventually hear about the Shrek ears.

When you saw a new person wearing the Shrek ears, a matted green-felt headband with ogre ears on it, you introduced yourself, congratulated them warmly, and begged to hear the story of how they came to be wearing them. Then you welcomed the new person to the team (“You’re truly one of us now!”) and shared a story about a time when you did something even dumber than they did.

My first job after (dropping out of) college was at Linden Lab, the home of Second Life. I joined in 2004 and stayed for nearly six years, during which the company grew from around 25 nerds in a room to around 400 employees who worked out of offices in Brighton, San Francisco, Menlo Park, and Singapore, or their own homes—wherever they were.

When I think back on that time now, almost two decades later, I’m puzzled by the Shrek ears phenomenon. I wasn’t exactly powerful then, at barely 20 years old. Not only was this my first real job, I was also the first woman engineer, and I made tons of mistakes. Shouldn’t I have found the practice of being systematically singled out and spotlighted for my errors humiliating, shaming, and traumatic?

Yet I remember loving the tradition and participating with joy and vigor. Everyone else seemed to love it, too. The practice spread beyond engineering and out into the rest of the company, not by fiat but because individual people would voluntarily track down the Shrek ears and put them on their own head. (I’m not imagining this, right?)

Step 1, break production; Step 2, put on Shrek ears

Here’s how it worked: The first time an engineer broke production or caused a major outage, they would seek out the ears and put them on for the day. The ears weren’t a mark of shame—they were a badge of honor! Everyone breaks production eventually, if they’re working on something meaningful.

If people saw you wearing the ears, they would eagerly ask, “What happened? How did you find the problem? What was the fix?” Then they would regale you with their own stories of breaking production or tell you about the first outage they caused. If the person was self-flagellating or being too hard on themselves, the Shrek ears gave their colleagues an excuse to kindly but firmly correct it on the spot. It was Linden’s way of saying, Hey, we don’t do that here: “You did the reasonable thing! How can we make the system better, so the next person doesn’t stumble into the same trap?”

In those days, Linden was running a massively distributed system across multiple data centers on three continents, and doing so without the help of DevOps, CI/CD, GitHub, virtualization, the cloud, or infrastructure as code. We had an incredibly high-performing operations team, with a thousand-to-one server-to-ops engineer ratio, which was a real achievement in the days when the role required doing everything from racking and stacking boxes in the colocation center to developing your own automation software.

Failures were just fucking inevitable. In a world like that, devoid of the entire toolchain ecosystem we’ve come to rely on, you just had to learn to roll with it, absorb the hits, and keep moving fast. You could only test so much in staging; it was more important to get it out into production and watch it—understand it—there. It was better to invest in swift recovery, graceful degradation, and decoupling services than to focus on trying to prevent anything from going wrong. (Still is, as a matter of fact.)

This might all sound a little overwrought to you—maybe even dangerous or irresponsible. Didn’t we care about quality? Were we bad engineers?

The Shrek ears were “blameless retros” before there were blameless retros

I assure you, we cared. The engineers I worked with at Linden were of at least as high a caliber as the engineers I later worked with at Facebook (and a whole lot more diverse). In this specific place and time, the Shrek ears were what we needed to alleviate paralysis and fear of production, and to encourage the sharing of knowledge—even if anecdotal—about our systems.

In retrospect, the Shrek ears were a brilliant piece of social jujitsu. There was an element of shock value or contrarianism in celebrating outages instead of getting all worked up about them. But the larger purpose of the ears was to reset people’s expectations (especially in the case of new hires) and reprogram them with a different set of values: Linden’s values.

In the years since those early days at Linden, the industry has developed an entire language and set of practices around dealing with the aftermath of incidents: blameless post mortems, retrospectives, and so on. But those tools weren’t available to us at the time. What we did have was the Shrek ears. A couple of times a month something would break, the ears would be claimed, and we would all go around reminding one another that failure is both inevitable and ridiculous, and that no one is going to get mad at you or fire you when it happens.

Failure is always a question of when, not if

It’s important to note that you never saw anyone get teased or shamed for wearing the ears or for breaking production. There was a script to follow, and we all knew it. We learned it from watching others put on the ears, or by donning them ourselves. On a day when the Shrek ears had appeared, people would gather around at lunch or at the bar after work and swap war stories, one-upping one another and laughing uproariously.

Every new engineer was told, “If you never break production, you probably aren’t doing anything that really matters or taking enough risks.”

It’s also important to emphasize that the ears were opt-in, not opt-out. You didn’t have to do it. And if you did take them, you could expect a wave of sympathy, good humor, and support. It affirmed that you deserved to be here, that you were part of the team.

And though the Shrek ears started in engineering, people in sales, marketing, accounting, and other departments picked them up over the years. It was a process of voluntary adoption, not a top-down policy. Someone would announce in IRC that they were wearing the ears today, and why. The camaraderie and laughter that ensued were infectious—and made it easier and easier over time for people to be transparent about what wasn’t working.

Rituals exist to instill values and train culture

In Rituals for Work, Kursat Ozenc defines rituals as “actions that a person or group does repeatedly, following a similar pattern or script, in which they’ve imbued symbolism and meaning.” Ritual exists to instill a value, create a mindset, or train a reflex.

And this particular ritual was extremely effective at taking lots of scared engineers and teaching them, very quickly:

✨ It is safe to fail✨
✨ Failure is constant✨
✨ Failure is fucking hilarious✨

At Linden, failure was not something to be ashamed of or to hide from your teammates. We understood that it’s not something that happens only to careless or inexperienced people. In fact, the senior people have the funniest fuckups—because what they are trying to do is insanely hard. The Shrek ears taught us that you fail, you laugh, you drink whiskey, you move on.

Other companies had similar rituals around the same time—Etsy famously had the “three-armed sweater,” which they would pass around to whoever had last broken production. But I’ve never again worked at a place where mistakes were discussed as freely and easily across the entire company as they were at Linden Lab. And I think the Shrek ears had a lot to do with that.

Their point was never to single out the person who had made a mistake and humiliate them, but the exact opposite. By putting on the ears, you said not just “Hi, I made a mistake” but also “I’m going to be brave about it, so we can all collectively learn and improve.” It was a ritualized act of bravery rewarded by affirmation, empathy, and acceptance. At Linden, the Shrek ears weren’t just a terrific tool for promoting team coherence and creating a sense of belonging. They also provided structure to help individuals and teams recover from scary events, and even traumas.

In so many ways, Linden Lab was ahead of its time

Linden was an extremely strange workplace when I was there, and it inspired unusually strong devotion, which we self-deprecatingly referred to as “the Kool-Aid.” It can be difficult to convey just how radical and weird it was at the time because the world has changed so much since then, and so many of the company’s “weird” philosophies have since gone mainstream. (Though not all: using “Kool-Aid” as a casual phrase to denote “excessive enthusiasm” or “cult-like devotion” is now recognized by many as being in poor taste. After all, people actually died at the Jonestown massacre.)

In a lot of ways, Linden culture (and Second Life technology) was profoundly, recognizably modern, and similar to the best workplaces of today, 20+ years later.

Philip Rosedale, Linden’s founder and CEO, is an inventor and technologist who believed it was every inch as interesting and important to experiment with company culture as with the virtual worlds we built. Except we did it all from scratch: building the technology and the culture together. And this led us down some weird rabbit holes, such as a cron job that rsynced the entire file system down over thousands of live servers every night. And the Shrek ears.

There was a period when “Choose your own work” was a company core value, and there were effectively no managers. (Not every experiment worked!) We went all-in on a fully distributed company culture at a time when practically no one else had. We ran a massively distributed, high-concurrency virtual world at a time before microservices, sharded databases, config management, virtualization, AWS, or SRE and DevOps.

I can understand why people now find this story horrifying

With the distance of time, I get why the Shrek ears might make you recoil. If you think “That sounds awful! What kind of monsters would do that to each other?”—you are far from alone. Any time I mention the story in public, a sizable minority of people are aghast and appalled. Representative quotes include:

“I hope you realize how many people you traumatized by doing this to them.”

“I wonder how many introverted people found this excruciating but were too afraid to say so.”

“Office bullying is fucked up even with cute Shrek ears.”

Even:

“We heard about the Shrek ears from an engineer we interviewed. He was telling us how great they were, but we were all so horrified that we declined to hire him because of it.”

And they’re right. It sounds awful to us now. It really does! It sounds like we were singling people out for their failures, like a dunce cap. I wouldn’t be surprised to someday learn that, in fact, a small number of people did feel pressured into using the ears, or hated them and were too afraid to say something. But how do we account for the fact that this tradition was so deeply beloved by so many—and that we are still fondly reminiscing about it more than 15 years later? It had a purpose.

Linden Lab was an incredibly progressive company for its time: very anti-hierarchical, very much about empowering people to be creative and independent. It also was by far the most diverse company I’ve ever worked in (other than Honeycomb, which I cofounded and where I’m CTO), with lots of women and genderqueer and trans people and people of color. We were way out on the sensitive branch relative to tech at that time. It’s tough to square this knowledge of what Linden was like as a place with the reactions some people outside the organization have to the Shrek ears.

I think this is, above all, a sign of progress. So many questionable practices that were ordinary back then—like referring to everyone as “guys,” using terms like “master/slave” for replication, or throwing alcohol-sloshed parties—are now rightfully frowned upon. We have become more sensitive to people’s differences and more clued into the power dynamics of the workplace. It’s far from perfect, but it is a lot better.

As a ritual, the Shrek ears were powerful and did the job. They were also fun—proving once again that making something goofy is the best way to make it stick. But I can’t imagine plopping Shrek ears on a new hire who has just broken production in 2023. And honestly, I think that’s probably a good thing. It’s time for new rituals.


Deploys Are The ✨WRONG✨ Way To Change User Experience

This piece was first published on the honeycomb.io blog on 2023-03-08.


I’m no stranger to ranting about deploys. But there’s one thing I haven’t sufficiently ranted about yet, which is this: Deploying software is a terrible, horrible, no good, very bad way to go about the process of changing user-facing code.

It sucks even if you have excellent, fast, fully automated deploys (which most of you do not). Relying on deploys to change user experience is a problem because it fundamentally confuses and scrambles up two very different actions: Deploys and releases.

Deploy

“Deploying” refers to the process of building, testing, and rolling out changes to your production software. Deploying should happen very often, ideally several times a day. Perhaps even triggered every time an engineer lands a change.

Everything we know about building and changing software safely points to the fact that speed is safety and smaller changes make for safer deploys. Every deploy should apply a small diff to your software, and deploys should generally be invisible to users (other than minor bug fixes).

Release

“Releasing” refers to the process of changing user experience in a meaningful way. This might mean anything from adding functionality to adding entire product lines. Most orgs have some concept of above or below the fold where this matters. For example, bug fixes and small requests can ship continuously, but larger changes call for a more involved process that could mean anything from release notes to coordinating a major press release.

A tale of cascading failures

Have you ever experienced anything like this?

Your company has been working on a major new piece of functionality for six months now. You have tested it extensively in staging and dev environments, even running load tests to simulate production. You have a marketing site ready to go live, and embargoed articles on TechCrunch and The New Stack that will be published at 10:00 a.m. PST. All you need to do now is time your deploy so the new product goes live at the same time.

It takes about three hours to do a full build, test, and deploy of your entire system. You’ve deployed as much as possible in advance, and you’ve already built and tested the artifacts, so all you have to do is a streamlined subset of the deploy process in the morning. You’ve gotten it down to just about an hour. You are paranoid, so you decide to start an hour early. So you kick off the deploy script at 8:00 a.m. PST… and sit there biting your nails, waiting for it to finish.

SHIT! Twenty minutes into the deploy, there’s a random flaky SSH timeout that causes the whole thing to cancel and roll back. You realize that by running a non-standard subset of the deploy process, some of your error handling got bypassed. You frantically fix it and restart the whole process.

Your software finishes deploying at 9:30 a.m., 30 minutes before the embargoed articles go live. Visitors to your website might be confused in the meantime, but better to finish early than to finish late, right? 😬

Except… as 10:00 a.m. rolls around, and new users excitedly begin hitting your new service, you suddenly find that a path got mistyped, and many requests are returning 500. You hurriedly merge a fix and begin the whole 3-hour long build/test/deploy process from scratch. How embarrassing! 🙈

Deploys are a terrible way to change user experience

The build/release/deploy process generally has a lot of safeguards and checks baked in to make sure it completes correctly. But as a result…

  • It’s slow
  • It’s often flaky
  • It’s unreliable
  • It’s staggered
  • The process itself is untestable
  • It can be nearly impossible to time it right
  • It’s very all or nothing—the norm is to roll back completely upon any error
  • Fixing a single character mistake takes the same amount of time as doubling the feature set!

Changing user-visible behaviors and feature sets using the deploy process is a great way to get egg on your face. The process is built for distributing large code artifacts; user experience gets changed only as a side effect.

So how should you change user experience?

By using feature flags.

Feature flags: the solution to many of life’s software problems

You should deploy your code continuously throughout the day or week. But you should wrap any large, user-visible behavior changes behind a feature flag, so you can release that code by flipping a flag.

This enables you to develop safely without worrying about what your users see. It also means that turning a feature on and off no longer requires a diff, a code review, or a deploy. Changing user experience is no longer an engineering task at all.

Deploys are an engineering task, but releases can be done by product managers—even marketing teams. Instead of trying to calculate when to begin deploying by working backwards from 10:00 a.m., you simply flip the switch at 10:00 a.m.
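To make the deploy/release split concrete, here’s a minimal sketch. `FLAGS` stands in for whatever backs your flags (a vendor SDK, a config table); all the names are illustrative, not any real product’s API:

```python
# The new code path ships "dark": deployed to production, but not released.
FLAGS = {"new_checkout_flow": False}

def render_checkout(user_id: str) -> str:
    """Route between old and new behavior based on the flag value."""
    if FLAGS.get("new_checkout_flow", False):
        return "new checkout"   # released behavior
    return "old checkout"       # current behavior

# At 10:00 a.m., a PM flips the flag -- user experience changes instantly,
# with no build/test/deploy cycle in the critical path.
FLAGS["new_checkout_flow"] = True
```

The deploy happened days or weeks earlier, on its own schedule; the release is just that last line.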

Testing in production, progressive delivery

The benefits of decoupling deploys and releases extend far beyond timely launches. Feature flags are a critical tool for apostles of testing in production (spoiler alert: everybody tests in production, whether they admit it or not; good teams are aware of this and build tools to do it safely). You can use feature flags to do things like:

  • Enable the code for internal users only
  • Show it to a defined subset of alpha testers, or a randomized few
  • Slowly ramp up the percentage of users who see the new code. This is super helpful when you aren’t sure how much load it will place on a backend component
  • Build a new feature, but turn it on only for a couple of “early access” customers who are willing to deal with bugs
  • Make a perf improvement that should be logically bulletproof (and invisible to the end user), but roll it out safely: ship it flagged off, then do progressive delivery starting with users/customers/segments that are low risk if something’s fucked up
  • Do something timezone-related in a batch process, and test it out on New Zealand (small audience, timezone far away from your engineers in PST) first

Allowing beta testing, early adoption, etc. is a terrific way to prove out concepts, involve development partners, and have some customers feel special and extra engaged. And feature flags are a veritable Swiss Army Knife for practicing progressive delivery.

Feature flags become a downright superpower when combined with an observability tool (a real one that supports high cardinality, etc.), because you can:

  • Break down and group by flag name plus build id, user id, app id, etc.
  • Compare performance, behavior, or return code between identical requests with different flags enabled
  • For example: “requests to /export with flag ‘USE_CACHING’ enabled are 3x slower than requests to /export without that flag, and 10% of them now return 402”

It’s hard to emphasize enough just how powerful it is when you have the ability to break down by build ID and feature flag value and see exactly what the difference is between requests where a given flag is enabled vs. requests where it is not.
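Of course, that breakdown only works if the build ID and flag values are attached to your telemetry in the first place. A minimal sketch of emitting one wide, structured event per request with those fields attached (the field names are illustrative, not any vendor’s schema):

```python
import json
import time

def emit_event(request_path: str, status: int, duration_ms: float,
               build_id: str, flags: dict) -> str:
    """Emit one wide event per request as a JSON log line, with the
    build id and every active flag value as first-class fields, so an
    observability tool can break down latency or error rate by them."""
    event = {
        "timestamp": time.time(),
        "request_path": request_path,
        "status": status,
        "duration_ms": duration_ms,
        "build_id": build_id,
        # Prefix flag fields so they group together in breakdowns
        **{f"flag.{name}": value for name, value in flags.items()},
    }
    return json.dumps(event)
```

With events shaped like this, “compare /export with USE_CACHING on vs. off, per build” is just a group-by, not a log-spelunking expedition.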

It’s very challenging to test in production safely without feature flags; the possibilities for doing so with them are endless. Feature flags are a scalpel, where deploys are a chainsaw. The two complement each other, and both have their place.

“But what about long-lived feature branches?”

Long-lived branches are the traditional way that teams develop features, and do so without deploying or releasing code to users. This is a familiar workflow to most developers.

But there is much to be said for continuously deploying code to production, even if you aren’t exposing new surface area to the world. There are lots of subterranean dependencies and interactions that you can test and validate all along.

There’s also something very psychologically different about working in long-lived branches versus deploying continuously. As one of our engineering directors, Jess Mink, says:

There’s something very different, stress and motivation-wise. It’s either, ‘my code is in a branch, or staging env. We’re releasing, I really hope it works, I’ll be up and watching the graphs and ready to respond,’ or ‘oh look! A development customer started using my code. This is so cool! Now we know what to fix, and oh look at the observability. I’ll fix that latency issue now and by the time we scale it up to everyone it’s a super quiet deploy.’

Which brings me to another related point. I know I just said that you should use feature flags for shipping user-facing stuff, but being able to fix things quickly makes you much more willing to ship smaller user-facing fixes. As our designer, Sarah Voegeli, said:

With frequent deploys, we feel a lot better about shipping user-facing changes via deploy (without necessarily needing a feature flag), because we know we can fix small issues and bugs easily in the next one. We’re much more willing to push something out with a deploy if we know we can fix it an hour or two later if there’s an issue.

Everything gets faster, which instills more confidence, which means everything gets faster. It’s an accelerating feedback loop at the heart of your sociotechnical system.

“Great idea, but this sounds like a huge project. Maybe next year.”

I think some people have the idea that this has to be a huge, heavyweight project that involves signing up for a SaaS, forking over a small fortune, and changing everything about the way they build software. While you can do that—and we’re big fans/users of LaunchDarkly in particular—you don’t have to, and you certainly don’t have to start there.

As Mike Terhar from our customer success team says, “When I build them in my own apps, it’s usually just something in a ‘configuration’ database table. You can make a config that can enable/disable, or set a scope by team, user, region, etc.”

You don’t have to get super fancy to decouple deploys from releases. You can start small. Eliminate some pain today.
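A minimal sketch of that config-table approach, using an in-memory sqlite table (the table, column, and flag names are all illustrative):

```python
import sqlite3

# The "flag store" is just a table; flipping a flag is an UPDATE,
# not a deploy. Scope lets you limit a flag to a team/user/region.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE feature_flags (name TEXT PRIMARY KEY, enabled INTEGER, scope TEXT)"
)
db.execute("INSERT INTO feature_flags VALUES ('new_export', 1, 'team:data')")

def flag_enabled(name: str, scope: str = "") -> bool:
    row = db.execute(
        "SELECT enabled, scope FROM feature_flags WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        return False              # unknown flags default to off
    enabled, flag_scope = row
    if flag_scope and flag_scope != scope:
        return False              # flag is scoped to someone else
    return bool(enabled)
```

It’s not LaunchDarkly, but it decouples deploys from releases today, and you can always graduate to a real flag service later.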

In conclusion

Decoupling your deploys and releases frees your engineering teams to ship small changes continuously, instead of sitting on branches for a dangerous length of time. It empowers other teams to own their own roadmaps and move at their own pace. It is better, faster, safer, and more reliable than trying to use deploys to manage user-facing changes.

If you don’t have feature flags, you should embrace them. Do it today! 🌈
