In logs as in life, the relationships are the most important part. AI doesn’t fix this. It makes it worse.

After twenty years of devops, most software engineers still treat observability like a fire alarm — something you check when things are already on fire.

Not a feedback loop you use to validate every change after shipping. Not the essential, irreplaceable source of truth on product quality and user experience.

This is not primarily a culture problem, or even a tooling problem. It’s a data problem. The dominant model for telemetry collection stores each type of signal in a different “pillar”, which rips the fabric of relationships apart — irreparably.

Your observability data is self-destructing at write time

The three pillars model works fine for infrastructure1, but it is catastrophic for software engineering use cases, and will not serve for agentic validation.

But why? It’s a flywheel of compounding factors, not just one thing, but the biggest one by far is this:

✨Data is made powerful by context✨

The more context you collect, the more powerful it becomes

Your data does not become linearly more powerful as you widen the dataset, it becomes exponentially more powerful. Or if you really want to get technical, it becomes combinatorially more powerful as you add more context.

I made a little Netlify app here where you can enter how many attributes you store per log or trace, to see how powerful your dataset is.

4 fields? 6 pairwise combos, 15 possible combinations.
8 fields? 28 pairwise combos, 255 possible combinations.
50 fields? 1.2K pairwise combos, 1.1 quadrillion (2^250) possible combinations, as seen in the screenshot below.

When you add another attribute to your structured log events, it doesn’t just give you “one more thing to query”. It gives you new combinations with every other field that already exists.

The wider your data is, the more valuable the data becomes. Click on the image to go futz around with the sliders yourself.

Note that this math is exclusively concerned with attribute keys. Once you account for values, the precision of your tooling goes higher still, especially if you handle high cardinality data.

Data is made valuable by relationships

“Data is made valuable by context” is another way of saying that the relationships between attributes are the most important part of any data set.

This should be intuitively obvious to anyone who uses data. How valuable is the string “Mike Smith”, or “21 years old”? Stripped of context, they hold no value.

By spinning your telemetry out into siloes based on signal type, the three pillars model ends up destroying the most valuable part of your data: its relational seams.

AI-SRE agents don’t seem to like three pillars data

I posted something on LinkedIn yesterday, and got a pile of interesting comments. One came from Kyle Forster, founder of an AI-SRE startup called RunWhen, who linked to an article he wrote called “Do Humans Still Read Logs?”

Humpty Dumpty traced every span, Humpty Dumpty had a great plan.

In his article, he noted that <30% of their AI SRE tools were to “traditional observability data”, i.e. metrics, logs and traces. Instead, they used the instrumentation generated by other AI tools to wrap calls and queries. His takeaway:

Good AI reasoning turns out to require far less observability data than most of us thought when it has other options.

My takeaway is slightly different. After all, the agent still needed instrumentation and telemetry in order to evaluate what was happening. That’s still observability, right?

But as Kyle tells it, the agents went searching for a richer signal than the three pillars were giving them. They went back to the source to get the raw, pre-digested telemetry with all its connective tissue intact. That’s how important it was to them.

Huh.

You can’t put Humpty back together again

I’ve been hearing a lot of “AI solves this”, and “now that we have MCPs, AI can do joins seamlessly across the three pillars”, and “this is a solved problem”.

Mmm. Joins across data siloes can be better than nothing, yes. But they don’t restore the relational seams. They don’t get you back to the mathy good place where every additional attribute makes every other attribute exponentially more valuable. At agentic speed, that reconstruction becomes a bottleneck and a failure surface.

Humpty Dumpty stored all the state, Humpty Dumpty forgot to replicate.

Our entire industry is trying to collectively work out the future of agentic development right now. The hardest and most interesting problems (I think) are around validation. How do we validate a change rate that is 10x, 100x, 1000x greater than before?

I don’t have all the answers, but I do know this: agents are going to need production observability with speed, flexibility, TONS of context, and some kind of ontological grounding via semantic conventions.

In short: agents are going to need precision tools. And context (and cardinality) are what feed precision.

Production is a very noisy place

Production is a noisy, rowdy place of chaos, particularly at scale. If you are trying to do anomaly detection with no a priori knowledge of what to look for, the anomaly has to be fairly large to be detected. (Or else you’re detecting hundreds of “anomalies” all the time.)

But if you do have some knowledge of intent, along with precision tooling, these anomalies can be tracked and validated even when they are exquisitely minute. Like even just a trickle of requests2 out of tens of millions per second.

Let’s say you work for a global credit card provider. You’re rolling out a code change to partner payments, which are “only” tens of thousands of requests per second — a fraction of your total request volume of tens of millions of req/sec, but an important one.

This is a scary change, no matter how many tests you ran in staging. To test this safely in production, you decide to start by rolling the new build out to a small group of employee test users, and oh, what the hell — you make another feature flag that lets any user opt in, and flip it on for your own account.

You wait a few days. You use your card a few times. It works (thank god).

On Monday morning you pull up your observability data and select all requests containing the new build_id or commit hash, as well as all of the feature flags involved. You break down by endpoint, then start looking at latency, errors, and distribution of request codes for these requests, comparing them to the baseline.

Hm — something doesn’t seem quite right. Your test requests aren’t timing out, but they are taking longer to complete than the baseline set. Not for all requests, but for some.

Further exploration lets you isolate the affected requests to a set with a particular query hash. Oops.. how’d that n+1 query slip in undetected??

You quickly submit a fix, ship a new build_id, and roll your change out to a larger group: this time, it’s going out to 1% of all users in a particular region.

The anomalous requests may have been only a few dozen per day, spread across many hours, in a system that served literally billions of requests in that time.

Humpty Dumpty: assembled, redeployed, A patchwork of features half-built, half-destroyed. “It’s not what we planned,” said the architect, grim. “But the monster is live — and the monster is him.”

Precision tooling makes them findable. Imprecise tooling makes them unfindable.

How do you expect your agents to validate each change, if the consequences of each change cannot be found?[3]

Well, one might ask, how have we managed so far? The answer is: by using human intuition to bridge the gaps. This will not work for agents. Our wisdom must be encoded into the system, or it does not exist.

Agents need speed, flexibility, context, and precision to validate in prod

In the past, excruciatingly precise staged rollouts like these have been mostly the province of your Googles and Facebooks. Progressive deployments have historically required a lot of tooling and engineering resources.

Agentic workflows are going to make these automated validation techniques much easier and more widely used; at the exact same time, agents developing to spec are going to require a dramatically higher degree of precision and automated validation in production.

It is not just the width of your data that matters when it comes to getting great results from AI. There’s a lot more involved in optimizing data for reasoning, attribution, or anomaly detection. But capturing and preserving relationships is at the heart of all of it.

In this situation, as in so many others, AI is both the sickness and the cure[4]. Better get used to it.

1 — Infrastructure teams use the three pillars for one extremely good reason: they have to operate a lot of code they did not write and can not change. They have to slurp up whatever metrics or logs the components emit and store them somewhere.

2 — Yes, there are some complications here that I am glossing past, ones that start with ‘s’ and rhyme with “ampling”. However, the rich data + sampling approach to the cost-usability balance is generally satisfied by dropping the least valuable data. The three pillars approach to the cost-usability problem is generally satisfied by dropping the MOST valuable data: cardinality and context.

3 — The needle-in-a-haystack is one visceral illustration of the value of rich context and precision tooling, but there are many others. Another example: wouldn’t it be nice if your agentic task force could check up on any diffs that involve cache key or schema changes, say, once a day for the next 6-12 months? These changes famously take a long time to manifest, by which time everyone has forgotten that they happened.

4 — One sentence I have gotten a ton of mileage out of lately: “AI, much like alcohol, is both the cause of and solution to all of life’s problems.”

From Cloudwashing to O11ywashing

November 24, 2025 mipsytipsymonitoring, rageguy, three pillars, unified storage, vendors6 Comments

I was just watching a panel on observability, with a handful of industry executives and experts who shall remain nameless and hopefully duly obscured—their identities are not the point, the point is that this is a mainstream view among engineering executives and my head is exploding.

Scene: the moderator asked a fairly banal moderator-esque question about how happy and/or disappointed each exec has been with their observability investments.

One executive said that as far as traditional observability tools are concerned (“are there faults in our systems?”), that stuff “generally works well.”

However, what they really care about is observing the quality of their product from the customer’s perspective. EACH customer’s perspective.

Nines don't matter if users aren't happy — Nines don’t matter if users aren’t happy

“Did you know,” he mused, “that there are LOTS of things that can interrupt service or damage customer experience that won’t impact your nines of availability?”

(I begin screaming helplessly into my monitor.)

“You could have a dependency hiccup,” he continued, oblivious to my distress. “There could be an issue with rendering latency in your mobile app. All kinds of things.”

(I look down and realize that I am literally wearing this shirt.)

He finishes with,“And that is why we have invested in our own custom solution to measure key workflows through startup payment and success.”

(I have exploded. Pieces of my head now litter this office while my headless corpse types on and on.)

It’s twenty fucking twenty five. How have we come to this point?

Observability is now a billion dollar market for a meaningless term

My friends, I have failed you.

It is hard not to register this as a colossal fucking failure on a personal level when a group of modern, high performing tech execs and experts can all sit around a table nodding their heads at the idea that “traditional observability” is about whether your systems are UP👆 or DOWN👇, and that the idea of observing the quality of service from each customer’s perspective remains unsolved! unexplored! a problem any modern company needs to write custom tooling from scratch to solve.

This guy is literally describing the original definition of observability, and he doesn’t even know it. He doesn’t know it so hard that he went and built his own thing.

You guys know this, right? When he says “traditional observability tools”, he means monitoring tools. He means the whole three fucking pillars model: metrics, logging, and tracing, all separate things. As he notes, these traditional tools are entirely capable of delivering on basic operational outcomes (are we up, down, happy, sad?). They can DO this. They are VERY GOOD tools if that is your goal.

But they are not capable of solving the problem he wants to solve, because that would require combining app, business, and system telemetry in a unified way. Data that is traceable, but not just tracing. With the ability to slice and dice by any customer ID, site location, device ID, blah blah. Whatever shall we call THAT technological innovation, when someone invents it? Schmobservability, perhaps?

So anyway, “traditional observability” is now part of the mainstream vernacular. Fuck. What are we going to do about it? What CAN be done about it?

From cloudwashing to o11ywashing

I learned a new term yesterday: cloudwashing. I learned this from Rick Clark, who tells a hilarious story about the time IBM got so wound up in the enthusiasm for cloud computing that they reclassified their Z series mainframe as “cloud” back in 2008.

(Even more hilarious: asking Google about the precipitating event, and following the LLM down a decade-long wormhole of incredibly defensive posturing from the IBM marketing department and their paid foot soldiers in tech media about how this always gets held up as an example of peak cloudwashing but it was NOT AT ALL cloudwashing due to being an extension of the Z/Series Mainframe rather than a REPLACEMENT of the Z/Series Mainframe, and did you know that Mainframes are bigger business and more relevant today than ever before?)

(Sorry, but I lost a whole afternoon to this nonsense, I had to bring you along for the ride.)

Rick says the same thing is happening right now with observability. And of course it is. It’s too big of a problem, with too big a budget: an irresistible target. It’s not just the legacy behemoths anymore. Any vendor that does anything remotely connected to telemetry is busy painting on a fresh coat of o11ywashing. From a marketing perspective, It would be irresponsible not to.

How to push back on *-washing

Anyway, here are the key takeaways from my weekend research into cloudwashing.

This o11ywashing problem isn’t going away. It is only going to get bigger, because the problem keeps getting bigger, because the traditional vendors aren’t solving it, because they can’t solve it.
The Gartners of the world will help users sort this out someday, maybe, but only after we win. We can’t expect them to alienate multibillion dollar companies in the pursuit of technical truth, justice and the American Way. If we ever want to see “Industry Experts” pitching in to help users spot o11ywashing, as they eventually did with cloudwashing (see exhibit A), we first need to win in the market.
Exhibit A: “How to Spot Cloudwashing”
And (this is the only one that really matters.) we have to do a better job of telling this story to engineering executives, not just engineers. Results and outcomes, not data structures and algorithms.

(I don’t want to make this sound like an epiphany we JUST had…we’ve been working hard on this for a couple years now, and it’s starting to pay off. But it was a powerful confirmation.)

Talking to execs is different than talking to engineers

When Christine and I started Honeycomb, nearly ten years ago, we were innocent, doe-eyed engineers who truly believed on some level that if we just explained the technical details of cardinality and dimensionality clearly and patiently enough to the world, enough times, the consequences to the business would become obvious to everyone involved.

It has now been ten years since I was a hands-on engineer every day (say it again, like pressing on a bruise makes it hurt less), and I would say I’ve been a decently functioning exec for about the last three or four of those years.

What I’ve learned in that time has actually given me a lot of empathy for the different stresses and pressures that execs are under.

I wouldn’t say it’s less or more than the stresses of being an SRE on call for some of the world’s biggest databases, but it is a deeply and utterly different kind of stress, the kind of stress less expiable via fine whiskey and poor life choices. (You just wake up in the morning with a hangover, and the existential awareness of your responsibilities looming larger than ever.)

This is a systems problem, not an operational one

There is a lot of noise in the field, and executives are trying to make good decisions that satisfy all parties and constraints amidst the unprecedented stress-panic-opportunity-terror of AI changing everything. That takes storytelling skills and sales discipline on our part, in addition to technical excellence.

Companies are dumping more and more and more money into their so-called observability tools, and not getting any closer to a solution. Nor will they, so long as they keep thinking about observability in terms of operational outcomes (and buying operational tools). Observability is a systems problem. It’s the most powerful lever in your arsenal when it comes to disrupting software doom spirals and turning them into positive feedback loops. Or it should be.

As Fred Hebert might say, it’s great you’re so good at firefighting, but maybe it’s time to go read the city fire codes.

Execs don’t know what they don’t know, because we haven’t been speaking to them. But we’re starting to.

What will be the next term that gets invented and coopted in the search to solve this problem?

Where to start, with a project so big? Google’s AI says that “experts suggest looking for specific features to identify true ~~cloud~~ observability solutions versus ~~cloudwashed~~ o11ywashed ones.”

I guess this is a good place to start as any: If your “observability” tooling doesn’t help you understand the quality of your product from the customer’s perspective, EACH customer’s perspective, it isn’t fucking observability.

It’s just monitoring dressed up in marketing dollars.

Call it o11ywashing.

How many pillars of observability can you fit on the head of a pin?

October 30, 2025October 30, 2025 mipsytipsybunnies, multiple pillars, o11y 2.0, observability 2.0, three pillarsLeave a comment

My day started off with an innocent question, from an innocent soul.

“Hey Charity, is profiling a pillar?”

I hadn’t even had my coffee yet.

“Someone was just telling me that profiling is the fourth pillar of observability now. I said I think profiling is a great tool, but I don’t know if it quite rises to the level of pillar. What do you think?”

What….do.. I think.

What I think is, there are no pillars. I think the pillars are a fucking lie, dude. I think the language of pillars does a lot of work to keep good engineers trapped inside a mental model from the 1980s, paying outrageous sums of money for tooling that can’t keep up with the chaos and complexity of modern systems.

Here is a list of things I have recently heard people refer to as the “fourth pillar of observability”:

Profiling
Tokens (as in LLMs)
Errors, exceptions
Analytics
Cost

Is it a pillar, is it not a pillar? Are they all pillars? How many pillars are there?? How many pillars CAN there be? Gaahhh!

This is not a new argument. Take this ranty little tweet thread of mine from way back in 2018, for starters.

✨THERE ARE NO✨
✨THREE PILLARS OF✨
✨OBSERVABILITY.✨

and the fact that everybody keeps blindly repeating this mantra (and cargo culting these primitives) is probably why our observability tooling is 10 years behind the rest of our software tool chain. https://t.co/94yDBPuDRv
— Charity Majors (@mipsytipsy) September 25, 2018

Or perhaps you have heard of TEMPLE: Traces, Events, Metrics, Profiles, Logs, and Exceptions?

Or the “braid” of observability data, or “They Aren’t Pillars, They’re Lenses”, or the Lightstep version: “Three Pillars, Zero Answers” (that title is a personal favorite).

Alright, alright. Yes, this has been going on for a long time. I’m older now and I’m tireder now, so here’s how I’ll sum it up.

Pillar is a marketing term.
Signal is a technical term.

So “is profiling a pillar?” is a valid question, but it’s not a technical question. It’s a question about the marketing claims being made by a given company. Some companies are building a profiling product right now, so yes, to them, it is vitally important to establish profiling as a “pillar” of observability, because you can charge a hell of a lot more for a “pillar” than you can charge for a mere “feature”. And more power to them. But it doesn’t mean anything from a technical point of view.

On the other hand, “signal” is absolutely a technical term. The OpenTelemetry Signals documentation, which I consider canon, says that OTel currently supports Traces, Metrics, Logs, and Baggage as signal types, with Events and Profiles at the proposal/development stage. So yes, profiling is a type of signal.

The OTel docs define a telemetry signal as “a type of data transmitted remotely for monitoring and analysis”, and they define a pillar as … oh, they don’t even mention pillars? like at all??

I guess there’s your answer.

And this is probably where I should end my piece. (Why am I still typing…. 🤔)

Pillars vs signals

First of all, I want to stress that it does not bother me when engineers go around talking about pillars. Nobody needs to look at me guiltily and apologize for using the term ‘pillar’ at the bar after a conference because they think I’m mad at them. I am not the language police, it is not my job to go around enforcing correct use of technical terms. (I used to, I know, and I’m sorry! 😆)

When engineers talk about pillars of observability, they’re just talking about signals and signal types, and “pillar” is a perfectly acceptable colloquialism for “signal”.

When a vendor starts talking about pillars, though — as in the example above! — it means they are gearing up to sell you something: another type of signal, siloed off from all the other signals you send them. Your cost multiplier is about to increment again, and then they’re going to start talking about how Important it is that you buy a product for each and every one of the Pillars they happen to have.

As a refresher: there are two basic architecture models used by observability companies, the multiple pillars model and the unified storage model (aka o11y 2.0). The multiple pillars model is to store every type of signal in a different siloed storage location — metrics, logs, traces, profiling, exceptions, etc, everybody gets a database! The unified storage model is to store all signals together in ONE database, preserving context and relationships, so you can treat data like data: slice and dice, zoom in, zoom out, etc.

Most of the industry giants were built using the pillars model, but Honeycomb (and every other observability company founded post-2019) has built using the unified storage model, building wide, structured log events on a columnar storage engine with high cardinality support, and so on.

Bunny-hopping from pillar to pillar

When you use each signal type as a standalone pillar, this leads to an experience I think of as “bunny products” 🐇 where the user is always hopping from pillar to pillar. You see something on your metrics dashboard that looks scary? hop-hop to your logs and try to find it there, using grep and search and matching by timestamps. If you can find the right logs, then you need to trace it, so you hop-hop-hop to your traces and repeat your search there. With profiling as a pillar, maybe you can hop over to that dataset too.🐇🐰

The amount of data duplication involved in this model is mind boggling. You are literally storing the same information in your metrics TSDB as you are in your logs and your traces, just formatted differently. (I never miss an opportunity to link to Jeremy Morrell’s masterful doc on instrumenting your code for wide events, which also happens to illustrate this nicely.) This is insanely expensive. Every request that enters your system gets stored how many times, in how many signals? Count it up; that’s your cost multiplier.

Worse, much of the data that connects each “pillar” exists only in the heads of the most senior engineers, so they can guess or intuit their way around the system, but anyone who relies on actual data is screwed. Some vendors have added an ability to construct little rickety bridges post hoc between pillars, e.g. “this metric is derived from this value in this log line or trace”, but now you’re paying for each of those little bridges in addition to each place you store the data (and it goes without saying, you can only do this for things you can predict or hook up in the first place).

The multiple pillars model (formerly known as observability 1.0) relies on you believing that each signal type must be stored separately and treated differently. That’s what the pillars language is there to reinforce. Is it a Pillar or not?? It doesn’t matter because pillars don’t exist. Just know that if your vendor is calling it a Pillar, you are definitely going to have to Pay for it. 😉

Zooming in and out

But all this data is just.. data. There is no good reason to silo signals off from each other, and lots of good reasons not to. You can derive metrics from rich, structured data blobs, or append your metrics to wide, structured log events. You can add span IDs and visualize them as a trace. The unified storage model (“o11y 2.0”) says you should store your data once, and do all the signal processing in the collection or analysis stages. Like civilized folks.

Anya Bunny Quote - Etsy — All along, Anya was right

From the perspective of the developer, not much changes. It just gets easier (a LOT easier), because nobody is harping on you about whether this nit of data should be a metric, a log, a trace, or all of the above, or if it’s low cardinality or high cardinality, or whether the cardinality of the data COULD someday blow up, or whether it’s a counter, a gauge, a heatmap, or some other type of metric, or when the counter is going to get reset, or whether your heatmap buckets are defined at useful intervals, or…or…

Instead, it’s just a blob of json. Structured data.. If you think it might be interesting to you someday, you dump it in, and if not, you don’t. That’s all. Cognitive load drops way down..

On the backend side, we store it once, retaining all the signal type information and connective tissue.

It’s the user interface where things change most dramatically. No more bunny hopping around from pillar to pillar, guessing and copy-pasting IDs and crossing your fingers. Instead, it works more like the zoom function on PDFs or Google maps.

You start with SLOs, maybe, or a familiar-looking metrics dashboard. But instead of hopping, you just.. zoom in. The SLOs and metrics are derived from the data you need to debug with, so you’re just like.. “Ah what’s my SLO violation about? Oh, it’s because of these events.” Want to trace one of them? Just click on it. No hopping, no guessing, no pasting IDs around, no lining up time stamps.

Zoom in, zoom out, it’s all connected. Same fucking data.

“But OpenTelemetry FORCES you to use three pillars”

There’s a misconception out there that OpenTelemetry is very pro-three pillars, and very anti o11y 2.0. This is a) not true and b) actually the opposite. Austin Parker has written a voluminous amount of material explaining that actually, under the hood, OTel treats everything like one big wide structured event log.

As Austin puts it, “OpenTelemetry, fundamentally, unifies telemetry signals through shared, distributed context.” However:

“The project doesn’t require you to do this. Each signal is usable more or less independently of the other. If you want to use OpenTelemetry data to feed a traditional ‘three pillars’ system where your data is stored in different places, with different query semantics, you can. Heck, quite a few very successful observability tools let you do that today!”

“This isn’t just ‘three pillars but with some standards on top,’ it’s a radical departure from the traditional ‘log everything and let god sort it out’ approach that’s driven observability practices over the past couple of decades.”

You can use OTel to reinforce a three pillars mindset, but you don’t have to. Most vendors have chosen to implement three pillarsy crap on top of it, which you can’t really hold OTel responsible for. One[1] might even argue that OTel is doing as much as it can to influence you in the opposite direction, while still meeting Pillaristas where they’re at.

A postscript on profiling

What will profiling mean in a unified storage world? It just means you’ll be able to zoom in to even finer and lower-level resolution, down to syscalls and kernel operations instead of function calls. Like when Google Maps got good enough that you could read license plates instead of just rooftops.

Admittedly, we don’t have profiling yet at Honeycomb. When we did some research into the profiling space, what we learned was that most of the people who think they’re in desperate need of a profiling tool are actually in need of a good tracing tool. Either they didn’t have distributed tracing or their tracing tools just weren’t cutting it, for reasons that are not germane in a Honeycomb tracing world.

We’ll get to profiling, hopefully in the near-ish future, but for the most part, if you don’t need syscall level data, you probably don’t need profiling data either. Just good traces.

Also… I did not make this site or have any say whatsoever in the building of it, but I did sign the manifesto[2] and every day that I remember it exists is a day I delight in the joy and fullness of being alive: kill3pill.com 📈

Hop hop, little friends,
~charity

[1] Austin argues this. I’m talking about Austin, if not clear enough.
[2] Thank you, John Gallagher!!

The Cost Crisis in Observability Tooling

January 24, 2024December 21, 2024 mipsytipsycost control, metrics, observability 2.0, three pillarsLeave a comment

Originally posted on the Honeycomb blog on January 24th, 2024

The cost of services is on everybody’s mind right now, with interest rates rising, economic growth slowing, and organizational budgets increasingly feeling the pinch. But I hear a special edge in people’s voices when it comes to their observability bill, and I don’t think it’s just about the cost of goods sold. I think it’s because people are beginning to correctly intuit that the value they get out of their tooling has become radically decoupled from the price they are paying.

In the happiest cases, the price you pay for your tools is “merely” rising at a rate several times faster than the value you get out of them. But that’s actually the best case scenario. For an alarming number of people, the value they get actually decreases as their bill goes up.

Observability 1.0 and the cost multiplier effect

Are you familiar with this chestnut?

“Observability has three pillars: metrics, logs, and traces.”

This isn’t exactly true, but it’s definitely true of a particular generation of tools—one might even say definitionally true of a particular generation of tools. Let’s call it “observability 1.0.”

From an evolutionary perspective, you can see how we got here. Everybody has logs… so we spin up a service for log aggregation. But logs are expensive and everybody wants dashboards… so we buy a metrics tool. Software engineers want to instrument their applications… so we buy an APM tool. We start unbundling the monolith into microservices, and pretty soon we can’t understand anything without traces… so we buy a tracing tool. The front-end engineers point out that they need sessions and browser data… so we buy a RUM tool. On and on it goes.

Logs, metrics, traces, APM, RUM. You’re now paying to store telemetry five different ways, in five different places, for every single request. And a 5x multiplier is on the modest side of the spectrum, given how many companies pay for multiple overlapping tools in the same category. You may also also be collecting:

Profiling data
Product analytics
Business intelligence data
Database monitoring/query profiling tools
Mobile app telemetry
Behavioral analytics
Crash reporting
Language-specific profiling data
Stack traces
CloudWatch or hosting provider metrics
…and so on.

So, how many times are you paying to store data about your user requests? What’s your multiplier? (If you have one consolidated vendor bill, this may require looking at your itemized bill.)

There are many types of tools, each gathering slightly different data for a slightly different use case, but underneath the hood there are really only three basic data types: the metric, unstructured logs, and structured logs. Each of these have their own distinctive trade-offs when it comes to how much they cost and how much value you can get out of them.

Metrics

Metrics are the great-granddaddy of telemetry formats; tiny, fast, and cheap. A “metric” consists of a single number, often with tags appended. All of the context of the request gets discarded at write time; each individual metric is emitted separately. This means you can never correlate one metric with another from the same request, or select all the metrics for a given request ID, user, or app ID, or ask arbitrary new questions about your metrics data.

Metrics-based tools include vendors like Datadog and open-source projects like Prometheus. RUM tools are built on top of metrics to understand browser user sessions; APM tools are built on top of metrics to understand application performance.

When you set up a metrics tool, it generally comes prepopulated with a bunch of basic metrics, but the useful ones are typically the custom metrics you emit from your application.

Your metrics bill is usually dominated by the cost of these custom metrics. At minimum, your bill goes up linearly with the number of custom metrics you create. Which is unfortunate, because to restrain your bill from unbounded growth, you have to regularly audit your metrics, do your best to guess which ones are going to be useful in the future, and prune any you think you can afford to go without. Even in the hands of experts, these tools require significant oversight.

Linear cost growth is the goal, but it’s rarely achieved. The cost of each metric varies wildly depending on how you construct it, what the values are, how often it gets hit, etc. I’ve seen a single custom metric cost $30k per month. You probably have dozens of custom metrics per service, and it’s almost impossible to tell how much each of them costs you. Metrics bills tend to be incredibly opaque (possibly by design).

Nobody can understand their software or their systems with a metrics tool alone, because the metric is extremely limited in what it can do. No context, no cardinality, no strings… only basic static dashboards. For richer data, we must turn to logs.

Unstructured logs

You can understand much more about your code with logs than you can with metrics. Logs are typically emitted multiple times throughout the execution of the request, with one or a small number of nouns per log line, plus the request ID. Unstructured logs are still the default, although this is slowly changing.

The cost of unstructured logs is driven by a few things:

Write amplification. If you want to capture lots of rich context about the request, you need to emit a lot of log lines. If you are printing out just 10 log lines per request, per service, and you have half a dozen services, that’s 60 log events for every request.
Noisiness. It’s extremely easy to accidentally blow up your log footprint yet add no value—e.g., by putting a print statement inside a loop instead of outside the loop. Here, the usefulness of the data goes down as the bill shoots up.
Constraints on physical resources. Due to the write amplification of log lines per request, it’s often physically impossible to log everything you want to log for all requests or all users—it would saturate your NIC or disk. Therefore, people tend to use blunt instruments like these to blindly slash the log volume:
- Log levels
- Consistent hashes
- Dumb sample rates

When you emit multiple log lines per request, you end up duplicating a lot of raw data; sometimes over half the bits are consumed by request ID, process ID, timestamp. This can be quite meaningful from a cost perspective.

All of these factors can be annoying. But the worst thing about unstructured logs is that the only thing you can do to query them is full text search. The more data you have, the slower it becomes to search that data, and there’s not much you can do about it.

Searching your logs over any meaningful length of time can take minutes or even hours, which means experimenting and looking around for unknown-unknowns is prohibitively time-consuming. You have to know what to look for in order to find it. Once again, as your logging bill goes up, the value goes down.

Structured logs

Structured logs are gaining adoption across the industry, especially as OpenTelemetry picks up steam. The nice thing about structured logs is that you can actually do things with the data other than slow, dumb string searches. If you’ve structured your data properly, you can perform calculations! Compute percentiles! Generate heatmaps!

Tools built on structured logs are so clearly the future. But just taking your existing logs and adding structure isn’t quite good enough. If all you do is stuff your existing log lines into key/value pairs, the problems of amplification, noisiness, and physical constraints remain unchanged—you can just search more efficiently and do some math with your data.

There are a number of things you can and should do to your structured logs in order to use them more effectively and efficiently. In order of achievability:

Instrument your code using the principles of canonical logs, which collects all the vital characteristics of a request into one wide, dense event. It is difficult to overstate the value of doing this, for reasons of usefulness and usability as well as cost control.
Add trace IDs and span IDs so you can trace your code using the same events instead of having to use an entirely separate tool.
Feed your data into a columnar storage engine so you don’t have to predefine a schema or indexes to decide which dimensions future you can search or compute based on.
Use a storage engine that supports high cardinality, with an explorable interface.

If you go far enough down this path of enriching your structured events, instrumenting your code with the right data, and displaying it in real time, you will reach an entirely different set of capabilities, with a cost model so distinct it can only be described as “observability 2.0.” More on that in a second.

Ballooning costs are baked into observability 1.0

To recap: high costs are baked into the observability 1.0 model. Every pillar has a price.

You have to collect and store your data—and pay to store it—again and again and again, for every single use case. Depending on how many tools you use, your observability bill may be growing at a rate 3x faster than your traffic is growing, or 5x, or 10x, or even more.

It gets worse. As your costs go up, the value you get out of your tools goes down.

Your logs get slower and slower to search.
You have to know what you’re searching for in order to find it.
You have to use blunt force sampling technique to keep log volume from blowing up.
Any time you want to be able to ask a new question, you first have to commit new code and deploy it.
You have to guess which custom metrics you’ll need and which fields to index in advance.
As volume goes up, your ability to find a needle in the haystack—any unknown-unknowns—goes down commensurately.

And nothing connects any of these tools. You cannot correlate a spike in your metrics dashboard with the same requests in your logs, nor can you trace one of the errors. It’s impossible. If your APM and metrics tools report different error rates, you have no way of resolving this confusion. The only thing connecting any of these tools is the intuition and straight-up guesses made by your most senior engineers. Which means that the cognitive costs are immense, and your bus factor risks are very real. The most important connective data in your system—connecting metrics with logs, and logs with traces—exists only in the heads of a few people.

At the same time, the engineering overhead required to manage all these tools (and their bills) rises inexorably. With metrics, an engineer needs to spend time auditing your metrics, tracking people down to fix poorly constructed metrics, and reaping those that are too expensive or don’t get used. With logs, an engineer needs to spend time monitoring the log volume, watching for spammy or duplicate log lines, pruning or consolidating them, choosing and maintaining indexes.

But all this the time spent wrangling observability 1.0 data types isn’t even the costliest part. The most expensive part is the unseen costs inflicted on your engineering organization as development slows down and tech debt piles up, due to low visibility and thus low confidence.

Is there an alternative? Yes.

The cost model of observability 2.0 is very different

Observability 2.0 has no three pillars; it has a single source of truth. Observability 2.0 tools are built on top of arbitrarily-wide structured log events, also known as spans. From these wide, context-rich structured log events you can derive the other data types (metrics, logs, or traces).

Since there is only one data source, you can correlate and cross-correlate to your heart’s content. You can switch fluidly back and forth between slicing and dicing, breaking down or grouping by events, and viewing them as a trace waterfall. You don’t have to worry about cardinality or key space limitations.

You also effectively get infinite custom metrics, since you can append as many as you want to the same events. Not only does your cost not go up linearly as you add more custom metrics, your telemetry just gets richer and more valuable the more key-value pairs you add! Nor are you limited to numbers; you can add any and all types of data, including valuable high-cardinality fields like “App Id” or “Full Name.”

Observability 2.0 has its own amplification factor to consider. As you instrument your code with more spans per request, the number of events you have to send (and pay for) goes up. However, you have some very powerful tools for dealing with this: you can perform dynamic head-based sampling or even tail-based sampling, where you decide whether or not to keep the event after it’s finished, allowing you to capture 100% of slow requests and other outliers.

Engineering time is your most precious resource

But the biggest difference between observability 1.0 and 2.0 won’t show up on any invoice. The difference shows up in your engineering team’s ability to move quickly, with confidence.

Modern software engineering is all about hooking up fast feedback loops. And observability 2.0 tooling is what unlocks the kind of fine-grained, exploratory experience you need in order to accelerate those feedback loops.

Where observability 1.0 is about MTTR, MTTD, reliability, and operating software, observability 2.0 is what underpins the entire software development lifecycle, setting the bar for how swiftly you can build and ship software, find problems, and iterate on them. Observability 2.0 is about being in conversation with your code, understanding each user’s experience, and building the right things.

Observability 2.0 isn’t exactly cheap either, although it is often less expensive. But the key difference between o11y 1.0 and o11y 2.0 has never been that either is cheap; it’s that with observability 2.0, when your bill goes up, the value you derive from your telemetry goes up too. You pay more money, you get more out of your tools. As you should.

Interested in learning more? We’ve written at length about the technical prerequisites for observability with a single source of truth (“observability 2.0” as we’ve called it here). Honeycomb was built to this spec; ServiceNow (formerly Lightstep) and Baselime are other vendors that qualify. Click here to get a Honeycomb demo.

CORRECTION: The original version of this document said that “nothing connects any of these tools.” If you are using a single unified vendor for your metrics, logging, APM, RUM, and tracing tools, this is not strictly true. Vendors like New Relic or Datadog now let you define certain links between your traces and metrics, which allows you to correlate between data types in a few limited, predefined ways. This is better than nothing! But it’s very different from the kind of fluid, open-ended correlation capabilities that we describe with o11y 2.0. With o11y 2.0, you can slice and dice, break down, and group by your complex data sets, then grab a trace that matches any specific set of criteria at any level of granularity. With o11y 1.0, you can define a metric up front, then grab a random exemplar of that metric, and that’s it. All the limitations of metrics still apply; you can’t correlate any metric with any other metric from that request, app, user, etc, and you certainly can’t trace arbitrary criteria. But you’re right, it’s not nothing. 😸

charity.wtf

charity wtf's about technology, databases, startups, engineering management, and whiskey.

three pillars

Your Data is Made Powerful By Context (so stop destroying it already) (xpost)

Your Data Is Made Powerful By Context (so stop destroying it already)