Is It Time To Version Observability? (Signs Point To Yes)

Augh! I am so behind on so much writing, I’m even behind on writing shit that I need to reference in order to write other pieces of writing. Like this one. So we’re just gonna do this quick and dirty on the personal blog, and not bother bringing it up to the editorial standards of…anyone else’s sites. 😬

If you’d rather consume these ideas in other ways:

What does observability mean? No one knows

In 2016, we first borrowed the term “observability” from the Wikipedia entry for control systems observability, where it is a measure of how well you can understand a system’s internal states just by observing its outputs. We (Honeycomb) then spent a couple of years trying to work out how that definition might apply to software systems. Many twitter threads, podcasts, blog posts and lengthy laundry lists of technical criteria emerged from that work, including a whole ass book.

Metrics, logs, tracing, drama

In 2018, Peter Bourgon wrote a blog post proposing that “observability has three pillars: metrics, logs and traces.” Ben Sigelman did a masterful job of unpacking why metrics, logs and traces are just telemetry. However, lots of people latched on to the “three pillars” language: vendors because they (coincidentally!) had metrics products, logging products, and tracing products to sell; engineers because it described their daily reality.

Since then the industry has been stuck in kind of a weird space, where the language used to describe the problems and solutions has evolved, but the solutions themselves are largely the same ones as five years ago, or ten years ago. They’ve improved, of course — massively improved — but structurally they’re variations on the same old pre-aggregated metrics.

It has gotten harder and harder to speak clearly about different philosophical approaches and technical solutions without wading deep into the weeds, where no one but experts should reasonably have to go.

This is what semantic versioning was made for

Look, I am not here to be the language police. I stopped correcting people on twitter back in 2019. We all do observability! One big happy family. 👍

I AM here to help engineers think clearly and crisply about the problems in front of them. So here we go. The metrics, logs and traces crowd — the “three pillars” generation of tooling — that’s “Observability 1.0”. Tools like Honeycomb, built on arbitrarily-wide structured log events as a single source of truth — that’s “Observability 2.0”.

Here is the twitter thread where I first teased out the differences between these generations of tooling (all the way back in December, yes, that’s how long I’ve been meaning to write this 😅).

This is literally the problem that semantic versioning was designed to solve, by the way. Major version bumps are reserved for backwards-incompatible, breaking changes, and that’s what this is. You cannot simultaneously store your data across both multiple pillars and a single source of truth.

Incompatible. Breaking change. O11y 1.0, meet O11y 2.0.

small technical changes can unlock waves of powerful sociotechnical transformation

There are a LOT of ramifications and consequences that flow from this one small change in how your data gets stored. I don’t have the time or space to go into all of them here, but I will do a quick overview of the most important ones.

The Cloud: A Cautionary Tale

The historical analogue that keeps coming to mind for me is virtualization. VMs are old technology; they’ve been around since the 70s. But it wasn’t until the late 90s that VMware productized virtualization, unlocking wave after wave of change, from cloud computing and SaaS to the very DevOps movement itself.

I believe the shift to observability 2.0 holds a similarly massive potential for change, based on what I see happening today, with teams who have already made the leap. Why?  In a word, precision. O11y 1.0 can only ever give you aggregates and random exemplars. O11y 2.0, on the other hand, can tell you precisely what happened when you flipped a flag, deployed to a canary, or made any other change in production.

Will these waves of sociotechnical transformation ever be realized? Who knows. The changes that get unlocked will depend to some extent on us (Honeycomb), and to an even greater extent on engineers like you. Anyway, I’ll talk about this more some other time. Right now, I just want to establish a baseline for this vocabulary.

1.0 vs 2.0: How does the data get stored?

1.0 💙 O11y 1.0 has many sources of truth, in many different formats. Typically, you end up storing your data across metrics, logs, traces, APM, RUM, profiling, and possibly other tools as well. Some folks even find themselves falling back to B.I. (business intelligence) tools like Tableau in a pinch to understand what’s happening on their systems.

Each of these tools is siloed, with no connective tissue, or only a few predefined links that connect, e.g., a specific metric to a specific log line. Aggregation is done at write time, so you have to decide up front which data points to collect and which questions you want to be able to ask. You may find yourself eyeballing graph shapes and assuming they must be the same data, or copy-pasting IDs from logging tools to tracing tools and back.

Wide Events are Kenough

2.0 💚 Data gets stored in arbitrarily-wide structured log events (often called “canonical logs”), often with trace and span IDs appended. You can visualize the events over time as a trace, or slice and dice your data to zoom in on individual events or zoom out to a bird’s-eye view. You can interact with your data by grouping by, breaking down, etc.

You aggregate at read time, and preserve raw events for ad hoc querying. Hopefully, you derive your SLO data from the same data you query! Think of it as B.I. for systems/app/business data, all in one place. You can derive metrics, or logs, or traces, but it’s all the same data.
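
To make “aggregate at read time” concrete, here is a minimal, vendor-neutral sketch in Go. Everything in it is invented for illustration: the Event shape, the p95ByEndpoint helper, and field names like endpoint and duration_ms. The point is that the raw wide events stay intact, and the number you care about is computed at query time.

package main

import (
	"fmt"
	"sort"
)

// Event is an arbitrarily-wide structured log event; in practice it would
// carry hundreds of fields, including trace and span IDs.
type Event map[string]any

// p95ByEndpoint aggregates at read time: nothing was rolled up when the
// events were written, so any other grouping is equally possible.
func p95ByEndpoint(events []Event) map[string]float64 {
	byEndpoint := map[string][]float64{}
	for _, e := range events {
		endpoint, _ := e["endpoint"].(string)
		duration, _ := e["duration_ms"].(float64)
		byEndpoint[endpoint] = append(byEndpoint[endpoint], duration)
	}
	out := map[string]float64{}
	for endpoint, durs := range byEndpoint {
		sort.Float64s(durs)
		out[endpoint] = durs[int(float64(len(durs)-1)*0.95)]
	}
	return out
}

func main() {
	events := []Event{
		{"endpoint": "/api/v2/search", "duration_ms": 167.0, "user_id": "344310", "status_code": 200},
		{"endpoint": "/api/v2/search", "duration_ms": 412.0, "user_id": "812", "status_code": 500},
		{"endpoint": "/login", "duration_ms": 38.0, "user_id": "344310", "status_code": 200},
	}
	fmt.Println(p95ByEndpoint(events))
}

Swap the group-by field for user_id or build_id and you have asked a brand new question of the same raw data, with no redeploy and no new instrumentation.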

1.0 vs 2.0: on metrics vs logs

1.0 💙 The workhorse of o11y 1.0 is metrics. RUM tools are built on metrics to understand browser user sessions. APM tools are built using metrics to understand application performance. Long ago, the decision was made to use metrics as the source of truth for telemetry because they are cheap and fast, and hardware used to be incredibly expensive.

The more complex our systems get, the worse this tradeoff becomes. Metrics are a terrible building block for understanding rich data, because you have to discard all that valuable context at write time, and they don’t support high (or even medium!) cardinality data. The only way to enrich the data is via tags.

Metrics are a great tool for cheaply summarizing vast quantities of data. They are not equipped to help you introspect or understand complex systems. You will go broke and go mad if you try.

2.0 💚 The building block of o11y 2.0 is wide, structured log events. Logs are infinitely more powerful, useful and cost-effective than metrics are because they preserve context and relationships between data, and data is made valuable by context. Logs also allow you to capture high cardinality data and data relationships/structures, which give you the ability to compute outliers and identify related events.
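
Here is a tiny, hedged sketch of the difference (the function names, field names, and values are all made up): the same failed request reported as a handful of disconnected metrics versus one wide event that keeps the context together, including high-cardinality fields a metrics system would choke on.

package main

import "fmt"

// o11y 1.0 style: each metric is a lone number plus low-cardinality tags.
// Nothing ties these back to the specific request that produced them.
func emitMetrics() {
	fmt.Println(`http.requests{endpoint="/api/v2/search",status="500"} 1`)
	fmt.Println(`http.duration_ms{endpoint="/api/v2/search"} 412`)
}

// o11y 2.0 style: one wide event per request per service, carrying the whole
// story, including high-cardinality fields like user_id and trace_id.
func emitWideEvent() {
	fmt.Println(map[string]any{
		"endpoint":    "/api/v2/search",
		"status_code": 500,
		"duration_ms": 412,
		"user_id":     "344310",
		"build_id":    "3150",
		"trace_id":    "f46691dfeda9ede4",
		"error":       "timeout calling fraud service",
	})
}

func main() {
	emitMetrics()
	emitWideEvent()
}

From the metrics you can plot a graph; from the wide event you can still plot that graph, and also ask which users, builds, or traces were involved.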

1.0 vs 2.0: Who uses it, and how?

1.0 💙 Observability 1.0 is predominantly about how you operate your code. It centers around errors, incidents, crashes, bugs, user reports and problems. MTTR, MTTD, and reliability are top concerns.

O11y 1.0 is typically consumed using static dashboards — lots and lots of static dashboards. “Single pane of glass” is often mentioned as a holy grail. It’s easy to find something once you know what you’re looking for, but you need to know to look for it before you can find it.

2.0 💚 If o11y 1.0 is about how you operate your code, o11y 2.0 is about how you develop your code. O11y 2.0 is what underpins the entire software development lifecycle, enabling engineers to connect feedback loops end to end so they get fast feedback on the changes they make, while it’s still fresh in their heads. This is the foundation of your team’s ability to move swiftly, with confidence. It isn’t just about understanding bugs and outages, it’s about proactively understanding your software and how your users are experiencing it.

Thus, o11y 2.0 has a much more exploratory, open-ended interface. Any dashboards should be dynamic, allowing you to drill down into a question or follow a trail of breadcrumbs as part of the debugging/understanding process. The canonical question of o11y 2.0 is “here’s a thing I care about … why do I care about it? What are all of the ways it is different from all the other things I don’t care about?”

When it comes to understanding your software, it’s often harder to identify the question than the answer. Once you know what the question is, you probably know the answer too. With o11y 1.0, it’s very easy to find something once you know what you’re looking for. With o11y 2.0, that constraint is removed.

1.0 vs 2.0: How do you interact with production?

1.0 💙 You deploy your code and wait to get paged. 🤞 Your job is done as a developer when you commit your code and tests pass.

2.0 💚 You practice observability-driven development: as you write your code, you instrument it. You deploy to production, then inspect your code through the lens of the instrumentation you just wrote. Is it behaving the way you expected it to? Does anything else look … weird?

Your job as a developer isn’t done until you know it’s working in production. Deploying to production is the beginning of gaining confidence in your code, not the denouement.

1.0 vs 2.0: How do you debug?

1.0 💙 You flip from dashboard to dashboard, pattern-matching and looking for similar shapes with your eyeballs.

You lean heavily on intuition, educated guesses, past experience, and a mental model of the system. This means that the best debuggers are ALWAYS the engineers who have been there the longest and seen the most.

Your debugging sessions are search-first: you start by searching for something you know should exist.

2.0 💚 You check your instrumentation, or you watch your SLOs. If something looks off, you see what all the mysterious events have in common, or you start forming hypotheses, asking a question, considering the result, and forming another one based on the answer. You interrogate your systems, following the trail of breadcrumbs to the answer, every time.

You don’t have to guess or rely on elaborate, inevitably out-of-date mental models. The data is right there in front of your eyes. The best debuggers are the people who are the most curious.

Your debugging questions are analysis-first: you start with your user’s experience.

1.0 vs 2.0: The cost model

1.0 💙 You pay to store your data again and again and again and again, multiplied by all the different formats and tool types you are paying to store it in. Cost goes up at a multiplier of your traffic increase. I wrote a whole piece earlier this year on the cost crisis in observability tooling, so I won’t go into it in depth here.

As your costs increase, the value you get out of your tools actually decreases.

If you are using metrics-based products, your costs go up based on cardinality. “Custom metrics” is a euphemism for cardinality; “100 free custom metrics” really means “100 free unique values.”

2.0 💚 You pay to store your data once. As your costs go up, the value you get out goes up too. You have powerful, surgical options for controlling costs via head-based or tail-based dynamic sampling.

You can have infinite cardinality. You are encouraged to pack hundreds or thousands of dimensions in per event, and any or all of those dimensions can be any data type you want. This luxurious approach to cardinality and data is one of the least well understood aspects of the switch from o11y 1.0 to 2.0.

Many observability engineering teams have spent their entire careers massaging cardinality to control costs. What if you just .. didn’t have to do that? What would you do with your lives? If you could just store and query on all the crazy strings you want, forever? 🌈

Metrics are a bridge to our past

Why are observability 1.0 tools so unbelievably, eyebleedingly expensive? As anyone who works with data can tell you, this is always what happens when you use the wrong tool for the job. Once again, metrics are a great tool for summarizing vast quantities of data. When it comes to understanding complex systems, they flail.

I wrote a whole whitepaper earlier this year that did a deep dive into exactly why tools built on top of metrics are so unavoidably costly. If you want the gnarly detail, download that.

The TLDR is this: tools built on metrics — whether RUM, APM, dashboards, etc — are a bridge to our past. If there’s one thing I’m certain of, it’s that tools built on top of wide, structured logs are the bridge to our future.

Wide, structured log events are the bridge to our future

Five years from now, I predict that the center of gravity will have swung dramatically; all modern engineering teams will be powering their telemetry off of tools backed by wide, structured log events, not metrics. It’s getting harder and harder and harder to try and wring relevant insights out of metrics-based observability tools. The end of the ZIRP era is bringing unprecedented cost pressure to bear, and it’s simply a matter of time.

The future belongs to tools built on wide, structured log events — a single source of truth that you can trace over time, or zoom in, zoom out, derive SLOs from, etc.

It’s the only way to understand our systems in all their skyrocketing complexity. The constant dance of cost vs cardinality consumes entire teams’ worth of engineers and adds zero value. It adds negative value.

And here’s the weirdest part. The main thing holding most teams back psychologically from embracing o11y 2.0 seems to be the entrenched difficulties they have grappling with o11y 1.0, and their sense that they can’t adopt 2.0 until they get a handle on 1.0. Which gets things exactly backwards.

Because observability 2.0 is so much easier, simpler, and more cost effective than 1.0.

observability 1.0 *is* the hard way

It’s so fucking hard. We’ve been doing it so long that we are blind to just how HARD it is. But trying to teach teams of engineers to wrangle metrics, to squeeze the questions they want to ask into multiple abstract formats scattered across many different tools, with no visibility into what they’re doing until it comes out eventually in the form of a giant bill… it’s fucking hard.

Observability 2.0 is so much simpler. You want data, you just toss it in. Format? don’t care. Cardinality? don’t care.

You want to ask the question, you just ask it. Format? don’t care.

Teams are beating themselves up trying to master an archaic, unmasterable set of technical tradeoffs based on data types from the 80s. It’s an unwinnable war. We can’t understand today’s complex systems without context-rich, explorable data.

We need more options for observability 2.0 tooling

My hope is that by sketching out these technical differences between o11y 1.0 and 2.0, we can begin to collect and build up a vendor-neutral library of o11y 2.0 options for folks. The world needs more options for understanding complex systems besides just Honeycomb and Baselime.

The world desperately needs an open source analogue to Honeycomb — something built for wide structured events, stored in a columnar store (or even just Clickhouse), with an interactive interface. Even just a written piece on how you solved it at your company would help move the industry forward.

My other hope is that people will stop building new observability startups built on metrics. Y’all, Datadog and Prometheus are the last, best metrics-backed tools that will ever be built. You can’t catch up to them or beat them at that; no one can. Do something different. Build for the next generation of software problems, not the last generation.

If anyone knows of anything along these lines, please send me links? I will happily collect them and signal boost. Honeycomb is a great, lifechanging tool (and we have a generous free tier, hint hint) but one option does not a movement make.

<3 charity

P.S. Here’s a great piece written by Ivan Burmistrov on his experience using observability 2.0 type tooling at Facebook — namely Scuba, which was the inspiration for Honeycomb. It’s a terrific piece and you should read it.

P.P.S. And if you’re curious, here’s the long twitter thread I wrote in October of 2023 on how we lost the battle to define observability:

The Cost Crisis in Observability Tooling

Originally posted on the Honeycomb blog on January 24th, 2024

The cost of services is on everybody’s mind right now, with interest rates rising, economic growth slowing, and organizational budgets increasingly feeling the pinch. But I hear a special edge in people’s voices when it comes to their observability bill, and I don’t think it’s just about the cost of goods sold. I think it’s because people are beginning to correctly intuit that the value they get out of their tooling has become radically decoupled from the price they are paying.

In the happiest cases, the price you pay for your tools is “merely” rising at a rate several times faster than the value you get out of them. But that’s actually the best case scenario. For an alarming number of people, the value they get actually decreases as their bill goes up.

Observability 1.0 and the cost multiplier effect

Are you familiar with this chestnut?

“Observability has three pillars: metrics, logs, and traces.”

This isn’t exactly true, but it’s definitely true of a particular generation of tools—one might even say definitionally true of a particular generation of tools. Let’s call it “observability 1.0.”

From an evolutionary perspective, you can see how we got here. Everybody has logs… so we spin up a service for log aggregation. But logs are expensive and everybody wants dashboards… so we buy a metrics tool. Software engineers want to instrument their applications… so we buy an APM tool. We start unbundling the monolith into microservices, and pretty soon we can’t understand anything without traces… so we buy a tracing tool. The front-end engineers point out that they need sessions and browser data… so we buy a RUM tool. On and on it goes.

Logs, metrics, traces, APM, RUM. You’re now paying to store telemetry five different ways, in five different places, for every single request. And a 5x multiplier is on the modest side of the spectrum, given how many companies pay for multiple overlapping tools in the same category. You may also be collecting:

  • Profiling data
  • Product analytics
  • Business intelligence data
  • Database monitoring/query profiling tools
  • Mobile app telemetry
  • Behavioral analytics
  • Crash reporting
  • Language-specific profiling data
  • Stack traces
  • CloudWatch or hosting provider metrics
  • …and so on.

So, how many times are you paying to store data about your user requests? What’s your multiplier? (If you have one consolidated vendor bill, this may require looking at your itemized bill.)

There are many types of tools, each gathering slightly different data for a slightly different use case, but underneath the hood there are really only three basic data types: metrics, unstructured logs, and structured logs. Each of these has its own distinctive trade-offs when it comes to how much it costs and how much value you can get out of it.

Metrics

Metrics are the great-granddaddy of telemetry formats; tiny, fast, and cheap. A “metric” consists of a single number, often with tags appended. All of the context of the request gets discarded at write time; each individual metric is emitted separately. This means you can never correlate one metric with another from the same request, or select all the metrics for a given request ID, user, or app ID, or ask arbitrary new questions about your metrics data.

Metrics-based tools include vendors like Datadog and open-source projects like Prometheus. RUM tools are built on top of metrics to understand browser user sessions; APM tools are built on top of metrics to understand application performance.

When you set up a metrics tool, it generally comes prepopulated with a bunch of basic metrics, but the useful ones are typically the custom metrics you emit from your application.

Your metrics bill is usually dominated by the cost of these custom metrics. At minimum, your bill goes up linearly with the number of custom metrics you create. Which is unfortunate, because to restrain your bill from unbounded growth, you have to regularly audit your metrics, do your best to guess which ones are going to be useful in the future, and prune any you think you can afford to go without. Even in the hands of experts, these tools require significant oversight.

Linear cost growth is the goal, but it’s rarely achieved. The cost of each metric varies wildly depending on how you construct it, what the values are, how often it gets hit, etc. I’ve seen a single custom metric cost $30k per month. You probably have dozens of custom metrics per service, and it’s almost impossible to tell how much each of them costs you. Metrics bills tend to be incredibly opaque (possibly by design).
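
Here is a rough back-of-the-envelope sketch of why: most metrics vendors bill per unique time series, and a time series is every distinct combination of tag values, so the cost of “one” custom metric is really a multiplication problem. The tag counts below are hypothetical.

package main

import "fmt"

func main() {
	// Hypothetical tag cardinalities for a single custom metric.
	endpoints := 50
	statusCodes := 10
	hosts := 200

	// Billing is typically per unique time series: one per distinct tag combination.
	series := endpoints * statusCodes * hosts
	fmt.Println("time series for one custom metric:", series) // 100000

	// Add one more tag -- say customer_id with 1,000 values -- and the series
	// count (and the bill) multiplies again.
	fmt.Println("after adding customer_id:", series*1000) // 100000000
}

One innocuous-looking tag addition can multiply the bill, which is exactly why the cost of any individual metric is so hard to predict.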

Nobody can understand their software or their systems with a metrics tool alone, because the metric is extremely limited in what it can do. No context, no cardinality, no strings… only basic static dashboards. For richer data, we must turn to logs.

Unstructured logs

You can understand much more about your code with logs than you can with metrics. Logs are typically emitted multiple times throughout the execution of the request, with one or a small number of nouns per log line, plus the request ID. Unstructured logs are still the default, although this is slowly changing.

The cost of unstructured logs is driven by a few things:

  • Write amplification. If you want to capture lots of rich context about the request, you need to emit a lot of log lines. If you are printing out just 10 log lines per request, per service, and you have half a dozen services, that’s 60 log events for every request.
  • Noisiness. It’s extremely easy to accidentally blow up your log footprint yet add no value—e.g., by putting a print statement inside a loop instead of outside the loop. Here, the usefulness of the data goes down as the bill shoots up.
  • Constraints on physical resources. Due to the write amplification of log lines per request, it’s often physically impossible to log everything you want to log for all requests or all users—it would saturate your NIC or disk. Therefore, people tend to use blunt instruments like these to blindly slash the log volume:
    • Log levels
    • Consistent hashes
    • Dumb sample rates

When you emit multiple log lines per request, you end up duplicating a lot of raw data; sometimes over half the bits are consumed by request ID, process ID, timestamp. This can be quite meaningful from a cost perspective.

All of these factors can be annoying. But the worst thing about unstructured logs is that the only thing you can do to query them is full text search. The more data you have, the slower it becomes to search that data, and there’s not much you can do about it.

Searching your logs over any meaningful length of time can take minutes or even hours, which means experimenting and looking around for unknown-unknowns is prohibitively time-consuming. You have to know what to look for in order to find it. Once again, as your logging bill goes up, the value goes down.

Structured logs

Structured logs are gaining adoption across the industry, especially as OpenTelemetry picks up steam. The nice thing about structured logs is that you can actually do things with the data other than slow, dumb string searches. If you’ve structured your data properly, you can perform calculations! Compute percentiles! Generate heatmaps!

Tools built on structured logs are so clearly the future. But just taking your existing logs and adding structure isn’t quite good enough. If all you do is stuff your existing log lines into key/value pairs, the problems of amplification, noisiness, and physical constraints remain unchanged—you can just search more efficiently and do some math with your data.

There are a number of things you can and should do to your structured logs in order to use them more effectively and efficiently. In order of achievability:

  • Instrument your code using the principles of canonical logs, collecting all the vital characteristics of a request into one wide, dense event. It is difficult to overstate the value of doing this, for reasons of usefulness and usability as well as cost control.
  • Add trace IDs and span IDs so you can trace your code using the same events instead of having to use an entirely separate tool.
  • Feed your data into a columnar storage engine so you don’t have to predefine a schema or indexes to decide which dimensions future you can search or compute based on.
  • Use a storage engine that supports high cardinality, with an explorable interface.

If you go far enough down this path of enriching your structured events, instrumenting your code with the right data, and displaying it in real time, you will reach an entirely different set of capabilities, with a cost model so distinct it can only be described as “observability 2.0.” More on that in a second.

Ballooning costs are baked into observability 1.0

To recap: high costs are baked into the observability 1.0 model. Every pillar has a price.

You have to collect and store your data—and pay to store it—again and again and again, for every single use case. Depending on how many tools you use, your observability bill may be growing at a rate 3x faster than your traffic is growing, or 5x, or 10x, or even more.

It gets worse. As your costs go up, the value you get out of your tools goes down.

  • Your logs get slower and slower to search.
  • You have to know what you’re searching for in order to find it.
  • You have to use blunt-force sampling techniques to keep log volume from blowing up.
  • Any time you want to be able to ask a new question, you first have to commit new code and deploy it.
  • You have to guess which custom metrics you’ll need and which fields to index in advance.
  • As volume goes up, your ability to find a needle in the haystack—any unknown-unknowns—goes down commensurately.

And nothing connects any of these tools. You cannot correlate a spike in your metrics dashboard with the same requests in your logs, nor can you trace one of the errors. It’s impossible. If your APM and metrics tools report different error rates, you have no way of resolving this confusion. The only thing connecting any of these tools is the intuition and straight-up guesses made by your most senior engineers. Which means that the cognitive costs are immense, and your bus factor risks are very real. The most important connective data in your system—connecting metrics with logs, and logs with traces—exists only in the heads of a few people.

At the same time, the engineering overhead required to manage all these tools (and their bills) rises inexorably. With metrics, an engineer needs to spend time auditing your metrics, tracking people down to fix poorly constructed metrics, and reaping those that are too expensive or don’t get used. With logs, an engineer needs to spend time monitoring the log volume, watching for spammy or duplicate log lines, pruning or consolidating them, choosing and maintaining indexes.

But all the time spent wrangling observability 1.0 data types isn’t even the costliest part. The most expensive part is the unseen costs inflicted on your engineering organization as development slows down and tech debt piles up, due to low visibility and thus low confidence.

Is there an alternative? Yes.

The cost model of observability 2.0 is very different

Observability 2.0 has no three pillars; it has a single source of truth. Observability 2.0 tools are built on top of arbitrarily-wide structured log events, also known as spans. From these wide, context-rich structured log events you can derive the other data types (metrics, logs, or traces).

Since there is only one data source, you can correlate and cross-correlate to your heart’s content. You can switch fluidly back and forth between slicing and dicing, breaking down or grouping by events, and viewing them as a trace waterfall. You don’t have to worry about cardinality or key space limitations.

You also effectively get infinite custom metrics, since you can append as many as you want to the same events. Not only does your cost not go up linearly as you add more custom metrics, your telemetry just gets richer and more valuable the more key-value pairs you add! Nor are you limited to numbers; you can add any and all types of data, including valuable high-cardinality fields like “App Id” or “Full Name.”

Observability 2.0 has its own amplification factor to consider. As you instrument your code with more spans per request, the number of events you have to send (and pay for) goes up. However, you have some very powerful tools for dealing with this: you can perform dynamic head-based sampling or even tail-based sampling, where you decide whether or not to keep the event after it’s finished, allowing you to capture 100% of slow requests and other outliers.
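
As a sketch of what a tail-based sampling decision might look like (the thresholds and rates below are placeholders, not anyone’s actual defaults): the keep/drop call happens after the request finishes, so errors and slow outliers can be kept at 100% while routine traffic is sampled down, and the recorded sample rate lets aggregates be re-weighted at query time.

package main

import (
	"fmt"
	"math/rand"
)

type finishedRequest struct {
	DurationMs float64
	StatusCode int
}

// keep decides after the request completes whether to send its event, and
// returns the sample rate so aggregates can be re-weighted at query time.
func keep(r finishedRequest) (bool, int) {
	switch {
	case r.StatusCode >= 500: // keep every error
		return true, 1
	case r.DurationMs > 1000: // keep every slow outlier
		return true, 1
	default: // keep roughly 1 in 20 of everything else
		return rand.Intn(20) == 0, 20
	}
}

func main() {
	fmt.Println(keep(finishedRequest{DurationMs: 2300, StatusCode: 200})) // always kept: it's slow
	fmt.Println(keep(finishedRequest{DurationMs: 45, StatusCode: 200}))   // kept ~5% of the time
}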

Engineering time is your most precious resource

But the biggest difference between observability 1.0 and 2.0 won’t show up on any invoice. The difference shows up in your engineering team’s ability to move quickly, with confidence.

Modern software engineering is all about hooking up fast feedback loops. And observability 2.0 tooling is what unlocks the kind of fine-grained, exploratory experience you need in order to accelerate those feedback loops.

Where observability 1.0 is about MTTR, MTTD, reliability, and operating software, observability 2.0 is what underpins the entire software development lifecycle, setting the bar for how swiftly you can build and ship software, find problems, and iterate on them. Observability 2.0 is about being in conversation with your code, understanding each user’s experience, and building the right things.

Observability 2.0 isn’t exactly cheap either, although it is often less expensive. But the key difference between o11y 1.0 and o11y 2.0 has never been that either is cheap; it’s that with observability 2.0, when your bill goes up, the value you derive from your telemetry goes up too. You pay more money, you get more out of your tools. As you should.

Interested in learning more? We’ve written at length about the technical prerequisites for observability with a single source of truth (“observability 2.0” as we’ve called it here). Honeycomb was built to this spec; ServiceNow (formerly Lightstep) and Baselime are other vendors that qualify. Click here to get a Honeycomb demo.

CORRECTION: The original version of this document said that “nothing connects any of these tools.” If you are using a single unified vendor for your metrics, logging, APM, RUM, and tracing tools, this is not strictly true. Vendors like New Relic or Datadog now let you define certain links between your traces and metrics, which allows you to correlate between data types in a few limited, predefined ways. This is better than nothing! But it’s very different from the kind of fluid, open-ended correlation capabilities that we describe with o11y 2.0. With o11y 2.0, you can slice and dice, break down, and group by your complex data sets, then grab a trace that matches any specific set of criteria at any level of granularity. With o11y 1.0, you can define a metric up front, then grab a random exemplar of that metric, and that’s it. All the limitations of metrics still apply; you can’t correlate any metric with any other metric from that request, app, user, etc, and you certainly can’t trace arbitrary criteria. But you’re right, it’s not nothing. 😸


Live Your Best Life With Structured Events

If you’re like most of us, you learned to debug as a baby engineer by way of printf(3). By the time you were shipping code to production you had probably learned to instrument your code with a real metrics library. Maybe a tenth of us learned to use gdb and still step through functions on the regular. (I said maybe.)

Printing stuff to stdout is still the Swiss Army knife of tools. Always there when you reach for it, usually helps more than it does harm. (I said usually.)

And then! In case you’ve been living under a rock, we recently went and blew up ye aulde monolythe, and in the process we … lost most of our single-process tools and techniques for debugging. Forget gdb; even printf doesn’t work when you’re hopping the network between functions.

If your tool set no longer works for you, friend, it’s time to go all in. Maybe what you wanted was a faster horse, but it’s time for a car, and the sooner you turn in your oats for gas cans and a spare tire, the better.

Exercising Good Technical Judgment (When You Don’t Have Any)

If you’re stuck trying to debug modern problems with pre-modern tooling, the first thing to do is stop digging the hole. Stop pushing good data after bad into formats and stores that aren’t going to help you answer the right questions.

In brief: if you aren’t rolling out a solution based on arbitrarily wide, structured raw events that are unique and ordered and trace-aware and without any aggregation at write time, you are going to regret it. (If you aren’t using OpenTelemetry, you are going to regret that, too.)

So just make the leap as soon as possible.

But let’s rewind a bit.  Let’s start with observability.

 

Observability: an introduction

Observability is not a new word or concept, but the definition of observability as a specific technical term applied to software engineering is relatively new — about four years old. Before that, if you heard the term in softwareland it was only as a generic synonym for telemetry (“there are three pillars of observability”, in one annoying formulation) or team names (twitter, for example, has long had an “observability team”).

The term itself originates with control theory:

“In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals. The concept of observability was introduced by Hungarian-American engineer Rudolf E. Kálmán for linear dynamic systems.[1][2]”

But when applied to a software context, observability refers to how well you can understand and reason about your systems, just by interrogating them and inspecting their outputs with your tools. How well can you understand the inside of the system from the outside?

Achieving this relies on your ability to ask brand new questions, questions you have never encountered and never anticipated — without shipping new code. Shipping new code is cheating, because it means that you knew in advance what the problem was in order to instrument it.

But what about monitoring?

Monitoring has a long and robust history, but it has always been about watching your systems for failures you can define and expect. Monitoring is for known-unknowns, and setting thresholds and running checks against the system. Observability is about the unknown-unknowns. Which requires a fundamentally different mindset and toolchain.

“Monitoring is the action of observing and checking the behavior and outputs of a system and its components over time.” — @grepory, in his talk “Monitoring is Dead“.

Monitoring is a third-person perspective on your software. It’s not software explaining itself from the inside out, it’s one piece of software checking up on another.

Observability is for understanding complex, ephemeral, dynamic systems (not for debugging code)

You don’t use observability for stepping through functions; it’s not a debugger.  Observability is for swiftly identifying where in your system the error or problem is coming from, so you can debug it — by reproducing it, or seeing what it has in common with other erroring requests.  You can think of observability as being like B.I. (business intelligence) tooling for software applications, in the way you engage in a lot of exploratory, open-ended data sifting to detect novel patterns and behaviors.

Observability is often about swiftly isolating or tracking down the problem in your large, sprawling, far-flung, dynamic system. Because the hard part of distributed systems is rarely debugging the code, it’s figuring out where the code you need to debug is.

The need for observability is often associated with microservices adoption, because they are prohibitively difficult to debug without service-level event oriented tooling — the kind you can get from Honeycomb and Lightstep … and soon, I hope, many other vendors.

Events are the building blocks of observability

Ergh, another overloaded data term. What even is an “event”?

An observability “event” is a hop in the lifecycle of an end-to-end request. If a request executes code on three services separated by network hops before returning to the user, that request generated three observability “events”, each packed with context and details about that code running in that environment. These are also sometimes called “canonical log lines“. If you implemented tracing, each event may be a span in your trace.

If request ID #A897BEDC hits your edge, then your API service, then four more internal services, and twice connects to a db and runs a query, then request ID #A897BEDC generated 8 observability events … assuming you are in fact gathering observability data from the edge, the API, the internal services and the databases.

This is an important caveat. We only gather observability events from services that we can and do introspect. If it’s a black box to us, that hop cannot generate an observability event. So if request ID #A897BEDC also performed 20 cache lookups and called out to 8 external HTTP services and 2 managed databases, those 30 hops do not generate observability events (assuming you haven’t instrumented the memcache service and have no instrumentation from those external services/dbs). Each request generates one event per request per service hop.**

(I also wrote about logs vs structured events here.)

Observability is a first-person narrative.

We care primarily about self-reported status from the code as it executes the request path.

Instrumentation is your eyes and ears, explaining the software and its environment from the perspective of your code. Monitoring, on the other hand, is traditionally a third-person narrative — it’s one piece of software checking up on another piece of software, with no internal knowledge of its hopes and dreams.

First-person reports have the best potential for telling a reliable narrative.  And more importantly, they map directly to user experience in a way that third-party monitoring does not and cannot.

Events … must be structured.

First, structure your goddamn data.  You’re a computer scientist, you’ve got no business using text search to plow through terabytes of text.

Events …  are not just structured logs.

Now, part of the reason people seem to think structured data is cost-prohibitive is that they’re doing it wrong.  They’re still thinking about these like log lines.  And while you can look at events like they’re just really wide structured log lines that aren’t flushed to disk, here’s why you shouldn’t: logs have decades of abhorrent associations and absolutely ghastly practices.

Instead of bundling up and passing along one neat little pile of context, they’re spewing log lines inside loops in their code and DDoS’ing their own logging clusters. They’re shitting out “log lines” with hardly any dimensions, so they’re information-sparse and just straight up wasting the writes. And then to compensate for the sparseness and repetitiveness they just start logging the same exact nouns tens or hundreds of times over the course of the request, just so they can correlate or reconstruct some lousy request that they never should have blown up in the first place!

But they keep hearing they should be structuring their logs, so they pile structure onto their horrendous little strings, which pads every log line by a few bytes, so their bill goes up but they aren’t getting any benefit! Just paying more! What the hell, structuring is bullshit!

Kittens. You need a fundamentally different approach to reap the considerable benefits of structuring your data.

But the difference between strings and structured data is ~basically the difference between grep and all of computer science. 😛

Events … must be arbitrarily wide and dense with context.

So the most effective way to structure your instrumentation, to get the absolute most bang for your buck, is to emit a single arbitrarily wide event per request per service hop. At Honeycomb, the maturely instrumented datasets that we see are often 200-500 dimensions wide.  Here’s an event that’s just 20 dimensions wide:

{
   "timestamp":"2018-11-20 19:11:56.910",
   "az":"us-west-1",
   "build_id":"3150",
   "customer_id":"2310",
   "durationMs":167,
   "endpoint":"/api/v2/search",
   "endpoint_shape":"/api/v2/search",
   "fraud_dur":131,
   "hostname":"app14",
   "id":"f46691dfeda9ede4",
   "mysql_dur":"",
   "name":"/api/v2/search",
   "parent_id":"",
   "platform":"android",
   "query":"",
   "serviceName":"api",
   "status_code":200,
   "traceId":"f46691dfeda9ede4",
   "user_id":"344310",
   "error_rate":0,
   "is_root":"true"
}

So a well-instrumented service should have hundreds of these dimensions, all bundled around the context of each request. And yet — and here’s why events blow the pants off of metrics — even with hundreds of dimensions, it’s still just one write. Adding more dimensions to your event is effectively free, it’s still one write plus a few more bits.

Compare this to metrics-based systems, where you are often in the position of trying to predict whether a metric will be valuable enough to justify the extra write, because every single metric or tag you add contributes linearly to write amplification. Ever gotten billed tens of thousands of dollars for your custom metrics, or had to prune your list of useful custom metrics down to something affordable? (“BUT THOSE ARE THE ONLY USEFUL ONES!”, as every ops team wails)

Events … must pass along the blob of context as the request executes

As you can imagine, it can be a pain in the ass to keep passing this blob of information along the life of the request as it hits many services and databases. So at Honeycomb we do all the annoying parts for you with our integrations. You just install the go pkg or ruby gem or whatever, and under the hood we:

  1. initialize an empty debug event when the request enters that service
  2. prepopulate the empty debug event with any and all interesting information that we already know or can guess: language type, version, environment, etc.
  3. create a framework so you can just stuff any other details in there as easily as if you were printing it out to stdout
  4. pass the event along and maintain its state until you are ready to error or exit
  5. write the extremely wide event out to honeycomb

Easy!
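
If you want a feel for what those five steps look like without any vendor code, here is a minimal net/http sketch. This is not Honeycomb’s actual integration; WithEvent, AddField, and every field name are invented for illustration, and a real beeline does considerably more.

package main

import (
	"context"
	"encoding/json"
	"net/http"
	"os"
	"runtime"
	"time"
)

type event map[string]any

type ctxKey struct{}

// WithEvent gives every request one wide event: initialized empty, prepopulated,
// passed along via context, and written out once when the handler returns.
func WithEvent(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ev := event{}                        // 1. initialize an empty debug event
		ev["go.version"] = runtime.Version() // 2. prepopulate what we already know
		ev["hostname"], _ = os.Hostname()
		ev["request.path"] = r.URL.Path
		start := time.Now()

		ctx := context.WithValue(r.Context(), ctxKey{}, ev) // 3/4. pass it along for the life of the request
		next(w, r.WithContext(ctx))

		ev["duration_ms"] = time.Since(start).Milliseconds()
		_ = json.NewEncoder(os.Stdout).Encode(ev) // 5. write the wide event out (stdout stands in for a real backend)
	}
}

// AddField lets handler code stuff in more context as easily as printing to stdout.
func AddField(ctx context.Context, key string, val any) {
	if ev, ok := ctx.Value(ctxKey{}).(event); ok {
		ev[key] = val
	}
}

func main() {
	http.HandleFunc("/api/v2/search", WithEvent(func(w http.ResponseWriter, r *http.Request) {
		AddField(r.Context(), "user_id", r.URL.Query().Get("user_id"))
		AddField(r.Context(), "result_count", 42)
		w.WriteHeader(http.StatusOK)
	}))
	_ = http.ListenAndServe(":8080", nil)
}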

(Check out this killer talk from @lyddonb on … well everything you need to know about life, love and distributed systems is in here, but around the 12:00 mark he describes why this approach is mandatory. WATCH IT. https://www.youtube.com/watch?v=xy3w2hGijhE&feature=youtu.be)

Events … should collect context like sticky buns collect dust

Other stuff you’ll want to track in these structured blobs includes:

  1. Metadata like src, dst headers
  2. The timing stats and contents of every network call (our beelines wrap all outgoing http calls and db queries automatically)
  3. Every raw db query, normalized query family, execution time etc
  4. Infra details like AZ, instance type, provider
  5. Language/environment context like $lang version, build flags, $ENV variables
  6. Any and all unique identifying bits you can get your grubby little paws on — request ID, shopping cart ID, user ID, transaction ID, any other ID … these are always the highest value data for debugging.
  7. Any other useful application context.  Service name, build id, ordering info, error rates, cache hit rate, counters, whatever.
  8. Possibly the system resource state at this point in time.  e.g. values from /proc/net/ipv4 stats

Capture all of it. Anything that ever occurs to you (“this MIGHT be handy someday”) — don’t even hesitate, just throw it on the pile. Collect it up in one rich fat structured blob.

Events … must be unique, ordered, and traceable

You need a unique request ID, and you need to propagate it through your stack in some way that preserves sequence. Once you have that, traces are just a beautiful visualization layer on top of your shiny event data.
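
As a sketch of just how thin that visualization layer is (struct and field names are invented here): once every event carries trace, span, and parent IDs, rendering a waterfall is little more than grouping by trace and nesting children under parents.

package main

import "fmt"

type span struct {
	TraceID  string
	SpanID   string
	ParentID string
	Name     string
}

// printTree walks the events of one trace, indenting each span under its parent.
func printTree(spans []span, parent string, depth int) {
	for _, s := range spans {
		if s.ParentID == parent {
			fmt.Printf("%*s%s\n", depth*2, "", s.Name)
			printTree(spans, s.SpanID, depth+1)
		}
	}
}

func main() {
	trace := []span{
		{"f4669", "a1", "", "edge: GET /api/v2/search"},
		{"f4669", "b2", "a1", "api: search handler"},
		{"f4669", "c3", "b2", "mysql: SELECT ..."},
	}
	printTree(trace, "", 0) // root spans have an empty parent_id
}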

Events … must be stored raw.

Because observability means you need to be able to ask any arbitrary new question of your system without shipping new code, and aggregation is a one-way trip. Once you have aggregated your data and discarded the raw requests, you have destroyed your ability to ask new questions of that data forever. For Ever.

Aggregation is a one-way trip.  You can always, always derive your pretty metrics and dashboards and aggregates from structured events, and you can never go in reverse. Same for traces, same for logs. The structured event is the gold standard. Invest in it now, save your ass in the future.

It’s only observability if you can ask new questions. And that means storing raw events.

Events…are richer than metrics

There’s always tradeoffs when it comes to data. Metrics choose to sacrifice context and connective tissue, and sometimes high cardinality support, which you need to correlate anomalies or track down outliers. They have a very small, efficient data format, but they sacrifice everything else by discarding all but the counter, gauge, etc.

A metric looks like this, by the way.

{ metric: "db.query.time", value: 0.502, tags: Array(), type: set }

That’s it. It’s just a name, a number and maybe some tags. You can’t dig into the event and see what else was happening when that query was strangely slow. You can never get that information back after discarding it at write time.

But because they’re so cheap, you can keep every metric for every request! Maybe. (Sometimes.) More often, what happens is they aggregate at write time. So you never actually get a value written out for an individual event; it smushes together everything that happens in a one-second interval and calculates some aggregate values to write out. And that’s all you can ever get back to.

With events, and their relative explosion of richness, we sacrifice our ability to store every single observability event about every request. At FB, every request generated hundreds of observability events as it made its way through the stack. Nobody, NOBODY is going to pay for an o11y stack that is hundreds of times as large as production. The solution to that problem is sampling.

Events … should be sampled.

But not dumb, blunt sampling at server side. Control it on the client side.

Then sample heavily for events that are known to be common and useless, but keep the events that have interesting signal. For example: health checks that return 200 OK usually represent a significant chunk of your traffic and are basically useless, while 500s are almost always interesting. So are all requests to /login or /payment endpoints, so keep all of them. For database traffic: SELECTs for health checks are useless, DELETEs and all other mutations are rare but you should keep all of them. Etc.
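
Here is a minimal sketch of those rules as client-side code (the endpoints, verbs, and rates are illustrative placeholders): interesting events get a sample rate of 1, bulky-but-boring ones get sampled hard, and the rate travels with the event so aggregates can be re-weighted later.

package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// sampleRate picks how aggressively to sample an event: 1 means keep all,
// N means keep roughly 1 in N (and record N for re-weighting at query time).
func sampleRate(endpoint string, statusCode int, dbVerb string) int {
	switch {
	case statusCode >= 500: // errors are almost always interesting
		return 1
	case endpoint == "/login" || endpoint == "/payment": // keep all of these
		return 1
	case strings.EqualFold(dbVerb, "DELETE"): // rare mutations: keep them all
		return 1
	case endpoint == "/healthz" && statusCode == 200: // bulky, boring health checks
		return 1000
	default:
		return 10
	}
}

func shouldKeep(rate int) bool { return rate <= 1 || rand.Intn(rate) == 0 }

func main() {
	rate := sampleRate("/healthz", 200, "")
	fmt.Println(rate, shouldKeep(rate))
}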

You don’t need to treat your observability metadata with the same care as you treat your billing data. That’s just dumb.

… To be continued.

I hope it’s now blazingly obvious why observability requires — REQUIRES — that you have access to raw structured events with no pre-aggregation or write-time rollups. Metrics don’t count. Just traces don’t count. Unstructured logs sure as fuck don’t count.

Structured, arbitrarily wide events, with dynamic sampling of the boring parts to control costs. There is no substitute.

For more about the technical requirements for observability, read this, this, or this.

**The deep fine print: it’s one observability event per request per service hop … because we gather observability detail organized by request id.  Databases may be different.  For example, with MongoDB or MySQL, we can’t instrument them to talk to honeycomb directly, so we gather information about its internal perspective by 1) tailing the slow query log (and turning it up to log all queries if perf allows), 2) streaming tcp over the wire and reconstructing transactions, 3) connecting to the mysql port as root every couple seconds from cron, then dumping all mysql stats and streaming them in to honeycomb as an event.  SO.  Database traffic is not organized around connection length or unique request id, it is organized around transaction id or query id.  Therefore it generates one observability event per query or transaction. 
In other words: if your request hit the edge, API, four internal services, two databases … but ran 1 query on one db and 10 queries on the second db … you would generate a total of *19 observability events* for this request.
For more on observability for databases and other black boxes, try this blog post.

The Truth About “MEH-TRICS”

First published on 2022-04-13 at https://www.honeycomb.io/blog/truth-about-meh-trics-metrics

A long time ago, in a galaxy far, far away, I said a lot of inflammatory things about metrics.

“Metrics are shit salad.”

“Metrics are simply nerfed dimensions.”

“Metrics suck,” “metrics are legacy,” “metrics and time series aggregates will fucking kneecap you.”

I cannot tell a lie; Twitter will testify that I’ve spent the past six years ragging on metrics. So much so that ever since we launched Honeycomb Metrics last year, our poor solution architects have been encountering skeptics in the field who repeat my quotes back to them and ask, dubiously, whether Honeycomb Metrics are any good or not, and whether we genuinely plan on investing in it or not, given our known anti-metrics sympathies.

That’s a great question. 😊

Metrics aren’t worthless; they’re just limited.

Metrics are a mature technology that’s been around for over 30 years, and they have some real advantages. They’re tiny, fast, and cheap; you can hold a bunch of them in memory as counters, summaries, and gauges. They aggregate well and take up a fixed amount of storage space. The entire monitoring industry is built on top of metrics.

When it comes to workloads like, “How heavy is the write load on my hard drive?” or “What is the temperature or fan status inside my chassis?” or “What is the traffic rate in and out of this interface on my switch?”  metrics are what you should use. In fact, pretty much any time you want to know the health of a system or component in toto, metrics are the right tool.

Because that’s what metrics do best—report statistics in aggregate, from the perspective of any system or component. They can tell you that your Ruby HTTP worker pool is 70% utilized or that your nginx webserver is returning 502s 1% of the time. What they can’t tell you is what this means for any one of your users, applications, delivery vehicles, and so forth.

Until recently, metrics-based tools or logs were the only game in town. People were trying to sell us metrics tools for observability use cases, and that’s what got my goat so badly. If you simply append “… for observability” to each of my inflammatory statements, then I stand by them completely.

“Metrics are shit salad … for observability.”

Yup, rings true.

You’re never going to make a metrics tool like Prometheus or Datadog into an observability tool. You’re just not. Observability is about unknown-unknowns, while metrics are a tool for known-unknowns.

If you need a refresher on the differences between observability and monitoring, I’ll refer you to pieces like this, this, and this. What I want to talk about here is slightly different. In a post-observability world, what is the true and proper place for metrics tooling?

Metrics and observability have different use cases.

Metrics aren’t completely useless, even if you have a robust observability presence. We still use metrics at Honeycomb to this day for certain workloads—and always will because they’re the right tool for the job.

There are two kinds of workloads, roughly speaking: your code—the code you write, review, ship, debug and maintain on a daily basis. And other people’s code—the code you have to run and use in order to support your code. Some examples of the latter might be: Linux, Docker, MySQL, Amazon RDS, Kafka, AWS Lambda, GCP gateways, memcache, CI/CD pipelines, Kubernetes, etc.

Your code is your crown jewels, the code you need to survive and succeed as a business. It changes constantly—many times per week, if not per day. You are expected to understand its inner workings intimately, and spend lots of time chasing down bugs or understanding and reproducing behavior. You care about the way it performs and interacts with each and every individual user, with changing infrastructure state, and under a variety of different load conditions.

That is why your code demands observability. In order to understand your software, you must first instrument it, in a way that collects lots of rich context and bundles it up around each event end-to-end. Then you need to stream those events into a tool that lets you slice and dice and trace and explore with support for high-cardinality and high-dimensionality data. That’s the only way you’re going to be able to correlate errors, track down outliers, and reflect each user’s experience.

But what about the rest of the software? You can’t instrument Amazon RDS, and only crazy people would instrument, rebuild, and repackage things like Kafka or Docker or nginx. The whole point of third-party software is that you DON’T USE IT until it’s stable enough to be taken more or less for granted. Sure, you roll updates, but usually on the order of months or years—not every day. You don’t need to be intimately familiar with its inner workings because you aren’t changing it every day. Those aren’t your crown jewels.

You do care about their health though, only differently. You care about whether you need to provision more capacity or not. You care about knowing how hard you’re hammering on the underlying hardware or hypervisor. That’s why metrics and monitoring are the right tools to use for third-party code. They don’t let you peer under the hood in the same way, or slice and dice in the same way, but that’s okay. You shouldn’t have to.

With third-party stuff, you don’t care about the code, you care about the health of the service. In aggregate.

(There are some kinds of in-between software, like databases, where event-level information is super useful for debugging things like slow queries and lock percentages, and you can use various black box techniques to approximate observability without instrumentation. But in general this model holds up quite well.)

In a post-observability world, what are metrics for?

I’ve often pointed out that observability is built on top of arbitrarily wide structured data blobs, and that metrics, logs, and traces can be derived from those blobs while the reverse is not true—you can’t take a bunch of metrics and reformulate a rich event.

And yes, people who have observability typically find themselves using metrics and dashboards less and less. They’re simply not as versatile or useful as events that you can slice and dice and manipulate in infinite ways. And you can derive aggregates and trends from the events you have stored.

But metrics will always be useful for understanding third-party software, from the perspective of the service, cluster, or node. They will always be the right tool for the job when it comes to software interfacing with hardware. And they can be super complementary when you are investigating your code using events and instrumentation.

If you’re an engineer writing and shipping code, you’re never not going to want to know if your change caused memory usage to triple, or CPU utilization to skyrocket, or disk usage or network throughput to saturate. That’s why we built Honeycomb Metrics as an overlay, a way to enhance or validate your understanding of the impact your code changes have had on the underlying system.

Metrics are also valuable as a bridge to the past. People have been instrumenting software for metrics for 30 years—they’re never going away completely, and not everything can or should be reinstrumented with events. Lots of people already have robust monitoring systems that slurp in millions of metrics. Nobody wants to have to redo all that work just because they’re moving to a different tool, so people tend to point their metrics firehose at Honeycomb as a way of getting started as they roll observability out into their code.

The Truth About “MEH-TRICS”

Observability is a Many-Splendored Definition

Last weekend, @swyx posted a great little primer on instrumentation titled “Observability Tools in JavaScript”.  A friend sent me the link and suggested that I might want to respond and clarify some things about observability, so I did, and we had a great conversation!  Here is a lightly edited transcript of my reply tweet storm.

First of all, confusion over terminology is understandable, because there are some big players out there actively trying to confuse you!  Big Monitoring is indeed actively trying to define observability down to “metrics, logs and traces”.  I guess they have been paying attention to the interest heating up around observability, and well… they have metrics, logs, and tracing tools to sell?  So they have hopped on the bandwagon with some undeniable zeal.

But metrics, logs and traces are just data types.  Which actually has nothing to do with observability.  Let me explain the difference, and why I think you should care about this.

“Observability? I do not think it means what you think it means.”

Observability is a borrowed term from mechanical engineering/control theory.  It means, paraphrasing: “can you understand what is happening inside the system — can you understand ANY internal state the system may get itself into, simply by asking questions from the outside?”  We can apply this concept to software in interesting ways, and we may end up using some data types, but that’s putting the cart before the horse.

It’s a bit like saying that “database replication means structs, longints and elegantly diagrammed English sentences.”  Er, no.. yes.. missing the point much?

This is such a reliable bait and switch that any time you hear someone talking about “metrics, logs and traces”, you can be pretty damn sure there’s no actual observability going on.  If there were, they’d be talking about that instead — it’s far more interesting!  If there isn’t, they fall back to talking about whatever legacy products they do have, and that typically means, you guessed it: metrics, logs and traces.

❌ Metrics

Metrics in particular are actually quite hostile to observability.  They are usually pre-aggregated, which means you are stuck with whatever questions you defined in advance.  Even when they aren’t pre-aggregated, they permanently discard the connective tissue of the request at write time, which destroys your ability to correlate issues across requests, track down individual requests, or drill down into a set of results — FOREVER.
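Here is that write-time loss in miniature. This is a toy in-process counter, not any particular metrics library, but the tradeoff is the one every pre-aggregated system makes:

```python
counters = {}

def record_error_as_metric(event):
    # Pre-aggregation: all that survives is a number under a metric name.
    counters["errors.count"] = counters.get("errors.count", 0) + 1
    # The user id, build id, endpoint, and trace id sitting in `event` are
    # thrown away right here, at write time. No later query can recover them.

def record_error_as_event(event, store):
    # The observability alternative: keep the whole event, aggregate at read time.
    store.append(event)
```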

Which doesn’t mean metrics aren’t useful!  They are useful for many things!  But they are useful for things like static dashboards, trend analysis over time, or monitoring that a dimension stays within defined thresholds.  Not observability.  (Liz would interrupt here and say that Google’s observability story involves metrics, and that is true — metrics with exemplars.  But this type of solution is not available outside Google as far as we know..)

❌ Logs

Ditto logs.  When I say “logs”, you think: unstructured strings, written out to disk haphazardly during execution; “many” log lines per request; probably 1-5 dimensions of useful data per log line; probably a schema and some defined indexes for searching.  Logs are at their best when you know exactly what to look for; then you can go and find it.

Again, these connotations and assumptions are the opposite of what observability requires: highly structured data only.  That data is usually generated by instrumentation deep within the app, is generally not buffered to local disk, arrives as a single event per request per service, is schemaless and indexless (or has inferred schemas and autoindexing), and typically contains hundreds of dimensions per event.

❓ Traces

Traces?  Now we’re getting closer.  Tracing IS a big part of observability, but tracing just means visualizing events in order by time.  It certainly isn’t and shouldn’t be a standalone product; that just creates unnecessary friction and distance.  Hrmm … so what IS observability again, as applied to the software domain??

As a reminder, observability applied to software systems means having the ability to ask any question of your systems — understand any user’s behavior or subjective experience — without having to predict that question, behavior or experience in advance.

Observability is about unknown-unknowns.

At its core, observability is about these unknown-unknowns.

Plenty of tools are terrific at helping you ask the questions you could predict wanting to ask in advance.  That’s the easy part.  “What’s the error rate?”  “What is the 99th percentile latency for each service?”  “How many READ queries are taking longer than 30 seconds?”

  • Monitoring tools like DataDog do this — you predefine some checks, then set thresholds that mean ERROR/WARN/OK.
  • Logging tools like Splunk will slurp in any stream of log data, then let you index on questions you want to ask efficiently.
  • APM tools auto-instrument your code and generate lots of useful graphs and lists like “10 slowest endpoints”.

But if you *can’t* predict all the questions you’ll need to ask in advance, or if you *don’t* know what you’re looking for, then you’re in o11y territory.

  • This can happen for infrastructure reasons — microservices, containerization, and polyglot storage strategies can result in a combinatorial explosion of components all talking to each other, such that you can’t usefully pre-generate graphs for every combination that can possibly degrade.
  • And it can happen — has already happened — to most of us for product reasons, as you’ll know if you’ve ever tried to figure out why a spike of errors was being caused by users on ios11 using a particular language pack but only in three countries, and only when the request hit the image export microservice running build_id 789782 if the user’s last name starts with “MC” and they then try to click on a particular button which then issues a db request using the wrong cache key for that shard.

Gathering the right data, then exploring the data.

Observability starts with gathering the data at the right level of abstraction, organized around the request path, such that you can slice and dice and group and look for patterns and cross-correlations in the requests.

To do this, we need to stop firing off metrics and log lines willy-nilly and be more disciplined.  We need to issue one single arbitrarily-wide event per service per request, and it must contain the *full context* of that request. EVERYTHING you know about it, anything you did in it, all the parameters passed into it, etc.  Anything that might someday help you find and identify that request.

Then, when the request is poised to exit the service or error out, you ship that blob off to your o11y store as one very wide structured event per request per service.
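Mechanically, that tends to look like a thin bit of middleware: open an empty blob when the request enters the service, let every layer append what it knows, and flush it exactly once on the way out. Here is a hand-rolled sketch of the pattern (instrumentation libraries do this part for you; every name below is made up):

```python
import json
import sys
import time

class RequestEvent:
    """Accumulates context for one request, then emits a single wide event."""

    def __init__(self, service):
        self.fields = {"service": service, "timestamp": time.time()}
        self._start = time.monotonic()

    def add(self, key, value):
        # Anything you learn about the request, at any layer, goes in here.
        self.fields[key] = value

    def send(self, out=sys.stdout):
        # Called exactly once, as the request leaves (or errors out of) the service.
        self.fields["duration_ms"] = (time.monotonic() - self._start) * 1000
        out.write(json.dumps(self.fields, default=str) + "\n")


def handle_export(params):
    """A hypothetical request handler instrumented with one wide event."""
    ev = RequestEvent(service="image-export")
    ev.add("request.endpoint", "/export")
    ev.add("request.user_id", params.get("user_id"))
    try:
        ev.add("export.format", params.get("format", "png"))
        # ... do the actual work, adding fields as you learn things ...
        ev.add("error", None)
    except Exception as exc:
        ev.add("error", repr(exc))
        raise
    finally:
        ev.send()

# e.g. handle_export({"user_id": "user_48291", "format": "png"})
```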

In order to deliver observability, your tool also needs to support high cardinality and high dimensionality.  Briefly, cardinality refers to the number of unique items in a set, and dimensionality means how many adjectives can describe your event.  If you want to read more, here is an overview of the space, and more technical requirements for observability.

You REQUIRE the ability to chain and filter as many dimensions as you want, with infinitely high cardinality for each one, if you’re going to be able to ask arbitrary questions about your unknown unknowns.  This functionality is table stakes.  It is non-negotiable.  And you cannot get it from any metrics or logs tool on the market today.
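For a taste of what “chain and filter arbitrary dimensions, with high cardinality” actually buys you, here is a naive read-time query over a toy handful of wide events. (A real columnar store does this fast over billions of rows; the field names and values are just illustrations, riffing on the error-spike hunt above.)

```python
def query(events, where, group_by):
    """Filter on any combination of field/value pairs, then count per group."""
    matched = [e for e in events
               if all(e.get(field) == value for field, value in where.items())]
    counts = {}
    for e in matched:
        key = e.get(group_by)
        counts[key] = counts.get(key, 0) + 1
    return counts

# Toy stand-ins for stored wide events. User ids and build ids are high-cardinality
# dimensions (unbounded unique values); a real event would carry hundreds of fields
# (high dimensionality), not five.
events = [
    {"device.os": "ios11", "app.build_id": "789782", "service": "image-export",
     "request.user_id": "user_mcallister_01", "error": "WrongCacheKey"},
    {"device.os": "ios11", "app.build_id": "789782", "service": "image-export",
     "request.user_id": "user_mcdowell_77", "error": "WrongCacheKey"},
    {"device.os": "android12", "app.build_id": "789781", "service": "image-export",
     "request.user_id": "user_smith_02", "error": None},
]

# A question you never had to predict in advance:
print(query(events,
            where={"device.os": "ios11", "app.build_id": "789782"},
            group_by="error"))
# -> {'WrongCacheKey': 2}
```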

Why this matters.

Alright, this is getting pretty long. Let me tell you why I care so much, and why I want people like you specifically (referring to frontend engineers and folks earlier in their careers) to grok what’s at stake in the observability term wars.

We are way behind where we ought to be as an industry. We are shipping code we don’t understand, to systems we have never understood. Some poor sap is on call for this mess, and it’s killing them, which makes the software engineers averse to owning their own code in prod.  What a nightmare.

Meanwhile developers readily admit they waste >40% of their day doing bullshit that doesn’t move the business forward.  In large part this is because they are flying blind, just stabbing around in the dark.

We all just accept this.  We shrug and say well that’s just what it’s like, working on software is just a shit salad with a side of frustration, it’s just the way it is.

But it is fucking not.  It is un fucking necessary.  If you instrument your code, watch it deploy, then ask “is it doing what I expect, does anything else look weird” as a habit?  You can build a system that is both understandable and well-understood.  If you can see what you’re doing, and catch errors swiftly, it never has to become a shitty hairball in the first place.  That is a choice.

🌟 But observability in the original technical sense is a necessary prerequisite to this better world. 🌟

If you can’t break down by high cardinality dimensions like build ids, unique ids, requests, and function names and variables, if you cannot explore and swiftly skim through new questions on the fly, then you cannot inspect the intersection of (your code + production + users) with the specificity required to associate specific changes with specific behaviors.  You can’t look where you are going.

Observability as I define it is like taking off the blindfold and turning on the light before you take a swing at the pinata.  It is necessary, although not sufficient alone, to dramatically improve the way you build software.  Observability as they define it gets you to … exactly where you already are.  Which of these is a good use of a new technical term?

 

Do better.

And honestly, it’s the next generation who are best poised to learn the new ways and take advantage of them. Observability is far, far easier than the old ways and workarounds … but only if you don’t have decades of scar tissue and old habits to unlearn.

The less time you’ve spent using monitoring tools and ops workarounds, the easier it will be to embrace a new and better way of building and shipping well-crafted code.

Observability matters.  You should care about it.  And vendors need to stop trying to confuse people into buying the same old bullshit tools by smooshing them together and slapping on a new label.  Exactly how long do they expect to fool people for, anyway?
