From Cloudwashing to O11ywashing

November 24, 2025 mipsytipsymonitoring, rageguy, three pillars, unified storage, vendors6 Comments

I was just watching a panel on observability, with a handful of industry executives and experts who shall remain nameless and hopefully duly obscured—their identities are not the point, the point is that this is a mainstream view among engineering executives and my head is exploding.

Scene: the moderator asked a fairly banal moderator-esque question about how happy and/or disappointed each exec has been with their observability investments.

One executive said that as far as traditional observability tools are concerned (“are there faults in our systems?”), that stuff “generally works well.”

However, what they really care about is observing the quality of their product from the customer’s perspective. EACH customer’s perspective.

Nines don't matter if users aren't happy — Nines don’t matter if users aren’t happy

“Did you know,” he mused, “that there are LOTS of things that can interrupt service or damage customer experience that won’t impact your nines of availability?”

(I begin screaming helplessly into my monitor.)

“You could have a dependency hiccup,” he continued, oblivious to my distress. “There could be an issue with rendering latency in your mobile app. All kinds of things.”

(I look down and realize that I am literally wearing this shirt.)

He finishes with,“And that is why we have invested in our own custom solution to measure key workflows through startup payment and success.”

(I have exploded. Pieces of my head now litter this office while my headless corpse types on and on.)

It’s twenty fucking twenty five. How have we come to this point?

Observability is now a billion dollar market for a meaningless term

My friends, I have failed you.

It is hard not to register this as a colossal fucking failure on a personal level when a group of modern, high performing tech execs and experts can all sit around a table nodding their heads at the idea that “traditional observability” is about whether your systems are UP👆 or DOWN👇, and that the idea of observing the quality of service from each customer’s perspective remains unsolved! unexplored! a problem any modern company needs to write custom tooling from scratch to solve.

This guy is literally describing the original definition of observability, and he doesn’t even know it. He doesn’t know it so hard that he went and built his own thing.

You guys know this, right? When he says “traditional observability tools”, he means monitoring tools. He means the whole three fucking pillars model: metrics, logging, and tracing, all separate things. As he notes, these traditional tools are entirely capable of delivering on basic operational outcomes (are we up, down, happy, sad?). They can DO this. They are VERY GOOD tools if that is your goal.

But they are not capable of solving the problem he wants to solve, because that would require combining app, business, and system telemetry in a unified way. Data that is traceable, but not just tracing. With the ability to slice and dice by any customer ID, site location, device ID, blah blah. Whatever shall we call THAT technological innovation, when someone invents it? Schmobservability, perhaps?

So anyway, “traditional observability” is now part of the mainstream vernacular. Fuck. What are we going to do about it? What CAN be done about it?

From cloudwashing to o11ywashing

I learned a new term yesterday: cloudwashing. I learned this from Rick Clark, who tells a hilarious story about the time IBM got so wound up in the enthusiasm for cloud computing that they reclassified their Z series mainframe as “cloud” back in 2008.

(Even more hilarious: asking Google about the precipitating event, and following the LLM down a decade-long wormhole of incredibly defensive posturing from the IBM marketing department and their paid foot soldiers in tech media about how this always gets held up as an example of peak cloudwashing but it was NOT AT ALL cloudwashing due to being an extension of the Z/Series Mainframe rather than a REPLACEMENT of the Z/Series Mainframe, and did you know that Mainframes are bigger business and more relevant today than ever before?)

(Sorry, but I lost a whole afternoon to this nonsense, I had to bring you along for the ride.)

Rick says the same thing is happening right now with observability. And of course it is. It’s too big of a problem, with too big a budget: an irresistible target. It’s not just the legacy behemoths anymore. Any vendor that does anything remotely connected to telemetry is busy painting on a fresh coat of o11ywashing. From a marketing perspective, It would be irresponsible not to.

How to push back on *-washing

Anyway, here are the key takeaways from my weekend research into cloudwashing.

This o11ywashing problem isn’t going away. It is only going to get bigger, because the problem keeps getting bigger, because the traditional vendors aren’t solving it, because they can’t solve it.
The Gartners of the world will help users sort this out someday, maybe, but only after we win. We can’t expect them to alienate multibillion dollar companies in the pursuit of technical truth, justice and the American Way. If we ever want to see “Industry Experts” pitching in to help users spot o11ywashing, as they eventually did with cloudwashing (see exhibit A), we first need to win in the market.
Exhibit A: “How to Spot Cloudwashing”
And (this is the only one that really matters.) we have to do a better job of telling this story to engineering executives, not just engineers. Results and outcomes, not data structures and algorithms.

(I don’t want to make this sound like an epiphany we JUST had…we’ve been working hard on this for a couple years now, and it’s starting to pay off. But it was a powerful confirmation.)

Talking to execs is different than talking to engineers

When Christine and I started Honeycomb, nearly ten years ago, we were innocent, doe-eyed engineers who truly believed on some level that if we just explained the technical details of cardinality and dimensionality clearly and patiently enough to the world, enough times, the consequences to the business would become obvious to everyone involved.

It has now been ten years since I was a hands-on engineer every day (say it again, like pressing on a bruise makes it hurt less), and I would say I’ve been a decently functioning exec for about the last three or four of those years.

What I’ve learned in that time has actually given me a lot of empathy for the different stresses and pressures that execs are under.

I wouldn’t say it’s less or more than the stresses of being an SRE on call for some of the world’s biggest databases, but it is a deeply and utterly different kind of stress, the kind of stress less expiable via fine whiskey and poor life choices. (You just wake up in the morning with a hangover, and the existential awareness of your responsibilities looming larger than ever.)

This is a systems problem, not an operational one

There is a lot of noise in the field, and executives are trying to make good decisions that satisfy all parties and constraints amidst the unprecedented stress-panic-opportunity-terror of AI changing everything. That takes storytelling skills and sales discipline on our part, in addition to technical excellence.

Companies are dumping more and more and more money into their so-called observability tools, and not getting any closer to a solution. Nor will they, so long as they keep thinking about observability in terms of operational outcomes (and buying operational tools). Observability is a systems problem. It’s the most powerful lever in your arsenal when it comes to disrupting software doom spirals and turning them into positive feedback loops. Or it should be.

As Fred Hebert might say, it’s great you’re so good at firefighting, but maybe it’s time to go read the city fire codes.

Execs don’t know what they don’t know, because we haven’t been speaking to them. But we’re starting to.

What will be the next term that gets invented and coopted in the search to solve this problem?

Where to start, with a project so big? Google’s AI says that “experts suggest looking for specific features to identify true ~~cloud~~ observability solutions versus ~~cloudwashed~~ o11ywashed ones.”

I guess this is a good place to start as any: If your “observability” tooling doesn’t help you understand the quality of your product from the customer’s perspective, EACH customer’s perspective, it isn’t fucking observability.

It’s just monitoring dressed up in marketing dollars.

Call it o11ywashing.

Is It Time To Version Observability? (Signs Point To Yes)

August 7, 2024August 9, 2024 mipsytipsycanonical logs, columnar storage, databases, metrics, monitoring, observability 2.010 Comments

Augh! I am so behind on so much writing, I’m even behind on writing shit that I need to reference in order to write other pieces of writing. Like this one. So we’re just gonna do this quick and dirty on the personal blog, and not bother bringing it up to the editorial standards of…anyone else’s sites. 😬

If you’d rather consume these ideas in other ways:

I gave a keynote at SRECon in March
Here is a slide deck of my slides from CTO Craft Con London in May
A Screaming In The Cloud podcast with Corey Quinn in April
My piece earlier in the year on the Cost Crisis in Observability Tooling touched on some of the concepts too
Matt Sanabria wrote a great piece comparing us and a bunch of other observability vendors in 2024.

What does observability mean? No one knows

In 2016, we first borrowed the term “observability” from the wikipedia entry for control systems observability, where it is a measure of your ability to understand internal system states just by observing its outputs. We (Honeycomb) then spent a couple of years trying to work out how that definition might apply to software systems. Many twitter threads, podcasts, blog posts and lengthy laundry lists of technical criteria emerged from that work, including a whole ass book.

In 2018, Peter Bourgon wrote a blog post proposing that “observability has three pillars: metrics, logs and traces. Ben Sigelman did a masterful job of unpacking why metrics, logs and traces are just telemetry. However, lots of people latched on to the three pillars language: vendors because they (coincidentally!) had metrics products, logging products, and tracing products to sell, engineers because it described their daily reality.

Since then the industry has been stuck in kind of a weird space, where the language used to describe the problems and solutions has evolved, but the solutions themselves are largely the same ones as five years ago, or ten years ago. They’ve improved, of course — massively improved — but structurally they’re variations on the same old pre-aggregated metrics.

It has gotten harder and harder to speak clearly about different philosophical approaches and technical solutions without wading deep into the weeds, where no one but experts should reasonably have to go.

This is what semantic versioning was made for

Look, I am not here to be the language police. I stopped correcting people on twitter back in 2019. We all do observability! One big happy family. 👍

I AM here to help engineers think clearly and crisply about the problems in front of them. So here we go. Let’s call the metrics, logs and traces crowd — the “three pillars” generation of tooling — that’s “Observability 1.0“. Tools like Honeycomb, which are built based on arbitrarily-wide structured log events, a single source of truth — that’s “Observability 2.0“.

Here is the twitter thread where I first teased out the differences between these generations of tooling (all the way back in December, yes, that’s how long I’ve been meaning to write this 😅).

About two months ago I wrote this thread about how we lost the battle to define ✨observability✨ -- to give it a real, specific, falsifiable technical definition, distinct from monitoring or telemetry.

I complained, I argued, I grieved...and now I'm over it. SO over it. 🙄 https://t.co/QH3U0iZZ8G
— Charity Majors (@mipsytipsy) December 22, 2023

This is literally the problem that semantic versioning was designed to solve, by the way. Major version bumps are reserved for backwards-incompatible, breaking changes, and that’s what this is. You cannot simultaneously store your data across both multiple pillars and a single source of truth.

Incompatible. Breaking change. O11y 1.0, meet O11y 2.0.

small technical changes can unlock waves of powerful sociotechnical transformation

There are a LOT of ramifications and consequences that flow from this one small change in how your data gets stored. I don’t have the time or space to go into all of them here, but I will do a quick overview of the most important ones.

The historical analogue that keeps coming to mind for me is virtualization. VMs are old technology, they’ve been around since the 70s. But it wasn’t until the late 90s that VMware productized it, unlocking wave after wave of change, from cloud computing and SaaS to the very DevOps movement itself.

I believe the shift to observability 2.0 holds a similarly massive potential for change, based on what I see happening today, with teams who have already made the leap. Why? In a word, precision. O11y 1.0 can only ever give you aggregates and random exemplars. O11y 2.0, on the other hand, can tell you precisely what happened when you flipped a flag, deployed to a canary, or made any other change in production.

Will these waves of sociotechnical transformation ever be realized? Who knows. The changes that get unlocked will depend to some extent on us (Honeycomb), and to an even greater extent on engineers like you. Anyway, I’ll talk about this more some other time. Right now, I just want to establish a baseline for this vocabulary.

1.0 vs 2.0: How does the data get stored?

1.0💙 O11y 1.0 has many sources of truth, in many different formats. Typically, you end up storing your data across metrics, logs, traces, APM, RUM, profiling, and possibly other tools as well. Some folks even find themselves falling back to B.I. (business intelligence) tools like Tableau in a pinch to understand what’s happening on their systems.

Each of these tools are siloed, with no connective tissue, or only a few, predefined connective links that connect e.g. a specific metric to a specific log line. Aggregation is done at write time, so you have to decide up front which data points to collect and which questions you want to be able to ask. You may find yourself eyeballing graph shapes and assuming they must be the same data, or copy-pasting IDs around from logging to tracing tools and back.

2.0 💚 Data gets stored in arbitrarily-wide structured log events (often called “canonical logs“), often with trace and span IDs appended. You can visualize the events over time as a

trace, or slice and dice your data to zoom in to individual events, or zoom out to a birds-eye view. You can interact with your data by group by, break down, etc.

You aggregate at read time, and preserve raw events for ad hoc querying. Hopefully, you derive your SLO data from the same data you query! Think of it as B.I. for systems/app/business data, all in one place. You can derive metrics, or logs, or traces, but it’s all the same data.

1.0 vs 2.0: on metrics vs logs

1.0 💙 The workhorse of o11y 1.0 is metrics. RUM tools are built on metrics to understand browser user sessions. APM tools are built using metrics to understand application performance. Long ago, the decision was made to use metrics as the source of truth for telemetry because they are cheap and fast, and hardware used to be incredibly expensive.

The more complex our systems get, the worse of a tradeoff this becomes. Metrics are a terrible building block for understanding rich data, because you have to discard all that valuable context at write time, and they don’t support high (or even medium!) cardinality data. All you can do to enrich the data is via tags.

Metrics are a great tool for cheaply summarizing vast quantities of data. They are not equipped to help you introspect or understand complex systems. You will go broke and go mad if you try.

2.0 💚 The building block of o11y 2.0 is wide, structured log events. Logs are infinitely more powerful, useful and cost-effective than metrics are because they preserve context and relationships between data, and data is made valuable by context. Logs also allow you to capture high cardinality data and data relationships/structures, which give you the ability to compute outliers and identify related events.

1.0 vs 2.0: Who uses it, and how?

1.0 💙 Observability 1.0 is predominantly about how you operate your code. It centers around errors, incidents, crashes, bugs, user reports and problems. MTTR, MTTD, and reliability are top concerns.

O11y 1.0 is typically consumed using static dashboards — lots and lots of static dashboards. “Single pane of glass” is often mentioned as a holy grail. It’s easy to find something once you know what you’re looking for, but you need to know to look for it before you can find it.

2.0 💚 If o11y 1.0 is about how you operate your code, o11y 2.0 is about how you develop your code. O11y 2.0 is what underpins the entire software development lifecycle, enabling engineers to connect feedback loops end to end so they get fast feedback on the changes they make, while it’s still fresh in their heads. This is the foundation of your team’s ability to move swiftly, with confidence. It isn’t just about understanding bugs and outages, it’s about proactively understanding your software and how your users are experiencing it.

Thus, o11y 2.0 has a much more exploratory, open-ended interface. Any dashboards should be dynamic, allowing you to drill down into a question or follow a trail of breadcrumbs as part of the debugging/understanding process. The canonical question of o11y 2.0 is “here’s a thing I care about … why do I care about it? What are all of the ways it is different from all the other things I don’t care for?”

When it comes to understanding your software, it’s often harder to identify the question than the answer. Once you know what the question is, you probably know the answer too. With o11y 1.0, it’s very easy to find something once you know what you’re looking for. With o11y 2.0, that constraint is removed.

1.0 vs 2.0: How do you interact with production?

1.0 💙 You deploy your code and wait to get paged. 🤞 Your job is done as a developer when you commit your code and tests pass.

2.0 💚 You practice observability-driven development: as you write your code, you instrument it. You deploy to production, then inspect your code through the lens of the instrumentation you just wrote. Is it behaving the way you expected it to? Does anything else look … weird?

Your job as a developer isn’t done until you know it’s working in production. Deploying to production is the beginning of gaining confidence in your code, not the denouement.

1.0 vs 2.0: How do you debug?

1.0 💙 You flip from dashboard to dashboard, pattern-matching and looking for similar shapes with your eyeballs.

You lean heavily on intuition, educated guesses, past experience, and a mental model of the system. This means that the best debuggers are ALWAYS the engineers who have been there the longest and seen the most.

Your debugging sessions are search-first: you start by searching for something you know should exist.

2.0 💚 You check your instrumentation, or you watch your SLOs. If something looks off, you see what all the mysterious events have in common, or you start forming hypotheses, asking a question, considering the result, and forming another one based on the answer. You interrogate your systems, following the trail of breadcrumbs to the answer, every time.

You don’t have to guess or rely on elaborate, inevitably out-of-date mental models. The data is right there in front of your eyes. The best debuggers are the people who are the most curious.

Your debugging questions are analysis-first: you start with your user’s experience.

1.0 vs 2.0: The cost model

1.0 💙 You pay to store your data again and again and again and again, multiplied by all the different formats and tool types you are paying to store it in. Cost goes up at a multiplier of your traffic increase. I wrote a whole piece earlier this year on the cost crisis in observability tooling, so I won’t go into it in depth here.

As your costs increase, the value you get out of your tools actually decreases.

If you are using metrics-based products, your costs go up based on cardinality. “Custom metrics” is a euphemism for “cardinality”; “100 free custom metrics” actually means “100 free cardinality”, aka unique values.

2.0 💚 You pay to store your data once. As your costs go up, the value you get out goes up too. You have powerful, surgical options for controlling costs via head-based or tail-based dynamic sampling.

You can have infinite cardinality. You are encouraged to pack hundreds or thousands of dimensions in per event, and any or all of those dimensions can be any data type you want. This luxurious approach to cardinality and data is one of the least well understood aspects of the switch from o11y 1.0 to 2.0.

Many observability engineering teams have spent their entire careers massaging cardinality to control costs. What if you just .. didn’t have to do that? What would you do with your lives? If you could just store and query on all the crazy strings you want, forever? 🌈

Metrics are a bridge to our past

Why are observability 1.0 tools so unbelievably, eyebleedingly expensive? As anyone who works with data can tell you, this is always what happens when you use the wrong tool for the job. Once again, metrics are a great tool for summarizing vast quantities of data. When it comes to understanding complex systems, they flail.

I wrote a whole whitepaper earlier this year that did a deep dive into exactly why tools built on top of metrics are so unavoidably costly. If you want the gnarly detail, download that.

The TLDR is this: tools built on metrics — whether RUM, APM, dashboards, etc — are a bridge to our past. If there’s one thing I’m certain of, it’s that tools built on top of wide, structured logs are the bridge to our future.

Wide, structured log events are the bridge to our future

Five years from now, I predict that the center of gravity will have swung dramatically; all modern engineering teams will be powering their telemetry off of tools backed by wide, structured log events, not metrics. It’s getting harder and harder and harder to try and wring relevant insights out of metrics-based observability tools. The end of the ZIRP era is bringing unprecedented cost pressure to bear, and it’s simply a matter of time.

The future belongs to tools built on wide, structured log events — a single source of truth that you can trace over time, or zoom in, zoom out, derive SLOs from, etc.

It’s the only way to understand our systems in all their skyrocketing complexity. This constant dance with cost vs cardinality consumes entire teams worth of engineers and adds zero value. It adds negative value.

And here’s the weirdest part. The main thing holding most teams back psychologically from embracing o11y 2.0 seems to be the entrenched difficulties they have grappling with o11y 1.0, and their sense that they can’t adopt 2.0 until they get a handle on 1.0. Which gets things exactly backwards.

Because observability 2.0 is so much easier, simpler, and more cost effective than 1.0.

observability 1.0 is the hard way

It’s so fucking hard. We’ve been doing it so long that we are blind to just how HARD it is. But trying to teach teams of engineers to wrangle metrics, to squeeze the questions they want to ask into multiple abstract formats scattered across many different tools, with no visibility into what they’re doing until it comes out eventually in form of a giant bill… it’s fucking hard.

Observability 2.0 is so much simpler. You want data, you just toss it in. Format? don’t care. Cardinality? don’t care.

You want to ask the question, you just ask it. Format? don’t care.

Teams are beating themselves up trying to master an archaic, unmasterable set of technical tradeoffs based on data types from the 80s. It’s an unwinnable war. We can’t understand today’s complex systems without context-rich, explorable data.

We need more options for observability 2.0 tooling

My hope is that by sketching out these technical differences between o11y 1.0 and 2.0, we can begin to collect and build up a vendor-neutral library of o11y 2.0 options for folks. The world needs more options for understanding complex systems besides just Honeycomb and Baselime.

The world desperately needs an open source analogue to Honeycomb — something built for wide structured events, stored in a columnar store (or even just Clickhouse), with an interactive interface. Even just a written piece on how you solved it at your company would help move the industry forward.

My other hope is that people will stop building new observability startups built on metrics. Y’all, Datadog and Prometheus are the last, best metrics-backed tools that will ever be built. You can’t catch up to them or beat them at that; no one can. Do something different. Build for the next generation of software problems, not the last generation.

If anyone knows of anything along these lines, please send me links? I will happily collect them and signal boost. Honeycomb is a great, lifechanging tool (and we have a generous free tier, hint hint) but one option does not a movement make.

<3 charity

P.S. Here’s a great piece written by Ivan Burmistrov on his experience using observability 2.0 type tooling at Facebook — namely Scuba, which was the inspiration for Honeycomb. It’s a terrific piece and you should read it.

P.P.S. And if you’re curious, here’s the long twitter thread I wrote in October of 2023 on how we lost the battle to define observability:

So, we lost the battle to define observability. You know it, I know it. Observability was supposed to *mean* something, and in the early days, it did.

"Observability" once meant the kind of exploratory, open ended investigation our systems increasingly demand.
— Charity Majors (@mipsytipsy) October 30, 2023

Live Your Best Life With Structured Events

August 15, 2022July 14, 2023 mipsytipsyengineering, events, logs, metrics, monitoring, traces6 Comments

If you’re like most of us, you learned to debug as a baby engineer by way of printf(3). By the time you were shipping code to production you had probably learned to instrument your code with a real metrics library. Maybe a tenth of us learned to use gdb and still step through functions on the regular. (I said maybe.)

Printing stuff to stdout is still the Swiss Army knife of tools. Always there when you reach for it, usually helps more than it does harm. (I said usually.)

And then! In case you’ve been living under a rock, we recently went and blew up ye aulde monolythe, and in the process we … lost most of our single-process tools and techniques for debugging. Forget gdb; even printf doesn’t work when you’re hopping the network between functions.

If your tool set no longer works for you, friend, it’s time to go all in. Maybe what you wanted was a faster horse, but it’s time for a car, and the sooner you turn in your oats for gas cans and a spare tire, the better.

Exercising Good Technical Judgment (When You Don’t Have Any)

If you’re stuck trying to debug modern problems with pre-modern tooling, the first thing to do is stop digging the hole. Stop pushing good data after bad into formats and stores that aren’t going to help you answer the right questions.

In brief: if you aren’t rolling out a solution based on arbitrarily wide, structured raw events that are unique and ordered and trace-aware and without any aggregation at write time, you are going to regret it. (If you aren’t using OpenTelemetry, you are going to regret that, too.)

So just make the leap as soon as possible.

But let’s rewind a bit. Let’s start with observability.

Observability: an introduction

Observability is not a new word or concept, but the definition of observability as a specific technical term applied to software engineering is relatively new — about four years old. Before that, if you heard the term in softwareland it was only as a generic synonym for telemetry (“there are three pillars of observability”, in one annoying formulation) or team names (twitter, for example, has long had an “observability team”).

The term itself originates with control theory:

“In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals. The concept of observability was introduced by Hungarian-American engineer Rudolf E. Kálmán for linear dynamic systems.[1][2]”

But when applied to a software context, observability refers to how well you can understand and reason about your systems, just by interrogating them and inspecting their outputs with your tools. How well can you understand the inside of the system from the outside?

Achieving this relies your ability to ask brand new questions, questions you have never encountered and never anticipated — without shipping new code. Shipping new code is cheating, because it means that you knew in advance what the problem was in order to instrument it.

But what about monitoring?

Monitoring has a long and robust history, but it has always been about watching your systems for failures you can define and expect. Monitoring is for known-unknowns, and setting thresholds and running checks against the system. Observability is about the unknown-unknowns. Which requires a fundamentally different mindset and toolchain.

“Monitoring is the action of observing and checking the behavior and outputs of a system and its components over time.” — @grepory, in his talk “Monitoring is Dead“.

Monitoring is a third-person perspective on your software. It’s not software explaining itself from the inside out, it’s one piece of software checking up on another.

Observability is for understanding complex, ephemeral, dynamic systems (not for debugging code)

You don’t use observability for stepping through functions; it’s not a debugger. Observability is for swiftly identifying where in your system the error or problem is coming from, so you can debug it — by reproducing it, or seeing what it has in common with other erroring requests. You can think of observability as being like B.I. (business intelligence) tooling for software applications, in the way you engage in a lot of exploratory, open-ended data sifting to detect novel patterns and behaviors.

Observability is often about swiftly isolating or tracking down the problem in your large, sprawling, far-flung, dynamic system. Because the hard part of distributed systems is rarely debugging the code, it’s figuring out where the code you need to debug is.

The need for observability is often associated with microservices adoption, because they are prohibitively difficult to debug without service-level event oriented tooling — the kind you can get from Honeycomb and Lightstep.. and soon, I hope, many other vendors.

Events are the building blocks of observability

Ergh, another overloaded data term. What even is an “event”?

An observability “event” is a hop in the lifecycle of an end-to-end request. If a request executes code on three services separated by network hops before returning to the user, that request generated three observability “events”, each packed with context and details about that code running in that environment. These are also sometimes called “canonical log lines“. If you implemented tracing, each event may be a span in your trace.

If request ID #A897BEDC hits your edge, then your API service, then four more internal services, and twice connects to a db and runs a query, then request ID #A897BEDC generated 8 observability events … assuming you are in fact gathering observability data from the edge, the API, the internal services and the databases.

This is an important caveat. We only gather observability events from services that we can and do introspect. If it’s a black box to us, that hop cannot generate an observability event. So if request ID #A897BEDC also performed 20 cache lookups and called out to 8 external HTTP services and 2 managed databases, those 30 hops do not generate observability events (assuming you haven’t instrumented the memcache service and have no instrumentation from those external services/dbs). Each request generates one event per request per service hop.**

(I also wrote about logs vs structured events here.)

Observability is a first-person narrative.

We care primarily about self-reported status from the code as it executes the request path.

Instrumentation is your eyes and ears, explaining the software and its environment from the perspective of your code. Monitoring, on the other hand, is traditionally a third-person narrative — it’s one piece of software checking up on another piece of software, with no internal knowledge of its hopes and dreams.

First-person narrative reports have the best potential for telling a reliable narrative. And more importantly, they map directly to user experience in a way that third-party monitoring does not and cannot.

Events … must be structured.

First, structure your goddamn data. You’re a computer scientist, you’ve got no business using text search to plow through terabytes of text.

Events … are not just structured logs.

Now, part of the reason people seem to think structured data is cost-prohibitive is that they’re doing it wrong. They’re still thinking about these like log lines. And while you can look at events like they’re just really wide structured log lines that aren’t flushed to disk, here’s why you shouldn’t: logs have decades of abhorrent associations and absolutely ghastly practices.

Instead of bundling up and passing along one neat little pile of context, they’re spewing log lines inside loops in their code and DDoS’ing their own logging clusters.They’re shitting out “log lines” with hardly any dimensions so they’re information-sparse and just straight up wasting the writes. And then to compensate for the sparseness and repetitiveness they just start logging the same exact nouns tens or hundreds of times over the course of the request, just so they can correlate or reconstruct some lousy request that they never should have blown up in the first place!

But they keep hearing they should be structuring their logs, so they pile structure on to their horrendous little strings, which pads every log line by a few bytes, so their bill goes up but they aren’t getting any benefit! just paying more! What the hell, structuring is bull shit!

Kittens. You need a fundamentally different approach to reap the considerable benefits of structuring your data.

But the difference between strings and structured data is ~basically the difference between grep and all of computer science. 😛

Events … must be arbitrarily wide and dense with context.

So the most effective way to structure your instrumentation, to get the absolute most bang for your buck, is to emit a single arbitrarily wide event per request per service hop. At Honeycomb, the maturely instrumented datasets that we see are often 200-500 dimensions wide. Here’s an event that’s just 20 dimensions wide:

{ 

   "timestamp":"2018-11-20 19:11:56.910",
   "az":"us-west-1",
   "build_id":"3150",
   "customer_id":"2310",
   "durationMs":167,
   "endpoint":"/api/v2/search",
   "endpoint_shape":"/api/v2/search",
   "fraud_dur":131,
   "hostname":"app14",
   "id":"f46691dfeda9ede4",
   "mysql_dur":"",
   "name":"/api/v2/search",
   "parent_id":"",
   "platform":"android",
   "query":"",
   "serviceName":"api",
   "status_code":200,
   "traceId":"f46691dfeda9ede4",
   "user_id":"344310",
   "error_rate":0,
   "is_root":"true"
}

So a well-instrumented service should have hundreds of these dimensions, all bundled around the context of each request. And yet — and here’s why events blow the pants off of metrics — even with hundreds of dimensions, it’s still just one write. Adding more dimensions to your event is effectively free, it’s still one write plus a few more bits.

Compare this to a metric-based systems, where you are often in the position of trying to predict whether a metric will be valuable enough to justify the extra write, because every single metric or tag you add contributes linearly to write amplification. Ever gotten billed tens of thousands of dollars for your custom metrics, or had to prune your list of useful custom metrics down to something affordable? (“BUT THOSE ARE THE ONLY USEFUL ONES!”, as every ops team wails)

Events … must pass along the blob of context as the request executes

As you can imagine, it can be a pain in the ass to keep passing this blob of information along the life of the request as it hits many services and databases. So at Honeycomb we do all the annoying parts for you with our integrations. You just install the go pkg or ruby gem or whatever, and under the hood we:

initialize an empty debug event when the request enters that service
prepopulate the empty debug event with any and all interesting information that we already know or can guess. language type, version, environment, etc.
create a framework so you can just stuff any other details in there as easily as if you were printing it out to stdout
pass the event along and maintain its state until you are ready to error or exit
write the extremely wide event out to honeycomb

Easy!

(Check out this killer talk from @lyddonb on … well everything you need to know about life, love and distributed systems is in here, but around the 12:00 mark he describes why this approach is mandatory. WATCH IT. https://www.youtube.com/watch?v=xy3w2hGijhE&feature=youtu.be)

Events … should collect context like sticky buns collect dust

Other stuff you’ll want to track in these structured blobs includes:

Metadata like src, dst headers
The timing stats and contents of every network call (our beelines wrap all outgoing http calls and db queries automatically)
Every raw db query, normalized query family, execution time etc
Infra details like AZ, instance type, provider
Language/environment context like $lang version, build flags, $ENV variables
Any and all unique identifying bits you can get your grubby little paws on — request ID, shopping cart ID, user ID, request ID, transaction ID, any other ID … these are always the highest value data for debugging.
Any other useful application context. Service name, build id, ordering info, error rates, cache hit rate, counters, whatever.
Possibly the system resource state at this point in time. e.g. values from /proc/net/ipv4 stats

Capture all of it. Anything that ever occurs to you (“this MIGHT be handy someday”) — don’t even hesitate, just throw it on the pile. Collect it up in one rich fat structured blob.

Events … must be unique, ordered, and traceable

You need a unique request ID, and you need to propagate it through your stack in some way that preserves sequence. Once you have that, traces are just a beautiful visualization layer on top of your shiny event data.

Events … must be stored raw.

Because observability means you need to be able to ask any arbitrary new question of your system without shipping new code, and aggregation is a one-way trip. Once you have aggregated your data and discarded the raw requests, you have destroyed your ability to ask new questions of that data forever. For Ever.

Aggregation is a one-way trip. You can always, always derive your pretty metrics and dashboards and aggregates from structured events, and you can never go in reverse. Same for traces, same for logs. The structured event is the gold standard. Invest in it now, save your ass in the future.

It’s only observability if you can ask new questions. And that means storing raw events.

Events…are richer than metrics

There’s always tradeoffs when it comes to data. Metrics choose to sacrifice context and connective tissue, and sometimes high cardinality support, which you need to correlate anomalies or track down outliers. They have a very small, efficient data format, but they sacrifice everything else by discarding all but the counter, gauge, etc.

A metric looks like this, by the way.

{ metric: "db.query.time", value: 0.502, tags: Array(), type: set }

That’s it. It’s just a name, a number and maybe some tags. You can’t dig into the event and see what else was happening when that query was strangely slow. You can never get that information back after discarding it at write time.

But because they’re so cheap, you can keep every metric for every request! Maybe. (Sometimes.) More often, what happens is they aggregate at write time. So you never actually get a value written out for an individual event, it smushes everything together that happens in the 1 second interval and calculates some aggregate values to write out. And that’s all you can ever get back to.

With events, and their relative explosion of richness, we sacrifice our ability to store every single observability event about every request. At FB, every request generated hundreds of observability events as it made its way through the stack. Nobody, NOBODY is going to pay for an o11y stack that is hundreds of times as large as production. The solution to that problem is sampling.

Events…should be sampled.

But not dumb, blunt sampling at server side. Control it on the client side.

Then sample heavily for events that are known to be common and useless, but keep the events that have interesting signal. For example: health checks that return 200 OK usually represent a significant chunk of your traffic and are basically useless, while 500s are almost always interesting. So are all requests to /login or /payment endpoints, so keep all of them. For database traffic: SELECTs for health checks are useless, DELETEs and all other mutations are rare but you should keep all of them. Etc.

You don’t need to treat your observability metadata with the same care as you treat your billing data. That’s just dumb.

… To be continued.

I hope it’s now blazingly obvious why observability requires — REQUIRES — that you have access to raw structured events with no pre-aggregation or write-time rollups. Metrics don’t count. Just traces don’t count. Unstructured logs sure as fuck don’t count.

Structured, arbitrarily wide events, with dynamic sampling of the boring parts to control costs. There is no substitute.

For more about the technical requirements for observability, read this, this, or this.

**The deep fine print: it’s one observability event per request per service hop … because we gather observability detail organized by request id. Databases may be different. For example, with MongoDB or MySQL, we can’t instrument them to talk to honeycomb directly, so we gather information about its internal perspective by 1) tailing the slow query log (and turning it up to log all queries if perf allows), 2) streaming tcp over the wire and reconstructing transactions, 3) connecting to the mysql port as root every couple seconds from cron, then dumping all mysql stats and streaming them in to honeycomb as an event. SO. Database traffic is not organized around connection length or unique request id, it is organized around transaction id or query id. Therefore it generates one observability event per query or transaction.

In other words: if your request hit the edge, API, four internal services, two databases … but ran 1 query on one db and 10 queries on the second db … you would generate a total of *19 observability events* for this request.

For more on observability for databases and other black boxes, try this blog post.

The Truth About “MEH-TRICS”

April 13, 2022January 10, 2024 mipsytipsycrossposted, engineering, logs, metrics, monitoringLeave a comment

First published on 2022-04-13 at https://www.honeycomb.io/blog/truth-about-meh-trics-metrics

A long time ago, in a galaxy far, far away, I said a lot of inflammatory things about metrics.

“Metrics are shit salad.”

“Metrics are simply nerfed dimensions.”

“Metrics suck,” “metrics are legacy,” “metrics and time series aggregates will fucking kneecap you.”

I cannot tell a lie; Twitter will testify that I’ve spent the past six years ragging on metrics. So much so that ever since we launched Honeycomb Metrics last year, our poor solution architects have been encountering skeptics in the field who repeat my quotes back to them and ask, dubiously, whether Honeycomb Metrics are any good or not, and whether we genuinely plan on investing in it or not, given our known anti-metrics sympathies.

That’s a great question. 😊

Metrics aren’t worthless; they’re just limited.

Metrics are a mature technology that’s been around for over 30 years, and they have some real advantages. They’re tiny, fast, and cheap; you can hold a bunch of them in memory as counters, summaries, and gauges. They aggregate well and take up a fixed amount of storage space. The entire monitoring industry is built on top of metrics.

When it comes to workloads like, “How heavy is the write load on my hard drive?” or “What is the temperature or fan status inside my chassis?” or “What is the traffic rate in and out of this interface on my switch?” metrics are what you should use. In fact, pretty much any time you want to know the health of a system or component in toto, metrics are the right tool.

Because that’s what metrics do best—report statistics in aggregate, from the perspective of any system or component. They can tell you that your Ruby HTTP worker pool is 70% utilized or that your nginx webserver is returning 502s 1% of the time. What they can’t tell you is what this means for any one of your users, applications, delivery vehicles, and so forth.

Until recently, metrics-based tools or logs were the only game in town. People were trying to sell us metrics tools for observability use cases, and that’s what got my goat so badly. If you simply append “… for observability” to each of my inflammatory statements, then I stand by them completely.

“Metrics are shit salad … for observability.”

Yup, rings true.

You’re never going to make a metrics tool like Prometheus or Datadog into an observability tool. You’re just not. Observability is about unknown-unknowns, while metrics are a tool for known-unknowns.

If you need a refresher on the differences between observability and monitoring, I’ll refer you to pieces like this, this, and this. What I want to talk about here is slightly different. In a post-observability world, what is the true and proper place for metrics tooling?

Metrics and observability have different use cases.

Metrics aren’t completely useless, even if you have a robust observability presence. We still use metrics at Honeycomb to this day for certain workloads—and always will because they’re the right tool for the job.

There are two kinds of workloads, roughly speaking: your code—the code you write, review, ship, debug and maintain on a daily basis. And other people’s code—the code you have to run and use in order to support your code. Some examples of the latter might be: Linux, Docker, MySql, Amazon RDS, Kafka, AWS Lambda, GCP gateways, memcache, CI/CD pipelines, Kubernetes, etc.

Your code is your crown jewels, the code you need to survive and succeed as a business. It changes constantly—many times per week, if not per day. You are expected to understand its inner workings intimately, and spend lots of time chasing down bugs or understanding and reproducing behavior. You care about the way it performs and interacts with each and every individual user, with changing infrastructure state, and under a variety of different load conditions.

That is why your code demands observability. In order to understand your software, you must first instrument it, in a way that collects lots of rich context and bundles it up around each event end-to-end. Then you need to stream those events into a tool that lets you slice and dice and trace and explore with support for high-cardinality and high-dimensionality data. That’s the only way you’re going to be able to correlate errors, track down outliers, and reflect each user’s experience.

But what about the rest of the software? You can’t instrument Amazon RDS, and only crazy people would instrument, rebuild, and repackage things like Kafka or Docker or nginx. The whole point of third-party software is that you DON’T USE IT until it’s stable enough to be taken more or less for granted. Sure, you roll updates, but usually on the order of months or years—not every day. You don’t need to be intimately familiar with its inner workings because you aren’t changing it every day. Those aren’t your crown jewels.

You do care about their health though, only differently. You care about whether you need to provision more capacity or not. You care about knowing how hard you’re hammering on the underlying hardware or hypervisor. That’s why metrics and monitoring are the right tools to use for third-party code. They don’t let you peer under the hood in the same way, or slice and dice in the same way, but that’s okay. You shouldn’t have to.

With third-party stuff, you don’t care about the code, you care about the health of the service. In aggregate.

(There are some kinds of in-between software, like databases, where event-level information is super useful for debugging things like slow queries and lock percentages, and you can use various black box techniques to approximate observability without instrumentation. But in general this model holds up quite well.)

In a post-observability world, what are metrics for?

I’ve often pointed out that observability is built on top of arbitrarily wide structured data blobs, and that metrics, logs, and traces can be derived from those blobs while the reverse is not true—you can’t take a bunch of metrics and reformulate a rich event.

And yes, people who have observability typically find themselves using metrics and dashboards less and less. They’re simply not as versatile or useful as events that you can slice and dice and manipulate in infinite ways. And you can derive aggregates and trends from the events you have stored.

But metrics will always be useful for understanding third-party software, from the perspective of the service, cluster, or node. They will always be the right tool for the job when it comes to software interfacing with hardware. And they can be super complementary when you are investigating your code using events and instrumentation.

If you’re an engineer writing and shipping code, you’re never not going to want to know if your change caused memory usage to triple, or CPU utilization to skyrocket, or disk usage or network throughput to saturate. That’s why we built Honeycomb Metrics as an overlay, a way to enhance or validate your understanding of the impact your code changes have had on the underlying system.

Metrics are also valuable as a bridge to the past. People have been instrumenting software for metrics for 30 years—they’re never going away completely, and not everything can or should be reinstrumented with events. Lots of people already have robust monitoring systems that slurp in millions of metrics. Nobody wants to have to redo all that work just because they’re moving to a different tool, so people tend to point their metrics firehose at Honeycomb as a way of getting started as they roll observability out into their code.

Notes on the Perfidy of Dashboards

August 9, 2021July 14, 2023 mipsytipsydashboards, monitoring, operations, sre9 Comments

The other day I said this on twitter —

every dashboard is a sunk cost
every dashboard is an answer to some long-forgotten question
every dashboard is an invitation to pattern-match the past instead of interrogate the present
every dashboard gives the illusion of correlation
every dashboard dampens your thinking https://t.co/OIEowa1COa

— Charity Majors (@mipsytipsy) July 19, 2021

… which stirred up some Feelings for many people. 🙃 So I would like to explain my opinions in more detail.

Static vs dynamic dashboards

First, let’s define the term. When I say “dashboard”, I mean STATIC dashboards, i.e. collections of metrics-based graphs that you cannot click on to dive deeper or break down or pivot. If your dashboard supports this sort of responsive querying and exploration, where you can click on any graph to drill down and slice and dice the data arbitrarily, then breathe easy — that’s not what I’m talking about. Those are great. (I don’t really consider them dashboards, but I have heard a few people refer to them as “dynamic dashboards”.)

Actually, I’m not even “against” static dashboards. Every company has them, including Honeycomb. They’re great for getting a high level sense of system functioning, and tracking important stats over long intervals. They are a good starting point for investigations. Every company should have a small, tractable number of these which are easily accessible and shared by everyone.

Debugging with dashboards: it’s a trap

What dashboards are NOT good at is debugging, or understanding or describing novel system states.

I can hear some of you now: “But I’ve debugged countless super-hard unknown problems using only static dashboards!” Yes, I’m sure you have. If all you have is a hammer, you CAN use it to drive screws into the wall, but that doesn’t mean it’s the best tool. And It takes an extraordinary amount of knowledge and experience to be able to piece together a narrative that translates low-level system statistics into bugs in your software and back. Most software engineers don’t have that kind of systems experience or intuition…and they shouldn’t have to.

Why are dashboards bad for debugging? Think of it this way: every dashboard is an answer to a question someone asked at some point. Your monitoring system is probably littered with dashboards, thousands and thousands of them, most of whose questions have been long forgotten and many of whose source data streams have long since gone silent.

So you come along trying to investigate something, and what do you do? You start skimming through dashboards, eyes scanning furiously, looking for visual patterns — e.g. any spikes that happened around the same time as your incident. That’s not debugging, that’s pattern-matching. That’s … eyeball racing.

if we did math like we do dashboards

Imagine you’re in a math competition, and you get handed a problem to solve. But instead of pulling out your pencil and solving the equation, step by step, you start hollering out guesses.

“27!”
“19992.41!”
“1/4325!”

That’s what flipping through dashboards feels like to me. You’re riffling through a bunch of graphs that were relevant to some long-ago situation, without context or history, without showing their work. Sometimes you’ll spot the exact scenario, and — huzzah! — the number you shout is correct! But when it comes to unknown scenarios, the odds are not in your favor.

Debugging looks and feels very different from flipping through answers. You ask a question, examine the answer, and ask another question based on the result. (“Which endpoints were erroring? Are all of the requests erroring, or only some? What did they have in common?”, etc.)

You methodically put one foot in front of the other, following the trail of bread crumbs, until the data itself leads you to the answer.

The limitations of metrics and dashboards

Unfortunately, you cannot do that with metrics-based dashboards, because you stripped away the connective tissue of the event back when you wrote the metrics out to disk.

If you happened to notice while skimming through dashboards that your 404 errors spiked at 14:03, and your /payment and /import endpoints started erroring at 14.03, and your database started returning a bunch of mysql errors shortly after 14:00, you’ll probably assume that they’re all related and leap to find more evidence that confirms it.

But you cannot actually confirm that those events are the same ones, not with your metrics dashboards. You cannot drill down from errors to endpoints to error strings; for that, you’d need a wide structured data blob per request. Those might in fact be two or three separate outages or anomalies happening at the same time, or just the tip of the iceberg of a much larger event, and your hasty assumptions might extend the outage for much longer than was necessary.

With metrics, you tend to find what you’re looking for. You have no way to correlate attributes between requests or ask “what are all of the dimensions these requests have in common?”, or to flip back and forth and look at the request as a trace. Dashboards can be fairly effective at surfacing the causes of problems you’ve seen before (raise your hand if you’ve ever been in an incident review where one of the follow up tasks was, “create a dashboard that will help us find this next time”), but they’re all but useless for novel problems, your unknown-unknowns.

Other complaints about dashboards:

They tend to have percentiles like 95th, 99th, 99.9th, 99.99th, etc. Which can cover over a multitude of sins. You really want a tool that allows you to see MAX and MIN, and heatmap distributions.

A lot of dashboards end up getting created that are overly specific to the incident you just had — naming specific hosts, etc — which just creates clutter and toil. This is how your dashboards become that graveyard of past outages.

The most useful approach to dashboards is to maintain a small set of them; cull regularly, and think of them as a list of starter queries for your investigations.

Fred Hebert has this analogy, which I like:

“I like to compare the dashboards to the big display in a hospital room: heartbeat, pressure, oxygenation, etc. Those can tell you when a thing is wrong, but the context around the patient chart (and the patient themselves) is what allows interpretation to be effective. If all we have is the display but none of the rest, we’re not getting anywhere close to an accurate picture. The risk with the dashboard is having the metrics but not seeing or knowing about the rest changing.”

In conclusion

Dashboards aren’t universally awful. The overuse of them just encourages sloppy thinking, and static ones make it impossible for you to follow the plot of an outage, or validate your hypotheses. 🤒 There’s too many of them, and not enough shared consensus. (It would help if, like, new dashboards expired within a month if nobody looked at them again.)

If what you have is “nothing”, even shitty dashboards are far better than no dashboards. But shitty dashboards have been the only game in town for far too long. We need more vendors to think about building for queryability, explorability, and the ability to follow a trail of breadcrumbs. Modern systems are going to demand more and more of this approach.

Nothing < Dashboards < a Queryable, Exploratory Interface

If everyone out there who slaps “observability” on their web page also felt the responsibility to add an observability-enabling interface to their tool, one that would let users explore and identify unknown-unknowns, we would all be in a far better place. 🙂

Questionable Advice: War Rooms? Really?!?

September 2, 2020July 14, 2023 mipsytipsyadvice, culture, management, monitoring, operations, sre, tech culture2 Comments

My company has recently begun pushing for us to build and staff out what I can only describe as “command centers”. They’re picturing graphs, dashboards…people sitting around watching their monitors all day just to find out which apps or teams are having issues. With your experience in monitoring and observability, and your opinions on teams supporting their own applications…do you think this sounds like a bad idea? What are things to watch out for, or some ways this might all go sideways?

— Anonymous

Jesus motherfucking Christ on a stick. Is it 1995 where you work? That’s the only way I can try and read this plan like it makes sense.

It’s a giant waste of money and no, it won’t work. This path leads into a death spiral where alarms are going off constantly (yet somehow never actually catching the real problems), people getting burned out, and anyone competent will either a) leave or b) refuse to be on call. Sideways enough for you yet?

Snark aside, there are two foundational flaws with this plan.

1) watching graphs is pointless. You can automate that shit, remember? ✨Computers!✨ Furthermore, this whole monitoring-based approach will only ever help you find the known unknowns, the problems you already know to look for. But most of your actual problems will be unknown unknowns, the ones you don’t know about yet.

2) those people watching the graphs… When something goes wrong, what exactly can they do about it? The answer, unfortunately, is “not much”. The only people who can swiftly diagnose and fix complex systems issues are the people who build and maintain those systems, and those people are busy building and maintaining, not watching graphs.

That extra human layer is worse than useless; it is actively harmful. By insulating developers from the consequences of their actions, you are concealing from them the information they need to understand the consequences of their actions. You are interfering with the most basic of feedback loops and causing it to malfunction.

The best time to find a bug is as soon as possible after writing it, while it’s all fresh in your head. If you let it fester for days, weeks, or months, it will be exponentially more challenging to find and solve. And the best people to find those bugs are the people who wrote them

Helpful? Hope so. Good luck. And if they implement this anyway — leave. You deserve to work for a company that won’t waste your fucking time.

with love, charity.

Observability is a Many-Splendored Definition

March 3, 2020July 14, 2023 mipsytipsyapm, instrumentation, logs, metrics, monitoring, traces12 Comments

Last weekend, @swyx posted a great little primer to instrumentation titled “Observability Tools in JavaScript”. A friend sent me the link and suggested that I might want to respond and clarify some things about observability, so I did, and we had a great conversation! Here is a lightly edited transcript of my reply tweet storm.

First of all, confusion over terminology is understandable, because there are some big players out there actively trying to confuse you! Big Monitoring is indeed actively trying to define observability down to “metrics, logs and traces”. I guess they have been paying attention to the interest heating up around observability, and well… they have metrics, logs, and tracing tools to sell? So they have hopped on the bandwagon with some undeniable zeal.

But metrics, logs and traces are just data types. Which actually has nothing to do with observability. Let me explain the difference, and why I think you should care about this.

“Observability? I do not think it means what you think it means.”

Observability is a borrowed term from mechanical engineering/control theory. It means, paraphrasing: “can you understand what is happening inside the system — can you understand ANY internal state the system may get itself into, simply by asking questions from the outside?” We can apply this concept to software in interesting ways, and we may end up using some data types, but that’s putting the cart before the horse.

It’s a bit like saying that “database replication means structs, longints and elegantly diagrammed English sentences.” Er, no.. yes.. missing the point much?

This is such a reliable bait and switch that any time you hear someone talking about “metrics, logs and traces”, you can be pretty damn sure there’s no actual observability going on. If there were, they’d be talking about that instead — it’s far more interesting! If there isn’t, they fall back to talking about whatever legacy products they do have, and that typically means, you guessed it: metrics, logs and traces.

❌ Metrics

Metrics in particular are actually quite hostile to observability. They are usually pre-aggregated, which means you are stuck with whatever questions you defined in advance, and even when they aren’t pre-aggregated they permanently discard the connective tissue of the request at write time, which destroys your ability to correlate issues across requests or track down any individual requests or drill down into a set of results — FOREVER.

Which doesn’t mean metrics aren’t useful! They are useful for many things! But they are useful for things like static dashboards, trend analysis over time, or monitoring that a dimension stays within defined thresholds. Not observability. (Liz would interrupt here and say that Google’s observability story involves metrics, and that is true — metrics with exemplars. But this type of solution is not available outside Google as far as we know..)

❌ Logs

Ditto logs. When I say “logs”, you think “unstructured strings, written out to disk haphazardly during execution, “many” log lines per request, probably contains 1-5 dimensions of useful data per log line, probably has a schema and some defined indexes for searching.” Logs are at their best when you know exactly what to look for, then you can go and find it.

Again, these connotations and assumptions are the opposite of observability’s requirements, which deals with highly structured data only. It is usually generated by instrumentation deep within the app, generally not buffered to local disk, issues a single event per request per service, is schemaless and indexless (or inferred schemas and autoindexed), and typically containing hundreds of dimensions per event.

❓ Traces

Traces? Now we’re getting closer. Tracing IS a big part of observability, but tracing just means visualizing events in order by time. It certainly isn’t and shouldn’t be a standalone product, that just creates unnecessary friction and distance. Hrmm … so what IS observability again, as applied to the software domain??

As a reminder, observability applied to software systems means having the ability to ask any question of your systems — understand any user’s behavior or subjective experience — without having to predict that question, behavior or experience in advance.

Observability is about unknown-unknowns.

At its core, observability is about these unknown-unknowns.

Plenty of tools are terrific at helping you ask the questions you could predict wanting to ask in advance. That’s the easy part. “What’s the error rate?” “What is the 99th percentile latency for each service?” “How many READ queries are taking longer than 30 seconds?”

Monitoring tools like DataDog do this — you predefine some checks, then set thresholds that mean ERROR/WARN/OK.
Logging tools like Splunk will slurp in any stream of log data, then let you index on questions you want to ask efficiently.
APM tools auto-instrument your code and generate lots of useful graphs and lists like “10 slowest endpoints”.

But if you *can’t* predict all the questions you’ll need to ask in advance, or if you *don’t* know what you’re looking for, then you’re in o11y territory.

This can happen for infrastructure reasons — microservices, containerization, polyglot storage strategies can result in a combinatorial explosion of components all talking to each other, such that you can’t usefully pre-generate graphs for every combination that can possibly degrade.
And it can happen — has already happened — to most of us for product reasons, as you’ll know if you’ve ever tried to figure out why a spike of errors was being caused by users on ios11 using a particular language pack but only in three countries, and only when the request hit the image export microservice running build_id 789782 if the user’s last name starts with “MC” and they then try to click on a particular button which then issues a db request using the wrong cache key for that shard.

Gathering the right data, then exploring the data.

Observability starts with gathering the data at the right level of abstraction, organized around the request path, such that you can slice and dice and group and look for patterns and cross-correlations in the requests.

To do this, we need to stop firing off metrics and log lines willynilly and be more disciplined. We need to issue one single arbitrarily-wide event per service per request, and it must contain the *full context* of that request. EVERYTHING you know about it, anything you did in it, all the parameters passed into it, etc. Anything that might someday help you find and identify that request.

Then, when the request is poised to exit or error the service, you ship that blob off to your o11y store in one very wide structured event per request per service.

In order to deliver observability, your tool also needs to support high cardinality and high dimensionality. Briefly, cardinality refers to the number of unique items in a set, and dimensionality means how many adjectives can describe your event. If you want to read more, here is an overview of the space, and more technical requirements for observability

You REQUIRE the ability to chain and filter as many dimensions as you want with infinitely high cardinality for each one if you’re going to be able to ask arbitrary questions about your unknown unknowns. This functionality is table stakes. It is non negotiable. And you cannot get it from any metrics or logs tool on the market today.

Why this matters.

Alright, this is getting pretty long. Let me tell you why I care so much, and why I want people like you specifically (referring to frontend engineers and folks earlier in their careers) to grok what’s at stake in the observability term wars.

We are way behind where we ought to be as an industry. We are shipping code we don’t understand, to systems we have never understood. Some poor sap is on call for this mess, and it’s killing them, which makes the software engineers averse to owning their own code in prod. What a nightmare.

Meanwhile developers readily admit they waste >40% of their day doing bullshit that doesn’t move the business forward. In large part this is because they are flying blind, just stabbing around in the dark.

We all just accept this. We shrug and say well that’s just what it’s like, working on software is just a shit salad with a side of frustration, it’s just the way it is.

But it is fucking not. It is un fucking necessary. If you instrument your code, watch it deploy, then ask “is it doing what I expect, does anything else look weird” as a habit? You can build a system that is both understandable and well-understood. If you can see what you’re doing, and catch errors swiftly, it never has to become a shitty hairball in the first place. That is a choice.

🌟 But **observability in the original technical sense is a necessary prerequisite to this better world**. 🌟

If you can’t break down by high cardinality dimensions like build ids, unique ids, requests, and function names and variables, if you cannot explore and swiftly skim through new questions on the fly, then you cannot inspect the intersection of (your code + production + users) with the specificity required to associate specific changes with specific behaviors. You can’t look where you are going.

Observability as I define it is like taking off the blindfold and turning on the light before you take a swing at the pinata. It is necessary, although not sufficient alone, to dramatically improve the way you build software. Observability as they define it gets you to … exactly where you already are. Which of these is a good use of a new technical term?

Do better.

And honestly, it’s the next generation who are best poised to learn the new ways and take advantage of them. Observability is far, far easier than the old ways and workarounds … but only if you don’t have decades of scar tissue and old habits to unlearn.

The less time you’ve spent using monitoring tools and ops workarounds, the easier it will be to embrace a new and better way of building and shipping well-crafted code.

Observability matters. You should care about it. And vendors need to stop trying to confuse people into buying the same old bullshit tools by smooshing them together and slapping on a new label. Exactly how long do they expect to fool people for, anyway?

Questionable Advice #2: How Do I Get My Team Into Observability?

December 17, 2019July 14, 2023 mipsytipsyadvice, culture, engineering, influence, management, monitoring, tech culture1 Comment

Welcome to the second installment of my advice column! Last time we talked about the emotional impact of going back to engineering after a stint in management. If you have a question you’d like to ask, please email me or DM it to me on twitter.

Hi Charity! I hope it’s ok to just ask you this…

I’m trying to get our company more aware of observability and I’m finding it difficult to convince people to look more into it. We currently don’t have the kind of systems that would require it much – but we will in future and I want us to be ahead of the game.

If you have any tips about how to explain this to developers (who are aware that quality is important but don’t always advocate for it / do it as much as I’d prefer), or have concrete examples of “here’s a situation that we needed observability to solve – and here’s how we solved it”, I’d be super grateful.

If this is too much to ask, let me know too 🙂

I’ve been talking to Abby Bangser a lot recently – and I’m “classifying” observability as “exploring in production” in my mental map – if you have philosophical thoughts on that, I’d also love to hear them 🙂

alex_schl

Dear Alex,

Everyone’s systems are broken. Not just yours!

Yay, what a GREAT note! I feel like I get asked some subset or variation of these questions several times a week, and I am delighted for the opportunity to both write up a response for you and post it for others to read. I bet there are orders of magnitude more people out there with the same questions who *don’t* ask, so I really appreciate those who do. <3

I want to talk about the nuts and bolts of pitching to engineering teams and shepherding technical decisions like this, and I promise I will offer you some links to examples and other materials. But first I want to examine some of the assumptions in your note, because they elegantly illuminate a couple of common myths and misconceptions.

Myth #1: you don’t need observability til you have problems of scale

First of all, there’s this misconception that observability is something you only need when you have really super duper hard problems, or that it’s only justified when you have microservices and large distributed systems or crazy scaling problems. No, no no nononono.

There may come a point where you are ABSOLUTELY FUCKED if you don’t have observability, but it is ALWAYS better to develop with it. It is never not better to be able to see what the fuck you are doing! The image in my head is of a hiker with one of those little headlamps on that lets them see where they’re putting their feet down. Most teams are out there shipping opaque, poorly understood code blindly — shipping it out to systems which are themselves crap snowballs of opaque, poorly understood code. This is costly, dangerous, and extremely wasteful of engineering time.

Ever seen an engineering team of 200, and struggled to understand how the product could possibly need more than one or two teams of engineers? They’re all fighting with the crap snowball.

Developing software with observability is better at ANY scale. It’s better for monoliths, it’s better for tiny one-person teams, it’s better for pre-production services, it’s better for literally everyone always. The sooner and earlier you adopt it, the more compounding value you will reap over time, and the more of your engineers’ time will be devoted to forward progress and creating value.

Myth #2: observability is harder and more technically advanced than monitoring

Actually, it’s the opposite — it’s much easier. If you sat a new grad down and asked them to instrument their code and debug a small problem, it would be fairly straightforward with observability. Observability speaks the native language of variables, functions and API endpoints, the mental model maps cleanly to the request path, and you can straightforwardly ask any question you can come up with. (A key tenet of observability is that it gives an engineer the ability to ask any question, without having had to anticipate it in advance.)

With metrics and logging libraries, on the other hand, it’s far more complicated.you have to make a bunch of awkward decisions about where to emit various types of statistics, and it is terrifyingly easy to make poor choices (with terminal performance implications for your code and/or the remote data source). When asking questions, you are locked in to asking only the questions that you chose to ask a long time ago. You spend a lot of time translating the relationships between code and lowlevel systems resources, and since you can’t break down by users/apps you are blocked from asking the most straightforward and useful questions entirely!

Doing it the old way Is. Fucking. Hard. Doing it the newer way is actually much easier, save for the fact that it is, well, newer — and thus harder to google examples for copy-pasta. But if you’re saturated in decades of old school ops tooling, you may have some unlearning to do before observability seems obvious to you.

Myth #3: observability is a purely technical solution

To be clear, you can just add an observability tool to your stack and go on about your business — same old things, same old way, but now with high cardinality!

You can, but you shouldn’t.

These are sociotechnical systems and they are best improved with sociotechnical solutions. Tools are an absolutely necessary and inextricable part of it. But so are on call rotations and the fundamental virtuous feedback loop of you build it, you run it. So are code reviews, monitoring checks, alerts, escalations, and a blameless culture. So are managers who allocate enough time away from the product roadmap to truly fix deep technical rifts and explosions, even when it’s inconvenient, so the engineers aren’t in constant monkeypatch mode.

I believe that observability is a prerequisite for any major effort to have saner systems, simply because it’s so powerful being able to see the impact of what you’ve done. In the hands of a creative, dedicated team, simply wearing a headlamp can be transformational.

Observability is your five senses for production.

You’re right on the money when you ask if it’s about exploring production, but you could also use words that are even more basic, like “understanding” or “inspecting”. Observability is to software systems as a debugger is to software code. It shines a light on the black box. It allows you to move much faster, with more confidence, and catch bugs much sooner in the lifecycle — before users have even noticed. It rewards you for writing code that is easy to illuminate and understand in production.

So why isn’t everyone already doing it? Well, making the leap isn’t frictionless. There’s a minimal amount of instrumentation to learn (easier than people expect, but it’s nonzero) and then you need to learn to see your code through the lens of your own instrumentation. You might need to refactor your use of older tools, such as metrics libraries, monitoring checks and log lines. You’ll need to learn another query interface and how it behaves on your systems. You might find yourself amending your code review and deploy processes a bit.

Nothing too terrible, but it’s all new. We hate changing our tool kits until absolutely fucking necessary. Back at Parse/Facebook, I actually clung to my sed/awk/shell wizardry until I was professionally shamed into learning new ways when others began debugging shit faster than I could. (I was used to being the debugger of last resort, so this really pissed me off.) So I super get it! So let’s talk about how to get your team aligned and hungry for change.

Okay okay okay already, how do I get my team on board?

If we were on the phone right now, I would be peppering you with a bunch of questions about your organization. Who owns production? Who is on call? Who runs the software that devs write? What is your deploy process, and how often does it get updated, and by who? Does it have an owner? What are the personalities of your senior folks, who made the decisions to invest in the current tools (and what are they), what motivates them, who are your most persuasive internal voices? Etc. Every team is different. <3

There’s a virtuous feedback loop you need to hook up and kickstart and tweak here, where the people with the original intent in their heads (software engineers) are also informed and motivated, i.e. empowered to make the changes and personally impacted when things are broken. I recommend starting by putting your software engineers on call for production (if you haven’t). This has a way of convincing even the toughest cases that they have a strong personal interest in quality and understandability.

Pay attention to your feedback loop and the alignment of incentives, and make sure your teams are given enough time to actually fix the broken things, and motivation usually isn’t a problem. (If it is, then perhaps another feedback loop is lacking: your engineers feeling sufficiently aligned with your users and their pain. But that’s another post.)

Technical ownership over technical outcomes

I appreciate that you want your team to own the technical decisions. I believe very strongly that this is the right way to go. But it doesn’t mean you can’t have influence or impact, and particularly in times like this.

It is literally your job to have your head up, scanning the horizon for opportunities and relevant threats. It’s their job to be heads down, focusing on creating and delivering excellent work. So it is absolutely appropriate for you to flag something like observability as both an opportunity and a potential threat, if ignored.

If I were in your situation and wanted my team to check out some technical concept, I might send around a great talk or two and ask folks to watch it, and then maybe schedule a lunchtime discussion. Or I might invite a tech luminary in to talk with the team, give a presentation and answer their questions. Or schedule a hack week to apply the concept to a current top problem, or something else of that nature.

But if I really wanted them to take it fucking seriously, I would put my thumb on the scale. I would find myself a champion, load them up with context, and give them ample time and space to skill up, prototype, and eventually present to the team a set of recommendations. (And I would stay in close contact with them throughout that period, to make sure they didn’t veer too far off course or lose sight of my goals.)

Get a champion.

Ideally you want to turn the person who is most invested in the old way of doing things — the person who owns the ELK cluster, say, or who was responsible for selecting the previous monitoring toolkit, or the goto person for ops questions — from your greatest obstacle into your proxy warrior. This only works if you know that person is open-minded and secure enough to give it a fair shot & publicly change course, has sufficiently good technical judgment to evaluate and project into the future, and has the necessary clout with their peers. If they don’t, or if they’re too afraid to buck consensus: pick someone else.
Give them context.

Take them for a long walk. Pour your heart and soul out to them. Tell them what you’ve learned, what you’ve heard, what you hope it can do for you, what you fear will happen if you don’t. It’s okay to get personal and to admit your uncertainties. The more context they have, the better the chance they will come out with an outcome you are happy with. Get them worried about the same things that worry you, get them excited about the same possibilities that excite you. Give them a sense of the stakes.

And don’t forget to tell them why you are picking them — because they are listened to by their peers, because they are already expert in the problem area, because you trust their technical judgment and their ability to evaluate new things — all the reasons for picking them will translate well into the best kind of flattery — the true kind.
Give them a deadline.

A week or two should be plenty. Most likely, the decision is not going to be unilaterally theirs (this also gives you a bit of wiggle room should they come back going “ah no ELK is great forever and ever”), but their recommendations should carry serious weight with the team and technical leadership. Make it clear what sort of outcome you would be very pleased with (e.g. a trial period for a new service) and what reasons you would find compelling for declining to pursue the project (i.e. your tech is unsupported, cost prohibitive, etc). Ideally they should use this time to get real production data into the services they are testing out, so they can actually experience and weigh the benefits, not just read the marketing copy.

As a rule of thumb, I always assume that managers can’t convince engineers to do things: only other engineers can. But what you can do instead is set up an engineer to be your champion. And then just sit quietly in the corner, nodding, with an interested look on your face.

The nuclear option

You have one final option. If there is no appropriate champion to be found, or insufficient time, or if you have sufficient trust with the team that you judge it the right thing to do: you can simply order them to do something your way. This can feel squicky. It’s not a good habit to get into. It usually results in things being done a bit slower, more reluctantly, more half-assedly. And you sacrifice some of your power every time you lean on your authority to get your team to do something.

But it’s just as bad for a leader to take it off the table entirely.

Sometimes you will see things they can’t. If you cannot wield your power when circumstances call for it, then you don’t fucking have real power — you have unilaterally disarmed yourself, to the detriment of your org. You can get away with this maybe twice a year, tops.

But here’s the thing: if you order something to be done, and it turns out in the end that you were right? You earn back all the power you expended on it plus interest. If you were right, unquestionably right in the eyes of the team, they will respect you more for having laid down the law and made sure they did the right thing.

charity

Some useful resources:

https://thenewstack.io/observability-a-3-year-retrospective/ a super meaty technical retrospective
https://www.honeycomb.io/blog/so-you-want-to-build-an-observability-tool/ the technical underpinnings of observability, and why they are different from monitoring (and why they matter)
https://www.heavybit.com/library/podcasts/o11ycast/ep-1-monitoring-vs-observability/ (a podcast, lots of folks have cited this as a great intro)
https://www.honeycomb.io/blog/incident-report-running-dry-on-memory-without-noticing/ one of our recent incident reports, with screenshots
https://charity.wtf/2019/10/28/deploys-its-not-actually-about-fridays/ includes some observations on what it’s liek to be working on a stack where we actually understand what the fuck we are shipping, we aren’t just accumulating blacker and blacker boxes with every depoy.

Love (and Alerting) in the Time of Cholera (and Observability)

September 20, 2019July 14, 2023 mipsytipsymonitoring, operations, paging & alerting, sre1 Comment

I made a vow this year to post one blog post a month, then I didn’t post anything at all from May to September. I have some catching up to do. 😑 I’ve also been meaning to transcribe some of the twitter rants that I end up linking back to into blog posts, so if there’s anything you especially want me to write about, tell me now while I’m in repentance mode.

This is one request I happened to make a note of because I can’t believe I haven’t already written it up! I’ve been saying the same thing over and over in talks and on twitter for years, but apparently never a blog post.

The question is: what is the proper role of alerting in the modern era of distributed systems? Has it changed? What are the updated best practices for alerting?

@mipsytipsy I've seen your thoughts on dashboards vs searching but haven't seen many thoughts from you on Alerting. Let me know if I've missed a blog somewhere on that! 🙂

— katy allred (@themindfulmommy) August 26, 2019

It’s a great question. I want to wax philosophically about some stuff, but first let me briefly outline the way to modernize your alerting best practices:

implement observability
implement SLOs and/or end-to-end checks that traverse key code paths and correlate to user-impacting events
create a secondary channel (tasks, ticketing system, whatever) for “things that on call should look at soon, but are not impacting users yet” which does not page anyone, but which on call is expected to look at (at least) first thing in the morning, last thing in the evening, and midday
move as many paging alerts as possible to the secondary channel, by engineering your services to auto-remediate or run in degraded mode until they can be patched up
wake people up only for SLOs and health checks that correlate to user-impacting events

Or, in an even shorter formulation: delete all your paging alerts, then page only on e2e alerts that mean users are in pain. Rely on debugging tools for debugging, and paging only when users are in pain.

To understand why I advocate deleting all your paging alerts, and when it’s safe to delete them, first we need to understand why have we accumulated so many crappy paging alerts over the years.

Monoliths, LAMP stacks, and death by pagebomb

Here, let’s crib a couple of slides from one of my talks on observability. Here are the characteristics of older monolithic LAMP-stack style systems, and best practices for running them:

The sad truth is, that when all you have is time series aggregates and traditional monitoring dashboards, you aren’t really debugging with science so much as you are relying on your gut and a handful of dashboards, using intuition and scraps of data to try and reconstruct an impossibly complex system state.

This works ok, as long as you have a relatively limited set of failure scenarios that happen over and over again. You can just pattern match from past failures to current data, and most of the time your intuition can bridge the gap correctly. Every time there’s an outage, you post mortem the incident, figure out what happened, build a dashboard “to help us find the problem immediately next time”, create a detailed runbook for how to respond to it, and (often) configure a paging alert to detect that scenario.

Over time you build up a rich library of these responses. So most of the time when you get paged you get a cluster of pages that actually serves to help you debug what’s happening. For example, at Parse, if the error graph had a particular shape I immediately knew it was a redis outage. Or, if I got paged about a high % of app servers all timing out in a short period of time, I could be almost certain the problem was due to mysql connections. And so forth.

Things fall apart; the pagebomb cannot stand

However, this model falls apart fast with distributed systems. There are just too many failures. Failure is constant, continuous, eternal. Failure stops being interesting. It has to stop being interesting, or you will die.

Instead of a limited set of recurring error conditions, you have an infinitely long list of things that almost never happen …. except that one time they do. If you invest your time into runbooks and monitoring checks, it’s wasted time if that edge case never happens again.

Frankly, any time you get paged about a distributed system, it should be a genuinely new failure that requires your full creative attention. You shouldn’t just be checking your phone, going “oh THAT again”, and flipping through a runbook. Every time you get paged it should be genuinely new and interesting.

Oh damn this talk looks baller. 😍 "Failure is important, but it is no longer interesting" — @this_hits_home… Netflix once again shining the light on where the rest of us need to get to over the next 3-5 years. 🙌🏅🎬 https://t.co/OY40Y0BTSa

— Charity Majors (@mipsytipsy) July 7, 2019

And thus you should actually have drastically fewer paging alerts than you used to.

A better way: observability and SLOs.

Instead of paging alerts for every specific failure scenario, the technically correct answer is to define your SLOs (service level objectives) and page only on those, i.e. when you are going to run out of budget ahead of schedule. But most people aren’t yet operating at this level of sophistication. (SLOs sound easy, but are unbelievably challenging to do well; many great teams have tried and failed. This is why we have built an SLO feature into Honeycomb that does the heavy lifting for you. Currently alpha testing with users.)

If you haven’t yet caught the SLO religion, the alternate answer is that “you should only page on high level end-to-end alerts, the ones which traverse the code paths that make you money and correspond to user pain”. Alert on the three golden signals: request rate, latency, and errors, and make sure to traverse every shard and/or storage type in your critical path.

That’s it. Don’t alert on the state of individual storage instances, or replication, or anything that isn’t user-visible.

(To be clear: by “alert” I mean “paging humans at any time of day or night”. You might reasonably choose to page people during normal work hours, but during sleepy hours most errors should be routed to a non-paging address. Only wake people up for actual user-visible problems.)

Here’s the thing. The reason we had all those paging alerts was because we depended on them to understand our systems.

Once you make the shift to observability, once you have rich instrumentation and the ability to swiftly zoom in from high level “there might be a problem” to identifying specifically what the errors have in common, or the source of the problem — you no longer need to lean on that scattershot bunch of pagebombs to understand your systems. You should be able to confidently ask any question of your systems, understand any system state — even if you have never encountered it before.

With observability, you debug by systematically following the trail of crumbs back to their source, whatever that is. Those paging alerts were a crutch, and now you don’t need them anymore.

Everyone is on call && on call doesn’t suck.

I often talk about how modern systems require software ownership. The person who is writing the software, who has the original intent in their head, needs to shepherd that code out into production and watch real users use it. You can’t chop that up into multiple roles, dev and ops. You just can’t. Software engineers working on highly available systems need to be on call for their code.

But the flip side of this responsibility belongs to management. If you’re asking everyone to be on call, it is your sworn duty to make sure that on call does not suck. People shouldn’t have to plan their lives around being on call. People shouldn’t have to expect to be woken up on a regular basis. Every paging alert out of hours should be as serious as a heart attack, and this means allocating real engineering resources to keeping tech debt down and noise levels low.

And the way you get there is first invest in observability, then delete all your paging alerts and start over from scratch.

It works. It really does. 🌈

Outsource Your O11y: Now Roll It Out And Keep Them Happy (part 3/3)

February 13, 2019March 7, 2023 mipsytipsyaws, monitoring, operations, outsourcing, security, vendors1 Comment

This is part three of a three-part series of guest posts:

How To Be A Champion, on how to choose a third-party vendor and champion them successfully to your security team. (George Chamales)
Get Aligned With Security, how to work with your security team to find the best possible outcome for all sides (Lilly Ryan)
Now Roll It Out And Keep Them Happy, on how to operationalize your service by rolling out the integration and maintaining it — and the relationship with your security team — over the long run (Andy Isaacson)

All this pain will someday be worth it. 🙏❤️ charity + friends

“Now Roll It Out And Keep Them Happy”

This is the third in a series of blog posts; previously we analyzed the security challenges of using a third party service, and we worked together with the security team to build empathy to deliver the project. You might want to read those first, since we are going to build on a lot of the ideas there to ship and maintain this integration.

Ready for launch

You’ve convinced the security team and other stakeholders, you’ve gotten the integration running, you’re getting promising results from dev-test or staging environments… now it’s time to move from proof-of-concept to full implementation. Depending on your situation this might be a transition from staging to production, or it might mean increasing a feature flipper flag from 5% to 100%, or it might mean increasing coverage of an integration from one API endpoint to cover your entire developer footprint.

Taking into account Murphy’s Law, we expect that some things will go wrong during the rollout. Perhaps during coverage, a developer realizes that the schema designed to handle the app’s event mechanism can’t represent a scenario, requiring a redesign or a hacky solution. Or perhaps the metrics dashboard shows elevated error rates from the API frontend, and while there’s no smoking gun, the ops oncall decides to rollback the integration Just In Case it’s causing the incident.

This gives us another chance to practice empathy — while it’s easy, wearing the champion hat, to dismiss any issues found by looking for someone to blame, ultimately this poisons trust within your organization and will hamper success. It’s more effective, in the long run (and often even in the short run), to find common ground with your peers in other disciplines and teams, and work through to solutions that satisfy everybody.

Keeping the lights on

In all likelihood as integration succeeds, the team will rapidly develop experts and expertise, as well as idiomatic ways to use the product. Let the experts surprise you; folks you might not expect can step up when given a chance. Expertise flourishes when given guidance and goals; as the team becomes comfortable with the integration, explicitly recognize a leader or point person for each vendor relationship. Having one person explicitly responsible for a relationship lets them pay attention to those vendor emails, updates, and avoid the tragedy of the “but I thought *you* were” commons. This Integration Lead is also a center of knowledge transfer for your organization — they won’t know everything or help every user come up to speed, but they can help empower the local power users in each team to ramp up their teams on the integration.

As comfort grows you will start to consider ways to change your usage, for example growing into new kinds of data. This is a good time to revisit that security checklist — does the change increase PII exposure to your vendor? Would the new data lead to additional requirements such as per-field encryption? Don’t let these security concerns block you from gaining valuable insight using the new tool, but do take the chance to talk it over with your security experts as appropriate.

Throughout this organic growth, the Integration Lead remains core to managing your changing profile of usage of the vendor they shepherd; as new categories of data are added to the integration, the Lead has responsibility to ensure that the vendor relationship and risk profile are well matched to the needs that the new usage (and presumably, business value) is placing on the relationship.

Documenting the Intergation Lead role and responsibilities is critical. The team should know when to check in, and writing it down helps it happen. When new code has a security implication, or a new use case potentially amplifies the cost of an integration, bringing the domain expert in will avoid unhappy surprises. Knowing how to find out who to bring in, and when to bring them in, will keep your team getting the right eyes on their changes.

Security threats and other challenges change over time, too. Collaborating with your security team so that they know what systems are in use helps your team take note of new information that is relevant to your business. A simple example is noting when your vendors publish a breach announcement, but more complex examples happen too — your vendor transitions cloud providers from AWS to Azure and the security team gets an alert about unexpected data flows from your production cluster; with transparency and trust such events become part of a routine process rather than an emergency.

It’s all operational

Monitoring and alerting is a fact of operations life, and this has to include vendor integrations (even when the vendor integration is a monitoring product.) All of your operations best practices are needed here — keep your alerts clean and actionable so that you don’t develop pager fatigue, and monitor performance of the integration so that you don’t get blindsided by a creeping latency monster in your APIs.

Authentication and authorization are changing as the threat landscape evolves and industry moves from SMS verification codes to U2F/WebAuthn. Does your vendor support your SSO integration? If they can’t support the same SSO that you use everywhere else and can’t add it — or worse, look confused when you mention SSO — that’s probably a sign you should consider a different vendor.

A beautiful sunset

Have a plan beforehand for what needs to be done should you stop using the service. Got any mobile apps that depend on APIs that will go away or start returning permission errors? Be sure to test these scenarios ahead of time.

What happens at contract termination to data stored on the service? Do you need to explicitly delete data when ceasing use?

Do you need to remove integrations from your systems before ending the commercial relationship, or can the technical shutdown and business shutdown run in parallel?

In all likelihood these are contingency plans that will never be needed, and they don’t need to be fully fleshed out to start, but a little bit of forethought can avoid unpleasant surprises.

Year after year

Industry best practice and common sense dictate that you should revisit the security questionnaire annually (if not more frequently). Use this chance to take stock of the last year and check in — are you getting value from the service? What has changed in your business needs and the competitive landscape?

It’s entirely possible that a new year brings new challenges, which could make your current vendor even more valuable (time to negotiate a better contract rate!) or could mean you’d do better with a competing service. Has the vendor gone through any major changes? They might have new offerings that suit your needs well, or they may have pivoted away from the features you need.

Check in with your friends on the security team as well; standards evolve, and last year’s sufficient solution might not be good enough for new requirements.

Andy thinks out loud about security, society, and the problems with computers on Twitter.

❤️ Thanks so much reading, folks. Please feel free to drop any complaints, comments, or additional tips to us in the comments, or direct them to me on twitter.

Have fun! Stay (a little bit) Paranoid!!

— charity

Outsource Your O11y: Get Aligned With Security (part 2/3)

February 13, 2019February 13, 2019 mipsytipsyaws, engineers, monitoring, operations, outsourcing, security, sre, vendors1 Comment

This is part two of a three-part series of guest posts:

How To Be A Champion, on how to choose a third-party vendor and champion them successfully to your security team. (George Chamales)
Get Aligned With Security, how to work with your security team to find the best possible outcome for all sides (Lilly Ryan)
Now Roll It Out And Keep Them Happy, on how to operationalize your service by rolling out the integration and maintaining it — and the relationship with your security team — over the long run (Andy Isaacson)

All this pain will someday be worth it. 🙏❤️ charity + friends

“Get Aligned With Security”

by Lilly Ryan

If your team has decided on a third-party service to help you gather data and debug product issues, how do you convince an often overeager internal security team to help you adopt it?

When this service is something that provides a pathway for developers to access production data, as analytics tools often do, making the case for access to that data can screech to a halt at the mention of the word “production”. Progressing past that point will take time, empathy, and consideration.

I have been on both sides of the “adopting a new service” fence: as a developer hoping to introduce something new and useful to our stack, and now as a security professional who spends her days trying to bust holes in other people’s setups. I understand both sides of the sometimes-conflicting needs to both ship software and to keep systems safe.

This guide has advice to help you solve the immediate problem of choosing and deploying a third-party service with the approval of your security team. But it also has advice for how to strengthen the working relationship between your security and development teams over the longer term. No two companies are the same, so please adapt these ideas to fit your circumstances.

Understanding the security mindset

The biggest problems in technology are never really about technology, but about people. Seeing your security team as people and understanding where they are coming from will help you to establish empathy with them so that both of you want to help each other get what you want, not block each other.

First, understand where your security team is coming from. Development teams need to build features, improve the product, understand and ship good code. Security teams need to make sure you don’t end up on the cover of the NYT for data breaches, that your business isn’t halted by ransomware, and that you’re not building your product on a vulnerable stack.

This can be an unfamiliar frame of mind for developers. Software development tends to attract positive-minded people who love creating things and are excited about the possibilities of new technology. Software security tends to attract negative thinkers who are skilled at finding all the flaws in a system. These are very different mentalities, and the people who occupy them tend to have very different assumptions, vocabularies, and worldviews.

But if you and your security team can’t share the same worldview, it will be hard to trust each other and come to agreement. This is where practicing empathy can be helpful.

Before approaching your security team with your request to approve a new vendor, you may want to run some practice exercises for putting yourselves in their shoes and forcing yourselves to deliberately cultivate a negative thinking mindset to experience how they may react — not just in terms of the objective risk to the business, or the compliance headaches it might cause, but also what arguments might resonate with them and what emotional reactions they might have.

My favourite exercise for getting teams to think negatively is what I call the Land Astronaut approach.

The “Land Astronaut” Game

Imagine you are an astronaut on the International Space Station. Literally everything you do in space has death as a highly possible outcome. So astronauts spend a lot of time analysing, re-enacting, and optimizing their reactions to events, until it becomes muscle memory. By expecting and training for failure, astronauts use negative thinking to anticipate and mitigate flaws before they happen. It makes their chances of survival greater and their people ready for any crisis.

Your project may not be as high-stakes as a space mission, and your feet will most likely remain on the ground for the duration of your work, but you can bet your security team is regularly indulging in worst-case astronaut-type thinking. You and your team should try it, too.

The Game:

Pick a service for you and your team to game out. Schedule an hour, book a room with a whiteboard, put on your Land Astronaut helmets. Then tell your team to spend half an hour brainstorming about all the terrible things that can happen to that service, or to the rest of your stack when that service is introduced. Negative thoughts only!

Start brainstorming together. Start out by being as outlandish as possible (what happens if their data centre is suddenly overrun by a stampede of elephants?). Eventually you will find that you’ll tire of the extreme worst case scenarios and come to consider more realistic outcomes — some of which which you may not have thought of outside of the structure of the activity.

After half an hour, or whenever you feel like you’re all done brainstorming, take off your Land Astronaut helmets, sift out the most plausible of the worst case scenarios, and try to come up with answers or strategies that will help you counteract them. Which risks are plausible enough that you should mitigate them? Which are you prepared to gamble on never happening? How will this risk calculus change as your company grows and takes on more exposure?

Doing this with your team will allow you all to practice the negative thinking mindset together and get a feel for how your colleagues in the security team might approach this request. (While this may seem similar to threat modelling exercises you might have done in the past, the focus here is on learning to adopt a security mindset and gaining empathy for this thought process, rather than running through a technical checklist of common areas of concern.)

While you still have your helmets within reach, use your negative thinking mindset to fill out the spreadsheet from the first piece in this series. This will help you anticipate most of the reasonable objections security might raise, and may help you include useful detail the security team might not have known to ask for.

Once you have prepared your list of answers to George’s worksheet and held a team Land Astronaut session together, you will have come most of the way to getting on board with the way your security team thinks.

Preparing for compromise

You’ve considered your options carefully, you’ve learned how to harness negative thinking to your advantage, and you’re ready to talk to your colleagues in security – but sometimes, even with all of these tools at your disposal, you may not walk away with all of the things you are hoping for.

Being willing to compromise and anticipating some of those compromises before you approach the security team will help you negotiate more successfully.

While your Land Astronaut helmets are still within reach, consider using your negative thinking mindset game to identify areas where you may be asked to compromise. If you’re asking for production access to this new service for observability and debugging purposes, think about what kinds of objections may be raised about this and how you might counter them or accommodate them. Consider continuing the activity with half of the team remaining in the Land Astronaut role while the other half advocates from a positive thinking standpoint. This dynamic will get you having conversations about compromise early on, so that when the security team inevitably raises eyebrows, you are ready with answers.

Be prepared to consider compromises you had not anticipated, and enter into discussions with the security team with as open a mind as possible. Remember the team is balancing priorities of not only your team, but other business and development teams as well. If you and your security colleagues are doing the hard work to meet each other halfway then you are more likely to arrive at a solution that satisfies both parties.

Working together for the long term

While the previous strategies we’ve covered focus on short-term outcomes, in this continuous-deployment, shift-left world we now live in, the best way to convince your security team of the benefits of a third-party service – or any other decision – is to have them along from day one, as part of the team.

Roles and teams are increasingly fluid and boundary-crossing, yet security remains one of the roles least likely to be considered for inclusion on a software development team. Even in 2019, the task of ensuring that your product and stack are secure and well-defended is often left until the end of the development cycle. This contributes a great deal to the combative atmosphere that is common.

Bringing security people into the development process much earlier builds rapport and prevents these adversarial, territorial dynamics. Consider working together to build Disaster Recovery plans and coordinating for shared production ownership.

If your organisation isn’t ready for that kind of structural shift, there are other ways to work together more closely with your security colleagues.

Try having members of your team spend a week or two embedded with the security team. You may even consider a rolling exchange – a developer for a security team member – so that developers build the security mindset, and the security team is able to understand the problems your team is facing (and why you are looking at introducing this new service).

At the very least, you should make regular time to meet with the security team, get to know them as people, and avoid springing things on them late in the project when change is hardest.

Riding off together into the sunset…?

If you’ve taken the time to get to know your security team and how they think, you’ll hopefully be able to get what you want from them – or perhaps you’ll understand why their objections were valid, and come up with a better solution that works well for both of you.

Investing in a strong relationship between your development and security teams will rarely lead to the apocalypse. Instead, you’ll end up with a better product, probably some new work friends, and maybe an exciting idea for a boundary-crossing new career in tech.

But this story isn’t over! Once you get the green light from security, you’ll need to think about how to roll your new service out safely, maintain it, and consider its full lifespan within your company. Which leads us to part three of this series, on rolling it out and maintaining it … both your integration and your relationship with the security team.

Lilly Ryan is a pen tester, Python wrangler, and recovering historian from Melbourne. She writes and speaks internationally about ethical software, social identities after death, teamwork, and the telegraph. More recently she has researched the domestic use of arsenic in Victorian England, attempted urban camouflage, reverse engineered APIs, wielded the Oxford comma, and baked a really good lemon shortbread.

Logs vs Structured Events

February 5, 2019July 14, 2023 mipsytipsyevents, logs, monitoring, operations, serverless, sre10 Comments

I got an interesting tweet the other day from @evntdrvn in response to this thread of mine. Paraphrasing,

“So I’ve almost got our group at work up to Step 1 in your observability maturity model, but some of the devs that I work with want to turn OFF our lovely structured logging in prod for informational-level msgs due to their legacy philosophy (‘we only log errors in prod’). The reasons given are mostly philosophical (“I’m a dev and only interested when things error out, I don’t want any other noise in prod logs”, “I don’t want to slow my app down in prod”). Help?!?”

As I was reading this, I was itching to fly out and dive into battle with Eric. I know exactly where his opinionated devs are coming from. I used to say the same things! I even wrote a whole blog post about it.

These developers have internalized a set of rules and best practices for dealing with output data, in the context of “monolith application development in the early 2000s”.

Monolithic systems assumptions

Those systems had many common constraints and assumptions, such as:

We have a monolith service, or a very small number of services. We can model the system in our heads.
Logging is done to local disk, which can impact performance
Disks are expensive
Log lines are spat out inline with execution. A poorly placed printf can take the whole system down.
Investigation is rare, and usually means a human reading error logs.
Logging is of poor utility for understanding internal states or execution paths; you should just read the code or use a debugger. (There are few or network hops between functions.)
Logging is mostly useful for detecting certain terminal crash states or connection errors.

Monolithic logging best practices

Therefore:

We should be very stingy in what we log
Debuggers should be used for understanding internal states of the code
Logs are a last resort and record of crash dumps. We do not expect to use log data in the course of our daily work. We assume log-related manual investigation will be infrequent and of limited utility.

These were exactly the right lessons to learn in the era of expensive hardware and monolithic repos/artifacts. Many people still work in environments like this, and follow logging best practices like these. God bless, more power to em.

Distributed systems assumptions

But more and more of us face systems that are very different.

We have many services, possibly many MANY services. A representative request will have “many” hops across “many” services and routers and proxies and meshes and storage systems.
We cannot model the system in our heads; it would be a mistake to try. We rely on tooling as the source of truth for those systems.
You may or may not have access to those services, or the systems your code runs on. There may or may not be a logging facility, or a centralized log aggregator. Your only view of the system is through the instrumentation of your code.
Disks and system resources are cheap, ephemeral, all but disposable.
Data services are similarly cheap. We can almost entirely silo application performance off from the cost of writing perf data out.
Investigation is prohibitively slow and expensive for a human to do by hand. Many of the nodes or processes we need to inspect may no longer even exist, but their past states may still be relevant to us in understanding patterns to the present time.
Investigation should usually be done distributedly, across all instantiations of your code, however many there might be — and in real time
Investigation requires computation — not just string search. We need to ask on the fly involving math and percentiles and breakdowns and group by’s. And we need access to the raw requests in order to run accurate computations — no pre-aggregates.
The hardest part isn’t usually debugging the code, it’s figuring out where is the code you need to debug. Or what the errors or outliers have in common from the perspective of the code. Fixing the code itself is often comparatively trivial, once found.
What even is ‘logging’?
What even is ‘local disk’?

This isn’t optional: at some point of complexity or scale or distributedness, it becomes necessary if you want to work with these systems.

Logs can’t help you here.

And you aren’t going to get that kind of explorable data out of loglevel:ERROR, or by chopping up your telemetry into disconnected metrics devoid of context.

You are only going to get this kind of explorable, ad hoc, computation-friendly data if you take a radically new approach to how you output and aggregate telemetry. You’re going to need to replace your log lines and log levels with a different sort of beast: arbitrarily wide structured events that describe the request and its context, one event per request per service.

If it helps, don’t think of them as log files any more. Think of them as events. Yes, you can stash this stream in a file, but why would you? on what disk? will that work for your serverless functions too? Just stream them over the network to wherever you want to put them.

Log levels are another confusing and unnecessary artifact of yesteryear that you no longer really need. The more you think of structured events as logs, the more tempted you may be to apply the old set of best practices. So just don’t think of them as logs at all.

How to gather and structure your data

Instead of dribbling little pebbles of log effluvia throughout your code, do this. (If you’re a honeycomb user, our beelines do it all automatically for you *and* pre-propagate the blobs with everything we know of your context.)

Initialize an empty blob at the beginning, when the request first enters the service.
Stuff any and all interesting detail about the request into that blob throughout the lifetime of the request.
- Any unique id, any high-cardinality variable, any headers passed in, every full query, normalized query, and query execution time; every http call out to a remote service, every http execution time; any shopping cart id, first and last name, execution time — literally anything interesting, append to blob.
Then, when the request is about to exit or error, write the blob off to honeycomb or another service or disk somewhere.

You can see immediately how this method has radically different performance implications and risks than the earlier shotgun spray approach. No more “oops i accidentally put a print line INSIDE a for loop”. The write amplification profile is compressed. Most importantly, the incremental cost of capturing more detail about the request per service is nearly zero.

And now you have the kind of structured data that you can feed into something like a columnar store, or honeycomb, and run ad hoc queries to your heart’s delight.

Distributed systems logging events best practices:

Let’s sum up. (I’m including links to other past rants on this topic):

Emit a rich record from the perspective of the request as it executes the code. Include all the context you can get your paws on.
Emit a single event per request per service that it hits. Write it out just before the request errors or exits the service.
Bypass local disk entirely, write to a remote service.
Sample if needed for cost or resource constraints. Practice dynamic sampling.
Treat this like operational data, not transactional data. Be profligate and disposable.
Feed this data into a columnar store or honeycomb or similar
Now use it every day. Not just as a last resort. Get knee deep in production every single day. Explore. Ask and answer rich questions about your systems, system quality, system behavior, outliers, error conditions, etc. You will be absolutely amazed how useful it is … and appalled by what you turn up. 🙂

Just think.

No more doing multi-line regexps trying to look for the same request ID or user ID doing five suspicious things in a row.

No more regexps at all, for fuck’s sake.

No more bullshit percentiles that were computed at write time by averaging over a bunch of other averages

No more having to jump around from dashboards to logs trying to vainly eyeball correlate one spike with another. No more wondering why no two tools can agree if anything even exists or not

Just gather the detail you need to ask the questions when you need them, and store it in a single source of truth. It’s that simple.

No need to shame people from learning best practices that worked perfectly well for a long time. You can either let them learn the hard way that this transformation is non optional, or you can help them learn the easy way that it’s simply much better and easier to invest in this telemetry up front. You seem like a nice enough chap, which is probably why you chose door 2. (If you wanted to get tougher about it, have a few reformed folks in to tell their horror stories. Try some ex-twitter engineers.)

The hardest part seems to be getting people to unlearn all the best practices they once learned for dealing with logs. So just don’t call it logs anymore, if that helps. Call it “structured events”.

– charity.