Observability is a Many-Splendored Definition

Last weekend, @swyx posted a great little primer on instrumentation titled “Observability Tools in JavaScript”.  A friend sent me the link and suggested that I might want to respond and clarify some things about observability, so I did, and we had a great conversation!  Here is a lightly edited transcript of my reply tweet storm.

First of all, confusion over terminology is understandable, because there are some big players out there actively trying to confuse you!  Big Monitoring is indeed actively trying to define observability down to “metrics, logs and traces”.  I guess they have been paying attention to the interest heating up around observability, and well… they have metrics, logs, and tracing tools to sell?  So they have hopped on the bandwagon with some undeniable zeal.

But metrics, logs and traces are just data types.  Which, by themselves, have nothing to do with observability.  Let me explain the difference, and why I think you should care about this.

“Observability? I do not think it means what you think it means.”

Observability is a borrowed term from mechanical engineering/control theory.  It means, paraphrasing: “can you understand what is happening inside the system — can you understand ANY internal state the system may get itself into, simply by asking questions from the outside?”  We can apply this concept to software in interesting ways, and we may end up using some data types, but that’s putting the cart before the horse.

It’s a bit like saying that “database replication means structs, longints and elegantly diagrammed English sentences.”  Er, no.. yes.. missing the point much?

This is such a reliable bait and switch that any time you hear someone talking about “metrics, logs and traces”, you can be pretty damn sure there’s no actual observability going on.  If there were, they’d be talking about that instead — it’s far more interesting!  If there isn’t, they fall back to talking about whatever legacy products they do have, and that typically means, you guessed it: metrics, logs and traces.

❌ Metrics

Metrics in particular are actually quite hostile to observability.  They are usually pre-aggregated, which means you are stuck with whatever questions you defined in advance.  And even when they aren’t pre-aggregated, they permanently discard the connective tissue of the request at write time, which destroys your ability to correlate issues across requests, track down individual requests, or drill down into a set of results — FOREVER.

Which doesn’t mean metrics aren’t useful!  They are useful for many things: static dashboards, trend analysis over time, or monitoring that a dimension stays within defined thresholds.  Just not observability.  (Liz would interrupt here and say that Google’s observability story involves metrics, and that is true — metrics with exemplars.  But this type of solution is not available outside Google as far as we know.)
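To make that write-time loss concrete, here is a small hypothetical sketch in TypeScript.  The `metrics` client and every field name are made up for illustration; the point is what a counter keeps versus what it throws away.

```typescript
// Hypothetical sketch: what a pre-aggregated metric keeps vs. what it discards.
// `metrics` is a stand-in for any counter-style metrics client, not a real library.
const metrics = {
  increment: (_name: string) => {
    /* pretend this ships a counter to a metrics backend */
  },
};

// At write time, the metric collapses the whole request down to a name and a number:
metrics.increment("checkout.errors"); // one more error... and that is all that survives

// The request that caused it carried all of this connective tissue:
const failedRequest = {
  user_id: "u_82731",          // which user?
  build_id: "789782",          // which build?
  country: "NZ",               // where?
  endpoint: "/api/checkout",   // doing what?
  duration_ms: 1843,           // how slowly?
  error: "cache key mismatch", // failing how?
};
console.log(failedRequest);
// Once only the counter is stored, "what did those errors have in common?"
// can never be answered -- that context was discarded at write time.
```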

❌ Logs

Ditto logs.  When I say “logs”, you think: unstructured strings, written out to disk haphazardly during execution; “many” log lines per request; probably 1-5 dimensions of useful data per log line; probably a schema and some defined indexes for searching.  Logs are at their best when you know exactly what to look for; then you can go and find it.

Again, these connotations and assumptions are the opposite of observability’s requirements, which deal with highly structured data only.  That data is usually generated by instrumentation deep within the app, generally not buffered to local disk; it is issued as a single event per request per service, is schemaless and indexless (or has inferred schemas and is autoindexed), and typically contains hundreds of dimensions per event.
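To make the contrast concrete, here is a hypothetical side-by-side in TypeScript: the same failed request rendered first as classic log lines, then as the single wide event observability wants.  Every field name here is invented for illustration, not a required schema.

```typescript
// Classic logging: many loosely structured lines per request, a few dimensions each,
// scattered across the lifetime of the request.
console.log("[INFO] request started path=/api/export user=u_82731");
console.log("[INFO] calling image-export service");
console.log("[WARN] image-export slow: 900ms");
console.log("[ERROR] export failed: cache key mismatch");

// Observability-style: one arbitrarily wide structured event per request per service,
// emitted once, carrying everything known about that request.
const wideEvent = {
  service: "image-export",
  trace_id: "abc123",
  user_id: "u_82731",
  build_id: "789782",
  path: "/api/export",
  locale: "fr-CA",
  image_service_duration_ms: 900,
  db_query_normalized: "SELECT * FROM exports WHERE user_id = ?",
  error: "cache key mismatch",
  duration_ms: 1240,
  // ...plus hundreds more dimensions, if you have them
};
console.log(JSON.stringify(wideEvent));
```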

❓ Traces

Traces?  Now we’re getting closer.  Tracing IS a big part of observability, but tracing just means visualizing events ordered by time.  It certainly isn’t and shouldn’t be a standalone product; splitting it off just creates unnecessary friction and distance.  Hrmm … so what IS observability again, as applied to the software domain??

As a reminder, observability applied to software systems means having the ability to ask any question of your systems — understand any user’s behavior or subjective experience — without having to predict that question, behavior or experience in advance.

Observability is about unknown-unknowns.

At its core, observability is about these unknown-unknowns.

Plenty of tools are terrific at helping you ask the questions you could predict wanting to ask in advance.  That’s the easy part.  “What’s the error rate?”  “What is the 99th percentile latency for each service?”  “How many READ queries are taking longer than 30 seconds?”

  • Monitoring tools like DataDog do this — you predefine some checks, then set thresholds that mean ERROR/WARN/OK.
  • Logging tools like Splunk will slurp in any stream of log data, then let you index on questions you want to ask efficiently.
  • APM tools auto-instrument your code and generate lots of useful graphs and lists like “10 slowest endpoints”.

But if you *can’t* predict all the questions you’ll need to ask in advance, or if you *don’t* know what you’re looking for, then you’re in o11y territory.

  • This can happen for infrastructure reasons — microservices, containerization, and polyglot storage strategies can result in a combinatorial explosion of components all talking to each other, such that you can’t usefully pre-generate graphs for every combination that can possibly degrade.
  • And it can happen — has already happened — to most of us for product reasons, as you’ll know if you’ve ever tried to figure out why a spike of errors was being caused by users on ios11 using a particular language pack but only in three countries, and only when the request hit the image export microservice running build_id 789782 if the user’s last name starts with “MC” and they then try to click on a particular button which then issues a db request using the wrong cache key for that shard.

Gathering the right data, then exploring the data.

Observability starts with gathering the data at the right level of abstraction, organized around the request path, such that you can slice and dice and group and look for patterns and cross-correlations in the requests.

To do this, we need to stop firing off metrics and log lines willy-nilly and be more disciplined.  We need to issue one single arbitrarily-wide event per service per request, and it must contain the *full context* of that request: EVERYTHING you know about it, anything you did in it, all the parameters passed into it, etc.  Anything that might someday help you find and identify that request.

Then, when the request is about to exit the service or error out, you ship that blob off to your o11y store as one very wide structured event per request per service.

In order to deliver observability, your tool also needs to support high cardinality and high dimensionality.  Briefly, cardinality refers to the number of unique items in a set, and dimensionality means how many adjectives can describe your event.  If you want to read more, here is an overview of the space, along with more technical requirements for observability.

You REQUIRE the ability to chain and filter as many dimensions as you want, with infinitely high cardinality for each one, if you’re going to be able to ask arbitrary questions about your unknown-unknowns.  This functionality is table stakes.  It is non-negotiable.  And you cannot get it from any metrics or logs tool on the market today.
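As a toy illustration of what “chaining and filtering arbitrary dimensions” means, here is a deliberately naive, in-memory version in TypeScript.  A real observability store runs this kind of query over columnar data at scale; the field names (build_id, user_last_name, country) are hypothetical.

```typescript
// A wide event is just a bag of dimensions, many of them high-cardinality.
type WideEvent = Record<string, string | number | boolean | undefined>;

// Chain arbitrary filters over any dimensions, then group by any other dimension,
// all decided at read time -- not at write time.
function investigate(events: WideEvent[]): Map<string, number> {
  const suspicious = events.filter(
    (e) =>
      e.error !== undefined &&
      e.build_id === "789782" && // one specific build
      String(e.user_last_name ?? "").toLowerCase().startsWith("mc") &&
      e.service === "image-export"
  );

  const errorsByCountry = new Map<string, number>();
  for (const e of suspicious) {
    const country = String(e.country ?? "unknown");
    errorsByCountry.set(country, (errorsByCountry.get(country) ?? 0) + 1);
  }
  return errorsByCountry;
}
```

If only counters or pre-indexed log fields had been stored, none of these filters would even be expressible after the fact.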

Why this matters.

Alright, this is getting pretty long. Let me tell you why I care so much, and why I want people like you specifically (referring to frontend engineers and folks earlier in their careers) to grok what’s at stake in the observability term wars.

We are way behind where we ought to be as an industry. We are shipping code we don’t understand, to systems we have never understood. Some poor sap is on call for this mess, and it’s killing them, which makes the software engineers averse to owning their own code in prod.  What a nightmare.

Meanwhile developers readily admit they waste >40% of their day doing bullshit that doesn’t move the business forward.  In large part this is because they are flying blind, just stabbing around in the dark.

We all just accept this.  We shrug and say well that’s just what it’s like, working on software is just a shit salad with a side of frustration, it’s just the way it is.

But it is fucking not.  It is un fucking necessary.  If you instrument your code, watch it deploy, then ask “is it doing what I expect, does anything else look weird” as a habit?  You can build a system that is both understandable and well-understood.  If you can see what you’re doing, and catch errors swiftly, it never has to become a shitty hairball in the first place.  That is a choice.

🌟 But observability in the original technical sense is a necessary prerequisite to this better world. 🌟

If you can’t break down by high cardinality dimensions like build ids, unique ids, requests, and function names and variables, if you cannot explore and swiftly skim through new questions on the fly, then you cannot inspect the intersection of (your code + production + users) with the specificity required to associate specific changes with specific behaviors.  You can’t look where you are going.

Observability as I define it is like taking off the blindfold and turning on the light before you take a swing at the pinata.  It is necessary, although not sufficient alone, to dramatically improve the way you build software.  Observability as they define it gets you to … exactly where you already are.  Which of these is a good use of a new technical term?

 

Do better.

And honestly, it’s the next generation who are best poised to learn the new ways and take advantage of them. Observability is far, far easier than the old ways and workarounds … but only if you don’t have decades of scar tissue and old habits to unlearn.

The less time you’ve spent using monitoring tools and ops workarounds, the easier it will be to embrace a new and better way of building and shipping well-crafted code.

Observability matters.  You should care about it.  And vendors need to stop trying to confuse people into buying the same old bullshit tools by smooshing them together and slapping on a new label.  Exactly how long do they expect to fool people for, anyway?


Logs vs Structured Events

I got an interesting tweet the other day from @evntdrvn in response to this thread of mine. Paraphrasing,

“So I’ve almost got our group at work up to Step 1 in your observability maturity model, but some of the devs that I work with want to turn OFF our lovely structured logging in prod for informational-level msgs due to their legacy philosophy (‘we only log errors in prod’). The reasons given are mostly philosophical (“I’m a dev and only interested when things error out, I don’t want any other noise in prod logs”, “I don’t want to slow my app down in prod”). Help?!?”

As I was reading this, I was itching to fly out and dive into battle with Eric. I know exactly where his opinionated devs are coming from. I used to say the same things! I even wrote a whole blog post about it.

These developers have internalized a set of rules and best practices for dealing with output data, in the context of “monolith application development in the early 2000s”.

Monolithic systems assumptions

Those systems had many common constraints and assumptions, such as:

  • We have a monolith service, or a very small number of services. We can model the system in our heads.
  • Logging is done to local disk, which can impact performance
  • Disks are expensive
  • Log lines are spat out inline with execution.  A poorly placed printf can take the whole system down.
  • Investigation is rare, and usually means a human reading error logs.
  • Logging is of poor utility for understanding internal states or execution paths; you should just read the code or use a debugger.  (There are few or no network hops between functions.)
  • Logging is mostly useful for detecting certain terminal crash states or connection errors.

Monolithic logging best practices

Therefore:

  • We should be very stingy in what we log
  • Debuggers should be used for understanding internal states of the code
  • Logs are a last resort and record of crash dumps.  We do not expect to use log data in the course of our daily work.  We assume log-related manual investigation will be infrequent and of limited utility.

These were exactly the right lessons to learn in the era of expensive hardware and monolithic repos/artifacts. Many people still work in environments like this, and follow logging best practices like these. God bless, more power to em.

Distributed systems assumptions

But more and more of us face systems that are very different.

  • We have many services, possibly many MANY services. A representative request will have “many” hops across “many” services and routers and proxies and meshes and storage systems.
  • We cannot model the system in our heads; it would be a mistake to try. We rely on tooling as the source of truth for those systems.
  • You may or may not have access to those services, or the systems your code runs on. There may or may not be a logging facility, or a centralized log aggregator. Your only view of the system is through the instrumentation of your code.
  • Disks and system resources are cheap, ephemeral, all but disposable.
  • Data services are similarly cheap.  We can almost entirely silo application performance off from the cost of writing perf data out.
  • Investigation is prohibitively slow and expensive for a human to do by hand. Many of the nodes or processes we need to inspect may no longer even exist, but their past states may still be relevant to us in understanding patterns to the present time.
  • Investigation should usually be done in a distributed fashion, across all instantiations of your code, however many there might be — and in real time.
  • Investigation requires computation — not just string search. We need to ask questions on the fly involving math and percentiles and breakdowns and group-bys.  And we need access to the raw requests in order to run accurate computations — no pre-aggregates.
  • The hardest part usually isn’t debugging the code, it’s figuring out where the code you need to debug even lives. Or what the errors or outliers have in common from the perspective of the code.  Fixing the code itself is often comparatively trivial, once found.
  • What even is ‘logging’?
  • What even is ‘local disk’?

This shift isn’t optional: past a certain point of complexity or scale or distributedness, it becomes necessary if you want to work with these systems.

Logs can’t help you here.

And you aren’t going to get that kind of explorable data out of loglevel:ERROR, or by chopping up your telemetry into disconnected metrics devoid of context.

You are only going to get this kind of explorable, ad hoc, computation-friendly data if you take a radically new approach to how you output and aggregate telemetry.  You’re going to need to replace your log lines and log levels with a different sort of beast: arbitrarily wide structured events that describe the request and its context, one event per request per service.

If it helps, don’t think of them as log files any more. Think of them as events. Yes, you can stash this stream in a file, but why would you?  On what disk?  Will that work for your serverless functions too?  Just stream them over the network to wherever you want to put them.

Log levels are another confusing and unnecessary artifact of yesteryear that you no longer really need. The more you think of structured events as logs, the more tempted you may be to apply the old set of best practices. So just don’t think of them as logs at all.

How to gather and structure your data

Instead of dribbling little pebbles of log effluvia throughout your code, do this.  A minimal code sketch of these steps follows the list.  (If you’re a Honeycomb user, our Beelines do it all automatically for you *and* pre-populate the blobs with everything we know of your context.)

  1. Initialize an empty blob at the beginning, when the request first enters the service.
  2. Stuff any and all interesting detail about the request into that blob throughout the lifetime of the request.
    • Any unique id, any high-cardinality variable, any headers passed in, every full query, normalized query, and query execution time; every http call out to a remote service, every http execution time; any shopping cart id, first and last name, execution time — literally anything interesting, append to blob.
  3. Then, when the request is about to exit or error, write the blob off to Honeycomb or another service or disk somewhere.
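Here is a minimal sketch of those three steps as Express-style middleware in TypeScript.  `sendEvent` is a hypothetical stand-in for shipping the blob to Honeycomb, another service, or disk, and the fields are examples rather than a required schema.

```typescript
import express from "express";

// Hypothetical transport: stand-in for shipping the event wherever you keep telemetry.
async function sendEvent(event: Record<string, unknown>): Promise<void> {
  process.stdout.write(JSON.stringify(event) + "\n");
}

const app = express();

app.use((req, res, next) => {
  // 1. Initialize an empty blob when the request first enters the service.
  const event: Record<string, unknown> = {
    service: "image-export",
    method: req.method,
    path: req.path,
    request_start: Date.now(),
  };
  res.locals.event = event;

  // 3. When the request is about to exit (or error), ship the one wide event.
  res.on("finish", () => {
    event.status_code = res.statusCode;
    event.duration_ms = Date.now() - (event.request_start as number);
    void sendEvent(event);
  });

  next();
});

app.get("/export/:id", (req, res) => {
  // 2. Stuff anything interesting about the request into the blob as you go.
  const event = res.locals.event as Record<string, unknown>;
  event.export_id = req.params.id;
  event.user_id = req.header("x-user-id");
  event.build_id = process.env.BUILD_ID;
  // ...queries, timings, cart ids, feature flags: anything that might help later.
  res.json({ ok: true });
});

app.listen(3000);
```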

You can see immediately how this method has radically different performance implications and risks than the earlier shotgun spray approach. No more “oops i accidentally put a print line INSIDE a for loop”. The write amplification profile is compressed. Most importantly, the incremental cost of capturing more detail about the request per service is nearly zero.

And now you have the kind of structured data that you can feed into something like a columnar store, or Honeycomb, and run ad hoc queries to your heart’s delight.

Distributed systems logging events best practices:

Let’s sum up.  (I’m including links to other past rants on this topic):

Just think.

No more doing multi-line regexps trying to look for the same request ID or user ID doing five suspicious things in a row.

No more regexps at all, for fuck’s sake.

No more bullshit percentiles that were computed at write time by averaging over a bunch of other averages.

No more having to jump around from dashboards to logs trying to vainly eyeball-correlate one spike with another.  No more wondering why no two tools can agree on whether anything even exists or not.

Just gather the detail you need to ask the questions when you need them, and store it in a single source of truth.  It’s that simple.

No need to shame people for having learned best practices that worked perfectly well for a long time.  You can either let them learn the hard way that this transformation is non-optional, or you can help them learn the easy way that it’s simply much better and easier to invest in this telemetry up front.  You seem like a nice enough chap, which is probably why you chose door 2.  (If you want to get tougher about it, have a few reformed folks in to tell their horror stories.  Try some ex-Twitter engineers.)

The hardest part seems to be getting people to unlearn all the best practices they once learned for dealing with logs.  So just don’t call it logs anymore, if that helps. Call it “structured events”.

– charity.

