Live Your Best Life With Structured Events

August 15, 2022July 14, 2023 mipsytipsyengineering, events, logs, metrics, monitoring, traces6 Comments

If you’re like most of us, you learned to debug as a baby engineer by way of printf(3). By the time you were shipping code to production you had probably learned to instrument your code with a real metrics library. Maybe a tenth of us learned to use gdb and still step through functions on the regular. (I said maybe.)

Printing stuff to stdout is still the Swiss Army knife of tools. Always there when you reach for it, usually helps more than it does harm. (I said usually.)

And then! In case you’ve been living under a rock, we recently went and blew up ye aulde monolythe, and in the process we … lost most of our single-process tools and techniques for debugging. Forget gdb; even printf doesn’t work when you’re hopping the network between functions.

If your tool set no longer works for you, friend, it’s time to go all in. Maybe what you wanted was a faster horse, but it’s time for a car, and the sooner you turn in your oats for gas cans and a spare tire, the better.

Exercising Good Technical Judgment (When You Don’t Have Any)

If you’re stuck trying to debug modern problems with pre-modern tooling, the first thing to do is stop digging the hole. Stop pushing good data after bad into formats and stores that aren’t going to help you answer the right questions.

In brief: if you aren’t rolling out a solution based on arbitrarily wide, structured raw events that are unique and ordered and trace-aware and without any aggregation at write time, you are going to regret it. (If you aren’t using OpenTelemetry, you are going to regret that, too.)

So just make the leap as soon as possible.

But let’s rewind a bit. Let’s start with observability.

Observability: an introduction

Observability is not a new word or concept, but the definition of observability as a specific technical term applied to software engineering is relatively new — about four years old. Before that, if you heard the term in softwareland it was only as a generic synonym for telemetry (“there are three pillars of observability”, in one annoying formulation) or team names (twitter, for example, has long had an “observability team”).

The term itself originates with control theory:

“In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals. The concept of observability was introduced by Hungarian-American engineer Rudolf E. Kálmán for linear dynamic systems.[1][2]”

But when applied to a software context, observability refers to how well you can understand and reason about your systems, just by interrogating them and inspecting their outputs with your tools. How well can you understand the inside of the system from the outside?

Achieving this relies your ability to ask brand new questions, questions you have never encountered and never anticipated — without shipping new code. Shipping new code is cheating, because it means that you knew in advance what the problem was in order to instrument it.

But what about monitoring?

Monitoring has a long and robust history, but it has always been about watching your systems for failures you can define and expect. Monitoring is for known-unknowns, and setting thresholds and running checks against the system. Observability is about the unknown-unknowns. Which requires a fundamentally different mindset and toolchain.

“Monitoring is the action of observing and checking the behavior and outputs of a system and its components over time.” — @grepory, in his talk “Monitoring is Dead“.

Monitoring is a third-person perspective on your software. It’s not software explaining itself from the inside out, it’s one piece of software checking up on another.

Observability is for understanding complex, ephemeral, dynamic systems (not for debugging code)

You don’t use observability for stepping through functions; it’s not a debugger. Observability is for swiftly identifying where in your system the error or problem is coming from, so you can debug it — by reproducing it, or seeing what it has in common with other erroring requests. You can think of observability as being like B.I. (business intelligence) tooling for software applications, in the way you engage in a lot of exploratory, open-ended data sifting to detect novel patterns and behaviors.

Observability is often about swiftly isolating or tracking down the problem in your large, sprawling, far-flung, dynamic system. Because the hard part of distributed systems is rarely debugging the code, it’s figuring out where the code you need to debug is.

The need for observability is often associated with microservices adoption, because they are prohibitively difficult to debug without service-level event oriented tooling — the kind you can get from Honeycomb and Lightstep.. and soon, I hope, many other vendors.

Events are the building blocks of observability

Ergh, another overloaded data term. What even is an “event”?

An observability “event” is a hop in the lifecycle of an end-to-end request. If a request executes code on three services separated by network hops before returning to the user, that request generated three observability “events”, each packed with context and details about that code running in that environment. These are also sometimes called “canonical log lines“. If you implemented tracing, each event may be a span in your trace.

If request ID #A897BEDC hits your edge, then your API service, then four more internal services, and twice connects to a db and runs a query, then request ID #A897BEDC generated 8 observability events … assuming you are in fact gathering observability data from the edge, the API, the internal services and the databases.

This is an important caveat. We only gather observability events from services that we can and do introspect. If it’s a black box to us, that hop cannot generate an observability event. So if request ID #A897BEDC also performed 20 cache lookups and called out to 8 external HTTP services and 2 managed databases, those 30 hops do not generate observability events (assuming you haven’t instrumented the memcache service and have no instrumentation from those external services/dbs). Each request generates one event per request per service hop.**

(I also wrote about logs vs structured events here.)

Observability is a first-person narrative.

We care primarily about self-reported status from the code as it executes the request path.

Instrumentation is your eyes and ears, explaining the software and its environment from the perspective of your code. Monitoring, on the other hand, is traditionally a third-person narrative — it’s one piece of software checking up on another piece of software, with no internal knowledge of its hopes and dreams.

First-person narrative reports have the best potential for telling a reliable narrative. And more importantly, they map directly to user experience in a way that third-party monitoring does not and cannot.

Events … must be structured.

First, structure your goddamn data. You’re a computer scientist, you’ve got no business using text search to plow through terabytes of text.

Events … are not just structured logs.

Now, part of the reason people seem to think structured data is cost-prohibitive is that they’re doing it wrong. They’re still thinking about these like log lines. And while you can look at events like they’re just really wide structured log lines that aren’t flushed to disk, here’s why you shouldn’t: logs have decades of abhorrent associations and absolutely ghastly practices.

Instead of bundling up and passing along one neat little pile of context, they’re spewing log lines inside loops in their code and DDoS’ing their own logging clusters.They’re shitting out “log lines” with hardly any dimensions so they’re information-sparse and just straight up wasting the writes. And then to compensate for the sparseness and repetitiveness they just start logging the same exact nouns tens or hundreds of times over the course of the request, just so they can correlate or reconstruct some lousy request that they never should have blown up in the first place!

But they keep hearing they should be structuring their logs, so they pile structure on to their horrendous little strings, which pads every log line by a few bytes, so their bill goes up but they aren’t getting any benefit! just paying more! What the hell, structuring is bull shit!

Kittens. You need a fundamentally different approach to reap the considerable benefits of structuring your data.

But the difference between strings and structured data is ~basically the difference between grep and all of computer science. 😛

Events … must be arbitrarily wide and dense with context.

So the most effective way to structure your instrumentation, to get the absolute most bang for your buck, is to emit a single arbitrarily wide event per request per service hop. At Honeycomb, the maturely instrumented datasets that we see are often 200-500 dimensions wide. Here’s an event that’s just 20 dimensions wide:

{ 

   "timestamp":"2018-11-20 19:11:56.910",
   "az":"us-west-1",
   "build_id":"3150",
   "customer_id":"2310",
   "durationMs":167,
   "endpoint":"/api/v2/search",
   "endpoint_shape":"/api/v2/search",
   "fraud_dur":131,
   "hostname":"app14",
   "id":"f46691dfeda9ede4",
   "mysql_dur":"",
   "name":"/api/v2/search",
   "parent_id":"",
   "platform":"android",
   "query":"",
   "serviceName":"api",
   "status_code":200,
   "traceId":"f46691dfeda9ede4",
   "user_id":"344310",
   "error_rate":0,
   "is_root":"true"
}

So a well-instrumented service should have hundreds of these dimensions, all bundled around the context of each request. And yet — and here’s why events blow the pants off of metrics — even with hundreds of dimensions, it’s still just one write. Adding more dimensions to your event is effectively free, it’s still one write plus a few more bits.

Compare this to a metric-based systems, where you are often in the position of trying to predict whether a metric will be valuable enough to justify the extra write, because every single metric or tag you add contributes linearly to write amplification. Ever gotten billed tens of thousands of dollars for your custom metrics, or had to prune your list of useful custom metrics down to something affordable? (“BUT THOSE ARE THE ONLY USEFUL ONES!”, as every ops team wails)

Events … must pass along the blob of context as the request executes

As you can imagine, it can be a pain in the ass to keep passing this blob of information along the life of the request as it hits many services and databases. So at Honeycomb we do all the annoying parts for you with our integrations. You just install the go pkg or ruby gem or whatever, and under the hood we:

initialize an empty debug event when the request enters that service
prepopulate the empty debug event with any and all interesting information that we already know or can guess. language type, version, environment, etc.
create a framework so you can just stuff any other details in there as easily as if you were printing it out to stdout
pass the event along and maintain its state until you are ready to error or exit
write the extremely wide event out to honeycomb

Easy!

(Check out this killer talk from @lyddonb on … well everything you need to know about life, love and distributed systems is in here, but around the 12:00 mark he describes why this approach is mandatory. WATCH IT. https://www.youtube.com/watch?v=xy3w2hGijhE&feature=youtu.be)

Events … should collect context like sticky buns collect dust

Other stuff you’ll want to track in these structured blobs includes:

Metadata like src, dst headers
The timing stats and contents of every network call (our beelines wrap all outgoing http calls and db queries automatically)
Every raw db query, normalized query family, execution time etc
Infra details like AZ, instance type, provider
Language/environment context like $lang version, build flags, $ENV variables
Any and all unique identifying bits you can get your grubby little paws on — request ID, shopping cart ID, user ID, request ID, transaction ID, any other ID … these are always the highest value data for debugging.
Any other useful application context. Service name, build id, ordering info, error rates, cache hit rate, counters, whatever.
Possibly the system resource state at this point in time. e.g. values from /proc/net/ipv4 stats

Capture all of it. Anything that ever occurs to you (“this MIGHT be handy someday”) — don’t even hesitate, just throw it on the pile. Collect it up in one rich fat structured blob.

Events … must be unique, ordered, and traceable

You need a unique request ID, and you need to propagate it through your stack in some way that preserves sequence. Once you have that, traces are just a beautiful visualization layer on top of your shiny event data.

Events … must be stored raw.

Because observability means you need to be able to ask any arbitrary new question of your system without shipping new code, and aggregation is a one-way trip. Once you have aggregated your data and discarded the raw requests, you have destroyed your ability to ask new questions of that data forever. For Ever.

Aggregation is a one-way trip. You can always, always derive your pretty metrics and dashboards and aggregates from structured events, and you can never go in reverse. Same for traces, same for logs. The structured event is the gold standard. Invest in it now, save your ass in the future.

It’s only observability if you can ask new questions. And that means storing raw events.

Events…are richer than metrics

There’s always tradeoffs when it comes to data. Metrics choose to sacrifice context and connective tissue, and sometimes high cardinality support, which you need to correlate anomalies or track down outliers. They have a very small, efficient data format, but they sacrifice everything else by discarding all but the counter, gauge, etc.

A metric looks like this, by the way.

{ metric: "db.query.time", value: 0.502, tags: Array(), type: set }

That’s it. It’s just a name, a number and maybe some tags. You can’t dig into the event and see what else was happening when that query was strangely slow. You can never get that information back after discarding it at write time.

But because they’re so cheap, you can keep every metric for every request! Maybe. (Sometimes.) More often, what happens is they aggregate at write time. So you never actually get a value written out for an individual event, it smushes everything together that happens in the 1 second interval and calculates some aggregate values to write out. And that’s all you can ever get back to.

With events, and their relative explosion of richness, we sacrifice our ability to store every single observability event about every request. At FB, every request generated hundreds of observability events as it made its way through the stack. Nobody, NOBODY is going to pay for an o11y stack that is hundreds of times as large as production. The solution to that problem is sampling.

Events…should be sampled.

But not dumb, blunt sampling at server side. Control it on the client side.

Then sample heavily for events that are known to be common and useless, but keep the events that have interesting signal. For example: health checks that return 200 OK usually represent a significant chunk of your traffic and are basically useless, while 500s are almost always interesting. So are all requests to /login or /payment endpoints, so keep all of them. For database traffic: SELECTs for health checks are useless, DELETEs and all other mutations are rare but you should keep all of them. Etc.

You don’t need to treat your observability metadata with the same care as you treat your billing data. That’s just dumb.

… To be continued.

I hope it’s now blazingly obvious why observability requires — REQUIRES — that you have access to raw structured events with no pre-aggregation or write-time rollups. Metrics don’t count. Just traces don’t count. Unstructured logs sure as fuck don’t count.

Structured, arbitrarily wide events, with dynamic sampling of the boring parts to control costs. There is no substitute.

For more about the technical requirements for observability, read this, this, or this.

**The deep fine print: it’s one observability event per request per service hop … because we gather observability detail organized by request id. Databases may be different. For example, with MongoDB or MySQL, we can’t instrument them to talk to honeycomb directly, so we gather information about its internal perspective by 1) tailing the slow query log (and turning it up to log all queries if perf allows), 2) streaming tcp over the wire and reconstructing transactions, 3) connecting to the mysql port as root every couple seconds from cron, then dumping all mysql stats and streaming them in to honeycomb as an event. SO. Database traffic is not organized around connection length or unique request id, it is organized around transaction id or query id. Therefore it generates one observability event per query or transaction.

In other words: if your request hit the edge, API, four internal services, two databases … but ran 1 query on one db and 10 queries on the second db … you would generate a total of *19 observability events* for this request.

For more on observability for databases and other black boxes, try this blog post.

Logs vs Structured Events

February 5, 2019July 14, 2023 mipsytipsyevents, logs, monitoring, operations, serverless, sre10 Comments

I got an interesting tweet the other day from @evntdrvn in response to this thread of mine. Paraphrasing,

“So I’ve almost got our group at work up to Step 1 in your observability maturity model, but some of the devs that I work with want to turn OFF our lovely structured logging in prod for informational-level msgs due to their legacy philosophy (‘we only log errors in prod’). The reasons given are mostly philosophical (“I’m a dev and only interested when things error out, I don’t want any other noise in prod logs”, “I don’t want to slow my app down in prod”). Help?!?”

As I was reading this, I was itching to fly out and dive into battle with Eric. I know exactly where his opinionated devs are coming from. I used to say the same things! I even wrote a whole blog post about it.

These developers have internalized a set of rules and best practices for dealing with output data, in the context of “monolith application development in the early 2000s”.

Monolithic systems assumptions

Those systems had many common constraints and assumptions, such as:

We have a monolith service, or a very small number of services. We can model the system in our heads.
Logging is done to local disk, which can impact performance
Disks are expensive
Log lines are spat out inline with execution. A poorly placed printf can take the whole system down.
Investigation is rare, and usually means a human reading error logs.
Logging is of poor utility for understanding internal states or execution paths; you should just read the code or use a debugger. (There are few or network hops between functions.)
Logging is mostly useful for detecting certain terminal crash states or connection errors.

Monolithic logging best practices

Therefore:

We should be very stingy in what we log
Debuggers should be used for understanding internal states of the code
Logs are a last resort and record of crash dumps. We do not expect to use log data in the course of our daily work. We assume log-related manual investigation will be infrequent and of limited utility.

These were exactly the right lessons to learn in the era of expensive hardware and monolithic repos/artifacts. Many people still work in environments like this, and follow logging best practices like these. God bless, more power to em.

Distributed systems assumptions

But more and more of us face systems that are very different.

We have many services, possibly many MANY services. A representative request will have “many” hops across “many” services and routers and proxies and meshes and storage systems.
We cannot model the system in our heads; it would be a mistake to try. We rely on tooling as the source of truth for those systems.
You may or may not have access to those services, or the systems your code runs on. There may or may not be a logging facility, or a centralized log aggregator. Your only view of the system is through the instrumentation of your code.
Disks and system resources are cheap, ephemeral, all but disposable.
Data services are similarly cheap. We can almost entirely silo application performance off from the cost of writing perf data out.
Investigation is prohibitively slow and expensive for a human to do by hand. Many of the nodes or processes we need to inspect may no longer even exist, but their past states may still be relevant to us in understanding patterns to the present time.
Investigation should usually be done distributedly, across all instantiations of your code, however many there might be — and in real time
Investigation requires computation — not just string search. We need to ask on the fly involving math and percentiles and breakdowns and group by’s. And we need access to the raw requests in order to run accurate computations — no pre-aggregates.
The hardest part isn’t usually debugging the code, it’s figuring out where is the code you need to debug. Or what the errors or outliers have in common from the perspective of the code. Fixing the code itself is often comparatively trivial, once found.
What even is ‘logging’?
What even is ‘local disk’?

This isn’t optional: at some point of complexity or scale or distributedness, it becomes necessary if you want to work with these systems.

Logs can’t help you here.

And you aren’t going to get that kind of explorable data out of loglevel:ERROR, or by chopping up your telemetry into disconnected metrics devoid of context.

You are only going to get this kind of explorable, ad hoc, computation-friendly data if you take a radically new approach to how you output and aggregate telemetry. You’re going to need to replace your log lines and log levels with a different sort of beast: arbitrarily wide structured events that describe the request and its context, one event per request per service.

If it helps, don’t think of them as log files any more. Think of them as events. Yes, you can stash this stream in a file, but why would you? on what disk? will that work for your serverless functions too? Just stream them over the network to wherever you want to put them.

Log levels are another confusing and unnecessary artifact of yesteryear that you no longer really need. The more you think of structured events as logs, the more tempted you may be to apply the old set of best practices. So just don’t think of them as logs at all.

How to gather and structure your data

Instead of dribbling little pebbles of log effluvia throughout your code, do this. (If you’re a honeycomb user, our beelines do it all automatically for you *and* pre-propagate the blobs with everything we know of your context.)

Initialize an empty blob at the beginning, when the request first enters the service.
Stuff any and all interesting detail about the request into that blob throughout the lifetime of the request.
- Any unique id, any high-cardinality variable, any headers passed in, every full query, normalized query, and query execution time; every http call out to a remote service, every http execution time; any shopping cart id, first and last name, execution time — literally anything interesting, append to blob.
Then, when the request is about to exit or error, write the blob off to honeycomb or another service or disk somewhere.

You can see immediately how this method has radically different performance implications and risks than the earlier shotgun spray approach. No more “oops i accidentally put a print line INSIDE a for loop”. The write amplification profile is compressed. Most importantly, the incremental cost of capturing more detail about the request per service is nearly zero.

And now you have the kind of structured data that you can feed into something like a columnar store, or honeycomb, and run ad hoc queries to your heart’s delight.

Distributed systems logging events best practices:

Let’s sum up. (I’m including links to other past rants on this topic):

Emit a rich record from the perspective of the request as it executes the code. Include all the context you can get your paws on.
Emit a single event per request per service that it hits. Write it out just before the request errors or exits the service.
Bypass local disk entirely, write to a remote service.
Sample if needed for cost or resource constraints. Practice dynamic sampling.
Treat this like operational data, not transactional data. Be profligate and disposable.
Feed this data into a columnar store or honeycomb or similar
Now use it every day. Not just as a last resort. Get knee deep in production every single day. Explore. Ask and answer rich questions about your systems, system quality, system behavior, outliers, error conditions, etc. You will be absolutely amazed how useful it is … and appalled by what you turn up. 🙂

Just think.

No more doing multi-line regexps trying to look for the same request ID or user ID doing five suspicious things in a row.

No more regexps at all, for fuck’s sake.

No more bullshit percentiles that were computed at write time by averaging over a bunch of other averages

No more having to jump around from dashboards to logs trying to vainly eyeball correlate one spike with another. No more wondering why no two tools can agree if anything even exists or not

Just gather the detail you need to ask the questions when you need them, and store it in a single source of truth. It’s that simple.

No need to shame people from learning best practices that worked perfectly well for a long time. You can either let them learn the hard way that this transformation is non optional, or you can help them learn the easy way that it’s simply much better and easier to invest in this telemetry up front. You seem like a nice enough chap, which is probably why you chose door 2. (If you wanted to get tougher about it, have a few reformed folks in to tell their horror stories. Try some ex-twitter engineers.)

The hardest part seems to be getting people to unlearn all the best practices they once learned for dealing with logs. So just don’t call it logs anymore, if that helps. Call it “structured events”.

– charity.

charity.wtf

charity wtf's about technology, databases, startups, engineering management, and whiskey.

events