How many pillars of observability can you fit on the head of a pin?

My day started off with an innocent question, from an innocent soul.

“Hey Charity, is profiling a pillar?”

I hadn’t even had my coffee yet.

“Someone was just telling me that profiling is the fourth pillar of observability now. I said I think profiling is a great tool, but I don’t know if it quite rises to the level of pillar. What do you think?”

What….do.. I think.

What I think is, there are no pillars. I think the pillars are a fucking lie, dude. I think the language of pillars does a lot of work to keep good engineers trapped inside a mental model from the 1980s, paying outrageous sums of money for tooling that can’t keep up with the chaos and complexity of modern systems.

Here is a list of things I have recently heard people refer to as the “fourth pillar of observability”:

  • Profiling
  • Tokens (as in LLMs)
  • Errors, exceptions
  • Analytics
  • Cost

Is it a pillar, is it not a pillar? Are they all pillars? How many pillars are there?? How many pillars CAN there be? Gaahhh!

This is not a new argument. Take this ranty little tweet thread of mine from way back in 2018, for starters.

 

Or perhaps you have heard of TEMPLE: Traces, Events, Metrics, Profiles, Logs, and Exceptions?

Or the “braid” of observability data, or “They Aren’t Pillars, They’re Lenses”, or the Lightstep version: “Three Pillars, Zero Answers” (that title is a personal favorite).

Alright, alright. Yes, this has been going on for a long time. I’m older now and I’m tireder now, so here’s how I’ll sum it up.

Pillar is a marketing term.
Signal is a technical term.

So “is profiling a pillar?” is a valid question, but it’s not a technical question. It’s a question about the marketing claims being made by a given company. Some companies are building a profiling product right now, so yes, to them, it is vitally important to establish profiling as a “pillar” of observability, because you can charge a hell of a lot more for a “pillar” than you can charge for a mere “feature”. And more power to them. But it doesn’t mean anything from a technical point of view.

On the other hand, “signal” is absolutely a technical term. The OpenTelemetry Signals documentation, which I consider canon, says that OTel currently supports Traces, Metrics, Logs, and Baggage as signal types, with Events and Profiles at the proposal/development stage. So yes, profiling is a type of signal.

The OTel docs define a telemetry signal as “a type of data transmitted remotely for monitoring and analysis”, and they define a pillar as … oh, they don’t even mention pillars? like at all??

I guess there’s your answer.

And this is probably where I should end my piece. (Why am I still typing…. 🤔)

Pillars vs signals

First of all, I want to stress that it does not bother me when engineers go around talking about pillars. Nobody needs to look at me guiltily and apologize for using the term ‘pillar’ at the bar after a conference because they think I’m mad at them. I am not the language police, it is not my job to go around enforcing correct use of technical terms. (I used to, I know, and I’m sorry! 😆)

When engineers talk about pillars of observability, they’re just talking about signals and signal types, and “pillar” is a perfectly acceptable colloquialism for “signal”.

When a vendor starts talking about pillars, though — as in the example above! — it means they are gearing up to sell you something: another type of signal, siloed off from all the other signals you send them. Your cost multiplier is about to increment again, and then they’re going to start talking about how Important it is that you buy a product for each and every one of the Pillars they happen to have.

As a refresher: there are two basic architecture models used by observability companies, the multiple pillars model and the unified storage model (aka o11y 2.0). The multiple pillars model is to store every type of signal in a different siloed storage location — metrics, logs, traces, profiling, exceptions, etc, everybody gets a database! The unified storage model is to store all signals together in ONE database, preserving context and relationships, so you can treat data like data: slice and dice, zoom in, zoom out, etc.

Most of the industry giants were built using the pillars model, but Honeycomb (and every other observability company founded post-2019) was built using the unified storage model, storing wide, structured log events in a columnar storage engine with high-cardinality support, and so on.

Bunny-hopping from pillar to pillar

When you use each signal type as a standalone pillar, this leads to an experience I think of as “bunny products” 🐇 where the user is always hopping from pillar to pillar. You see something on your metrics dashboard that looks scary? hop-hop to your logs and try to find it there, using grep and search and matching by timestamps. If you can find the right logs, then you need to trace it, so you hop-hop-hop to your traces and repeat your search there. With profiling as a pillar, maybe you can hop over to that dataset too.🐇🐰

The amount of data duplication involved in this model is mind-boggling. You are literally storing the same information in your metrics TSDB as you are in your logs and your traces, just formatted differently. (I never miss an opportunity to link to Jeremy Morrell’s masterful doc on instrumenting your code for wide events, which also happens to illustrate this nicely.) This is insanely expensive. Every request that enters your system gets stored how many times, in how many signals? Count it up; that’s your cost multiplier.
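To make the duplication concrete, here’s a toy sketch (plain Python, made-up field names, not any real vendor’s SDK) of what a single request turns into under the pillars model:

# Toy sketch: one request under the pillars model -- three renderings of the
# same facts, headed for three separate storage backends. (Field names made up.)

request = {"route": "/checkout", "status": 200, "duration_ms": 312,
           "user_id": "u_123", "trace_id": "abc123", "span_id": "span1"}

# Copy #1: a metric point for the TSDB (facts collapsed into a name plus tags)
metric = ("http.request.duration", request["duration_ms"],
          {"route": request["route"], "status": request["status"]})

# Copy #2: a log line for the log store (same facts, rendered as text)
log_line = (f'{request["route"]} {request["status"]} '
            f'{request["duration_ms"]}ms user={request["user_id"]}')

# Copy #3: a span for the trace store (same facts again, plus IDs)
span = {"name": "http.request", "trace_id": request["trace_id"],
        "span_id": request["span_id"], "duration_ms": request["duration_ms"],
        "attributes": {"route": request["route"], "status": request["status"]}}

# One request, stored (at least) three times: that's your cost multiplier.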

Worse, much of the data that connects each “pillar” exists only in the heads of the most senior engineers, so they can guess or intuit their way around the system, but anyone who relies on actual data is screwed. Some vendors have added an ability to construct little rickety bridges post hoc between pillars, e.g. “this metric is derived from this value in this log line or trace”, but now you’re paying for each of those little bridges in addition to each place you store the data (and it goes without saying, you can only do this for things you can predict or hook up in the first place).

The multiple pillars model (formerly known as observability 1.0) relies on you believing that each signal type must be stored separately and treated differently. That’s what the pillars language is there to reinforce. Is it a Pillar or not?? It doesn’t matter because pillars don’t exist. Just know that if your vendor is calling it a Pillar, you are definitely going to have to Pay for it. 😉

Zooming in and out

But all this data is just.. data. There is no good reason to silo signals off from each other, and lots of good reasons not to. You can derive metrics from rich, structured data blobs, or append your metrics to wide, structured log events. You can add span IDs and visualize them as a trace. The unified storage model (“o11y 2.0”) says you should store your data once, and do all the signal processing in the collection or analysis stages. Like civilized folks.
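Here’s the same idea as a toy sketch of the unified model (plain Python again, hypothetical field names): store wide events once, then derive whichever signal you want at analysis time.

# Toy sketch of the unified storage model: wide, structured events stored once,
# with metrics and traces derived from them at query time. (Field names made up.)

events = [
    {"trace_id": "abc123", "span_id": "1", "parent_id": None,
     "name": "http.request", "route": "/checkout", "status": 200, "duration_ms": 312},
    {"trace_id": "abc123", "span_id": "2", "parent_id": "1",
     "name": "db.query", "table": "orders", "status": 200, "duration_ms": 41},
]

# "Metric": computed from the raw events instead of stored separately.
checkout = [e for e in events
            if e["name"] == "http.request" and e["route"] == "/checkout"]
avg_latency_ms = sum(e["duration_ms"] for e in checkout) / len(checkout)

# "Trace": the same rows, grouped by trace_id and linked via parent_id.
trace = [e for e in events if e["trace_id"] == "abc123"]

print(avg_latency_ms, [e["name"] for e in trace])

Same rows, different questions. That’s the whole trick.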

All along, Anya was right

From the perspective of the developer, not much changes. It just gets easier (a LOT easier), because nobody is harping on you about whether this nit of data should be a metric, a log, a trace, or all of the above, or if it’s low cardinality or high cardinality, or whether the cardinality of the data COULD someday blow up, or whether it’s a counter, a gauge, a heatmap, or some other type of metric, or when the counter is going to get reset, or whether your heatmap buckets are defined at useful intervals, or…or…

Instead, it’s just a blob of json. Structured data. If you think it might be interesting to you someday, you dump it in, and if not, you don’t. That’s all. Cognitive load drops way down.

On the backend side, we store it once, retaining all the signal type information and connective tissue.

It’s the user interface where things change most dramatically. No more bunny hopping around from pillar to pillar, guessing and copy-pasting IDs and crossing your fingers. Instead, it works more like the zoom function on PDFs or Google maps.

You start with SLOs, maybe, or a familiar-looking metrics dashboard. But instead of hopping, you just.. zoom in. The SLOs and metrics are derived from the data you need to debug with, so you’re just like.. “Ah what’s my SLO violation about? Oh, it’s because of these events.” Want to trace one of them? Just click on it. No hopping, no guessing, no pasting IDs around, no lining up time stamps.

Zoom in, zoom out, it’s all connected. Same fucking data.

“But OpenTelemetry FORCES you to use three pillars”

There’s a misconception out there that OpenTelemetry is very pro-three pillars, and very anti o11y 2.0. This is a) not true and b) actually the opposite. Austin Parker has written a voluminous amount of material explaining that actually, under the hood, OTel treats everything like one big wide structured event log.

As Austin puts it, “OpenTelemetry, fundamentally, unifies telemetry signals through shared, distributed context.” However:

“The project doesn’t require you to do this. Each signal is usable more or less independently of the other. If you want to use OpenTelemetry data to feed a traditional ‘three pillars’ system where your data is stored in different places, with different query semantics, you can. Heck, quite a few very successful observability tools let you do that today!”

“This isn’t just ‘three pillars but with some standards on top,’ it’s a radical departure from the traditional ‘log everything and let god sort it out’ approach that’s driven observability practices over the past couple of decades.”

You can use OTel to reinforce a three pillars mindset, but you don’t have to. Most vendors have chosen to implement three pillarsy crap on top of it, which you can’t really hold OTel responsible for. One[1] might even argue that OTel is doing as much as it can to influence you in the opposite direction, while still meeting Pillaristas where they’re at.
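If you want to see what “unifies telemetry signals through shared, distributed context” looks like in code, here’s a minimal sketch using the OpenTelemetry Python API. (The instrument names and baggage keys are mine, purely for illustration; without an SDK configured, this all runs as a no-op.)

# Minimal sketch with the OpenTelemetry Python API (pip install opentelemetry-api).
# The point: traces, metrics, and baggage all hang off one shared context.

from opentelemetry import trace, metrics, baggage

tracer = trace.get_tracer("checkout")               # hypothetical instrumentation scope
meter = metrics.get_meter("checkout")
orders = meter.create_counter("orders_processed")   # hypothetical counter name

# Baggage: contextual key/values that travel alongside the trace context.
ctx = baggage.set_baggage("tenant.id", "acme")

with tracer.start_as_current_span("process_order", context=ctx) as span:
    span.set_attribute("cart.items", 3)
    # The counter increments while that span's context is active --
    # shared context is what ties the signals together.
    orders.add(1, {"payment.method": "card"})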

A postscript on profiling

What will profiling mean in a unified storage world? It just means you’ll be able to zoom in to even finer and lower-level resolution, down to syscalls and kernel operations instead of function calls. Like when Google Maps got good enough that you could read license plates instead of just rooftops.

Admittedly, we don’t have profiling yet at Honeycomb. When we did some research into the profiling space, what we learned was that most of the people who think they’re in desperate need of a profiling tool are actually in need of a good tracing tool. Either they didn’t have distributed tracing or their tracing tools just weren’t cutting it, for reasons that are not germane in a Honeycomb tracing world.

We’ll get to profiling, hopefully in the near-ish future, but for the most part, if you don’t need syscall level data, you probably don’t need profiling data either. Just good traces.

Also… I did not make this site or have any say whatsoever in the building of it, but I did sign the manifesto[2] and every day that I remember it exists is a day I delight in the joy and fullness of being alive: kill3pill.com 📈


Hop hop, little friends,
~charity

 

[1] Austin argues this. I’m talking about Austin, if not clear enough.
[2] Thank you, John Gallagher!!


On How Long it Takes to Know if a Job is Right for You or Not

A few eagle-eyed readers have noticed that it’s been 4 weeks since my last entry in what I have been thinking of as my “niblet series” — one small piece per week, 1000 words or less, for the next three months.

This is true. However, I did leave myself some wiggle room in my original goal, when I said “weeks when I am not traveling”, knowing I was traveling 6 of the next 7 weeks. I was going to TRY to write something on the weeks I was traveling, but as you can see, I mostly did not succeed. Oh well!

Honestly, I don’t feel bad about it. I’ve written well over 1k words on bsky over the past two weeks in the neverending thread on the costs and tradeoffs of remote work. (A longform piece on the topic is coming soon.) I also wrote a couple of lengthy internal pieces.

This whole experiment was designed to help me unblock my writing process and try out new habits, and I think I’m making progress. I will share what I’m learning at a later date, but for now: onward!

How long does it take to form an impression of a new job?

This week’s niblet was inspired by a conversation I had yesterday with an internet friend. To paraphrase (and lightly anonymize) their question:

“I took a senior management role at this company six months ago. My search for this role was all about values alignment, from company mission to leadership philosophy, and the people here said all the right things in the process. But it’s just not clicking.

It’s only been six months, but it’s starting to feel like it might not work out. How much longer should I give it?”

Zero. You should give it 0 time. You already know, and you’ve known for a long time; it’s not gonna change. I’m sorry. 💔

I’m not saying you should quit tomorrow, a person needs a paycheck, but you should probably start thinking in terms of how to manage the problem and extricate yourself from it, not like you’re waiting to see if it will be a good fit.

Every job I’ve ever had has made a strong first impression

I’ve had…let’s see…about six different employers, over the course of my (post-university) career.

Every job I’ve ever taken, I knew within the first week whether it was right for me or not. That might be overstating things a bit (memory can be like that). But I definitely had a strong visceral reaction to the company within days after starting, and the rest of my tenure played out more or less congruent with that reaction.

The first week at EVERY job is a hot mess of anxiety and nerves and second-guessing yourself and those around you. It’s never warm fuzzies. But at the jobs I ended up loving and staying at long term, the anxiety was like “omg these people are so cool and so great and so fucking competent, I hope I can measure up to their expectations.”

And then there were the jobs where the anxiety I felt was more like a sinking sensation of dread, of “oooohhh god I hope this is a one-off and not the kind of thing I will encounter every day.”

🌸 There was the job where they had an incident on my very first day, and by 7 pm I was like “why isn’t someone telling me I should go home?” There was literally nothing I could do to help, I was still setting up my accounts, yet I had the distinct impression I was expected to stay.

This job turned out to be stereotypically Silicon Valley in the worst ways, hiring young, cheap engineers and glorifying coding all night and sleeping under your desks.

🌼 There was the job where they were walking me through a 50-page Microsoft Word doc on how to manage replication between DB nodes, and I laughed a little, and looked for some rueful shared acknowledgement of how shoddy this was…but I was the only one laughing.

That job turned out to be shoddy, ancient, flaky tech all the way down, with comfortable, long-tenured staff who didn’t know (and did NOT want to hear) how out of date their tech had become.

Over time, I learned to trust that intuition

Around the time I became a solidly senior engineer, I began to reflect on how indelible my early impressions of each job had been, and how reliable those impressions had turned out to be.

To be clear, I don’t regret these jobs. I got to work with some wonderful people, and I got to experience a range of different organizational structures and types. I learned a lot from every single one of my jobs.

Perhaps most of all, I learned how to sniff out particular environments that really do not work for me, and I never made the same mistake twice.

Companies can and do change dramatically. But absent dramatic action, which can be quite painful, they tend to drift along their current trajectory.

This matters even more for managers

This is one of those ways that I think the work of management is different from the work of engineering. As an experienced IC, it’s possible to phone it in and still do a good job. As long as you’re shipping at an acceptable rate, you can check out mentally and emotionally, even work for people or companies you basically despise.

Lots of people do in fact do this. Hell, I’ve done it. You aren’t likely to do the best work of your life under these circumstances, but people have done far worse to put food on the table.

An IC can wall themselves off emotionally and still do acceptable work, but I’m not sure a manager can do the same.

Alignment *is* the job of management

As a manager, you literally represent the company to your team and those around you. You don’t have to agree with every single decision the company makes, but if you find yourself constantly having to explain and justify things the company has done that deeply violate your personal beliefs or ethics, it does you harm.

Some managers respond to a shitty corporate situation by hunkering down and behaving like a shit umbrella: doing whatever they can to protect their people, at the cost of undermining the company itself. I don’t recommend this, either. It’s not healthy to know you walk around every day fucking over one of your primary stakeholders, whether it’s the company OR your teammates.

There are also companies that aren’t actually that bad, but you just aren’t aligned with them. That’s fine. Alignment matters a lot more for managers than for ICs, because alignment is the job.

Management is about crafting and tending to complex sociotechnical systems. No manager can do this alone. Having a healthy, happy team of direct reports is only a fraction of the job description. It’s not enough. You can and should expect more.

What can you learn from the experience?

I asked my friend to think back to the interview process. What were the tells? What do they wish they had known to watch out for?

They thought for a moment, then said:

“Maybe the fact that the entire leadership team had been grown or promoted from within. SOME amount of that is terrific, but ALL of it might be a yellow flag. The result seems to be that everyone else thinks and feels the same way…and I think differently.”

This is SO insightful.

It reminds me of all the conversations Emily and I have had over the years, on how to balance developing talent from within vs bringing in fresh perspectives, people who have already seen what good looks like at the next stage of growth, people who can see around corners and challenge us in different ways.

This is a tough thing to suss out from the outside, especially when the employer is saying all the right things. But having an experience like this can inoculate you from an entire family of related mistakes. My friend will pick up on this kind of insularity from miles away, from now on.

Bad jobs happen. Interviews can only predict so much. A person who has never had a job they disliked is a profoundly lucky person. In the end, sometimes all you can take is the lessons you learned and won’t repeat.

The pig is committed

Have you ever heard the metaphor of the chicken vs the pig? The chicken contributes an egg to breakfast, the pig contributes bacon. The punch line goes something like, “the chicken is involved, but the pig is committed!”

It’s vivid and a bit over the top, but I kept thinking about it while writing this piece. The engineer contributes their labor and output to move the company forward, but the manager contributes their emotional and relational selves — their humanity — to serve the cause.

You only get one career. Who are you going to give your bacon to?
