The Cost Crisis in Observability Tooling

Originally posted on the Honeycomb blog on January 24th, 2024

The cost of services is on everybody’s mind right now, with interest rates rising, economic growth slowing, and organizational budgets increasingly feeling the pinch. But I hear a special edge in people’s voices when it comes to their observability bill, and I don’t think it’s just about the cost of goods sold. I think it’s because people are beginning to correctly intuit that the value they get out of their tooling has become radically decoupled from the price they are paying.

In the happiest cases, the price you pay for your tools is “merely” rising at a rate several times faster than the value you get out of them. But that’s actually the best case scenario. For an alarming number of people, the value they get actually decreases as their bill goes up.

Observability 1.0 and the cost multiplier effect

Are you familiar with this chestnut?

“Observability has three pillars: metrics, logs, and traces.”

This isn’t exactly true, but it’s definitely true of a particular generation of tools—one might even say definitionally true of a particular generation of tools. Let’s call it “observability 1.0.”

From an evolutionary perspective, you can see how we got here. Everybody has logs… so we spin up a service for log aggregation. But logs are expensive and everybody wants dashboards… so we buy a metrics tool. Software engineers want to instrument their applications… so we buy an APM tool. We start unbundling the monolith into microservices, and pretty soon we can’t understand anything without traces… so we buy a tracing tool. The front-end engineers point out that they need sessions and browser data… so we buy a RUM tool. On and on it goes.

Logs, metrics, traces, APM, RUM. You’re now paying to store telemetry five different ways, in five different places, for every single request. And a 5x multiplier is on the modest side of the spectrum, given how many companies pay for multiple overlapping tools in the same category. You may also be collecting:

  • Profiling data
  • Product analytics
  • Business intelligence data
  • Database monitoring/query profiling tools
  • Mobile app telemetry
  • Behavioral analytics
  • Crash reporting
  • Language-specific profiling data
  • Stack traces
  • CloudWatch or hosting provider metrics
  • …and so on.

So, how many times are you paying to store data about your user requests? What’s your multiplier? (If you have one consolidated vendor bill, this may require looking at your itemized bill.)

There are many types of tools, each gathering slightly different data for a slightly different use case, but underneath the hood there are really only three basic data types: the metric, unstructured logs, and structured logs. Each of these has its own distinctive trade-offs when it comes to how much it costs and how much value you can get out of it.

Metrics

Metrics are the great-granddaddy of telemetry formats; tiny, fast, and cheap. A “metric” consists of a single number, often with tags appended. All of the context of the request gets discarded at write time; each individual metric is emitted separately. This means you can never correlate one metric with another from the same request, or select all the metrics for a given request ID, user, or app ID, or ask arbitrary new questions about your metrics data.
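To make that write-time context loss concrete, here’s a minimal Python sketch; `emit_metric` is a hypothetical stand-in for whatever metrics client you use, and the wire format is made up:

```python
import time

def emit_metric(name: str, value: float, tags: dict[str, str]) -> None:
    """Hypothetical metrics client: formats one standalone data point."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    print(f"{name}:{value}|{int(time.time())}|{tag_str}")

# Each data point stands alone. Nothing ties these two emissions back to
# the request that produced them, so you can never ask "what was the
# latency of the requests that returned 500?"
emit_metric("http.requests", 1, {"service": "checkout", "status": "500"})
emit_metric("http.latency_ms", 212.0, {"service": "checkout"})
```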

Metrics-based tools include vendors like Datadog and open-source projects like Prometheus. RUM tools are built on top of metrics to understand browser user sessions; APM tools are built on top of metrics to understand application performance.

When you set up a metrics tool, it generally comes prepopulated with a bunch of basic metrics, but the useful ones are typically the custom metrics you emit from your application.

Your metrics bill is usually dominated by the cost of these custom metrics. At minimum, your bill goes up linearly with the number of custom metrics you create. Which is unfortunate, because to restrain your bill from unbounded growth, you have to regularly audit your metrics, do your best to guess which ones are going to be useful in the future, and prune any you think you can afford to go without. Even in the hands of experts, these tools require significant oversight.

Linear cost growth is the goal, but it’s rarely achieved. The cost of each metric varies wildly depending on how you construct it, what the values are, how often it gets hit, etc. I’ve seen a single custom metric cost $30k per month. You probably have dozens of custom metrics per service, and it’s almost impossible to tell how much each of them costs you. Metrics bills tend to be incredibly opaque (possibly by design).

Nobody can understand their software or their systems with a metrics tool alone, because the metric is extremely limited in what it can do. No context, no cardinality, no strings… only basic static dashboards. For richer data, we must turn to logs.

Unstructured logs

You can understand much more about your code with logs than you can with metrics. Logs are typically emitted multiple times throughout the execution of the request, with one or a small number of nouns per log line, plus the request ID. Unstructured logs are still the default, although this is slowly changing.

The cost of unstructured logs is driven by a few things:

  • Write amplification. If you want to capture lots of rich context about the request, you need to emit a lot of log lines. If you are printing out just 10 log lines per request, per service, and you have half a dozen services, that’s 60 log events for every request.
  • Noisiness. It’s extremely easy to accidentally blow up your log footprint yet add no value—e.g., by putting a print statement inside a loop instead of outside the loop. Here, the usefulness of the data goes down as the bill shoots up.
  • Constraints on physical resources. Due to the write amplification of log lines per request, it’s often physically impossible to log everything you want to log for all requests or all users—it would saturate your NIC or disk. Therefore, people tend to use blunt instruments like these to blindly slash the log volume:
    • Log levels
    • Consistent hashes
    • Dumb sample rates

When you emit multiple log lines per request, you end up duplicating a lot of raw data; sometimes over half the bits are consumed by request ID, process ID, timestamp. This can be quite meaningful from a cost perspective.
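Here’s a rough sketch of the familiar pattern, assuming a typical checkout handler with a handful of log lines per request; notice how much of every line is the same request ID and timestamp boilerplate:

```python
import logging

logging.basicConfig(format="%(asctime)s %(message)s", level=logging.INFO)
log = logging.getLogger("checkout")

def handle_request(request_id: str, user_id: str, cart_size: int) -> None:
    # One noun per line, with the request ID repeated on every line.
    # Multiply by every service the request touches, and the duplicated
    # IDs and timestamps start to dominate the bytes you pay to store.
    log.info("request_id=%s starting checkout", request_id)
    log.info("request_id=%s user=%s", request_id, user_id)
    log.info("request_id=%s cart_size=%d", request_id, cart_size)
    log.info("request_id=%s calling payment service", request_id)
    log.info("request_id=%s payment ok", request_id)
    log.info("request_id=%s checkout complete", request_id)

handle_request("req-123", "u_31337", 2)
```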

All of these factors can be annoying. But the worst thing about unstructured logs is that the only way to query them is full-text search. The more data you have, the slower that search becomes, and there’s not much you can do about it.

Searching your logs over any meaningful length of time can take minutes or even hours, which means experimenting and looking around for unknown-unknowns is prohibitively time-consuming. You have to know what to look for in order to find it. Once again, as your logging bill goes up, the value goes down.

Structured logs

Structured logs are gaining adoption across the industry, especially as OpenTelemetry picks up steam. The nice thing about structured logs is that you can actually do things with the data other than slow, dumb string searches. If you’ve structured your data properly, you can perform calculations! Compute percentiles! Generate heatmaps!

Tools built on structured logs are so clearly the future. But just taking your existing logs and adding structure isn’t quite good enough. If all you do is stuff your existing log lines into key/value pairs, the problems of amplification, noisiness, and physical constraints remain unchanged—you can just search more efficiently and do some math with your data.
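A minimal sketch of what that naive structuring looks like; the same lines as before, just wrapped in JSON (field names are illustrative):

```python
import json
import time

def log_json(request_id: str, **fields) -> None:
    """Emit one structured log line as JSON."""
    print(json.dumps({"timestamp": time.time(), "request_id": request_id, **fields}))

# Querying got better (you can filter on cart_size, compute percentiles),
# but it is still many events per request, each repeating the same metadata.
log_json("req-123", message="starting checkout")
log_json("req-123", user_id="u_31337")
log_json("req-123", cart_size=2)
```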

There are a number of things you can and should do to your structured logs in order to use them more effectively and efficiently. In order of achievability:

  • Instrument your code using the principles of canonical logs, collecting all the vital characteristics of a request into one wide, dense event (see the sketch just after this list). It is difficult to overstate the value of doing this, for reasons of usefulness and usability as well as cost control.
  • Add trace IDs and span IDs so you can trace your code using the same events instead of having to use an entirely separate tool.
  • Feed your data into a columnar storage engine so you don’t have to predefine a schema or indexes, deciding in advance which dimensions future-you can search or compute on.
  • Use a storage engine that supports high cardinality, with an explorable interface.
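To make the canonical log pattern concrete, here’s a minimal Python sketch. `charge` is a hypothetical stand-in for your real business logic, and the field names are illustrative:

```python
import json
import time
import uuid

def charge(request: dict) -> str:
    """Hypothetical stand-in for your real payment call."""
    return "stripe"

def handle_request(request: dict) -> None:
    start = time.time()
    # Accumulate everything notable about the request into ONE wide event.
    event = {
        "trace.trace_id": uuid.uuid4().hex,  # same event doubles as a span
        "trace.span_id": uuid.uuid4().hex,
        "name": "POST /checkout",
        "user.id": request["user_id"],       # high-cardinality fields welcome
        "cart.size": len(request["cart"]),
    }
    try:
        event["payment.provider"] = charge(request)
        event["status_code"] = 200
    except Exception as e:
        event["error"] = str(e)
        event["status_code"] = 502
    finally:
        event["duration_ms"] = round((time.time() - start) * 1000, 2)
        print(json.dumps(event))  # one dense line per request, per service

handle_request({"user_id": "u_31337", "cart": ["sku_1", "sku_2"]})
```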

If you go far enough down this path of enriching your structured events, instrumenting your code with the right data, and displaying it in real time, you will reach an entirely different set of capabilities, with a cost model so distinct it can only be described as “observability 2.0.” More on that in a second.

Ballooning costs are baked into observability 1.0

To recap: high costs are baked into the observability 1.0 model. Every pillar has a price.

You have to collect and store your data—and pay to store it—again and again and again, for every single use case. Depending on how many tools you use, your observability bill may be growing at a rate 3x faster than your traffic is growing, or 5x, or 10x, or even more.

It gets worse. As your costs go up, the value you get out of your tools goes down.

  • Your logs get slower and slower to search.
  • You have to know what you’re searching for in order to find it.
  • You have to use blunt-force sampling techniques to keep log volume from blowing up.
  • Any time you want to be able to ask a new question, you first have to commit new code and deploy it.
  • You have to guess which custom metrics you’ll need and which fields to index in advance.
  • As volume goes up, your ability to find a needle in the haystack—any unknown-unknowns—goes down commensurately.

And nothing connects any of these tools. You cannot correlate a spike in your metrics dashboard with the same requests in your logs, nor can you trace one of the errors. It’s impossible. If your APM and metrics tools report different error rates, you have no way of resolving this confusion. The only thing connecting any of these tools is the intuition and straight-up guesses made by your most senior engineers. Which means that the cognitive costs are immense, and your bus factor risks are very real. The most important connective data in your system—connecting metrics with logs, and logs with traces—exists only in the heads of a few people.

At the same time, the engineering overhead required to manage all these tools (and their bills) rises inexorably. With metrics, an engineer needs to spend time auditing your metrics, tracking people down to fix poorly constructed metrics, and reaping those that are too expensive or don’t get used. With logs, an engineer needs to spend time monitoring the log volume, watching for spammy or duplicate log lines, pruning or consolidating them, choosing and maintaining indexes.

But all the time spent wrangling observability 1.0 data types isn’t even the costliest part. The most expensive part is the unseen costs inflicted on your engineering organization as development slows down and tech debt piles up, due to low visibility and thus low confidence.

Is there an alternative? Yes.

The cost model of observability 2.0 is very different

Observability 2.0 doesn’t have three pillars; it has a single source of truth. Observability 2.0 tools are built on top of arbitrarily-wide structured log events, also known as spans. From these wide, context-rich structured log events you can derive the other data types (metrics, logs, or traces).

Since there is only one data source, you can correlate and cross-correlate to your heart’s content. You can switch fluidly back and forth between slicing and dicing, breaking down or grouping by events, and viewing them as a trace waterfall. You don’t have to worry about cardinality or key space limitations.

You also effectively get infinite custom metrics, since you can append as many as you want to the same events. Not only does your cost not go up linearly as you add more custom metrics, but your telemetry just gets richer and more valuable the more key-value pairs you add! Nor are you limited to numbers; you can add any and all types of data, including valuable high-cardinality fields like “App Id” or “Full Name.”
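For example, with OpenTelemetry’s Python API, each new “metric” is just one more key on the span you’re already emitting. The `user` and `cart` dicts and the attribute names here are illustrative, and with no SDK configured these calls are harmless no-ops:

```python
from opentelemetry import trace

def apply_discount(user: dict, cart: dict) -> None:
    # Instead of minting a new (billable) custom metric for every question,
    # attach the answer as one more key on the event you already pay for.
    span = trace.get_current_span()
    span.set_attribute("app.user_id", user["id"])            # high cardinality is fine
    span.set_attribute("app.user_name", user["full_name"])   # strings are fine too
    span.set_attribute("app.cart_total_cents", cart["total_cents"])
```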

Observability 2.0 has its own amplification factor to consider. As you instrument your code with more spans per request, the number of events you have to send (and pay for) goes up. However, you have some very powerful tools for dealing with this: you can perform dynamic head-based sampling or even tail-based sampling, where you decide whether or not to keep the event after it’s finished, allowing you to capture 100% of slow requests and other outliers.
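A toy sketch of that tail-sampling decision, with an illustrative `base_rate` knob (real samplers also record the sample rate so the backend can re-weight counts):

```python
import random

def keep_event(event: dict, base_rate: int = 20) -> bool:
    """Decide AFTER the request finishes whether to keep its event."""
    if event.get("error"):                   # keep 100% of errors
        return True
    if event.get("duration_ms", 0) > 1000:   # keep 100% of slow requests
        return True
    return random.randrange(base_rate) == 0  # keep ~1 in 20 of the rest

keep_event({"duration_ms": 2300.0})  # -> True: slow outlier, always kept
```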

Engineering time is your most precious resource

But the biggest difference between observability 1.0 and 2.0 won’t show up on any invoice. The difference shows up in your engineering team’s ability to move quickly, with confidence.

Modern software engineering is all about hooking up fast feedback loops. And observability 2.0 tooling is what unlocks the kind of fine-grained, exploratory experience you need in order to accelerate those feedback loops.

Where observability 1.0 is about MTTR, MTTD, reliability, and operating software, observability 2.0 is what underpins the entire software development lifecycle, setting the bar for how swiftly you can build and ship software, find problems, and iterate on them. Observability 2.0 is about being in conversation with your code, understanding each user’s experience, and building the right things.

Observability 2.0 isn’t exactly cheap either, although it is often less expensive. But the key difference between o11y 1.0 and o11y 2.0 has never been that either is cheap; it’s that with observability 2.0, when your bill goes up, the value you derive from your telemetry goes up too. You pay more money, you get more out of your tools. As you should.

Interested in learning more? We’ve written at length about the technical prerequisites for observability with a single source of truth (“observability 2.0” as we’ve called it here). Honeycomb was built to this spec; ServiceNow (formerly Lightstep) and Baselime are other vendors that qualify. Click here to get a Honeycomb demo.

CORRECTION: The original version of this document said that “nothing connects any of these tools.” If you are using a single unified vendor for your metrics, logging, APM, RUM, and tracing tools, this is not strictly true. Vendors like New Relic or Datadog now let you define certain links between your traces and metrics, which allows you to correlate between data types in a few limited, predefined ways. This is better than nothing! But it’s very different from the kind of fluid, open-ended correlation capabilities that we describe with o11y 2.0. With o11y 2.0, you can slice and dice, break down, and group by your complex data sets, then grab a trace that matches any specific set of criteria at any level of granularity. With o11y 1.0, you can define a metric up front, then grab a random exemplar of that metric, and that’s it. All the limitations of metrics still apply; you can’t correlate any metric with any other metric from that request, app, user, etc, and you certainly can’t trace arbitrary criteria. But you’re right, it’s not nothing. 😸


Questionable Advice: “My boss says we don’t need any engineering managers. Is he right?”

I recently joined a startup to run an engineering org of about 40 engineers. My title is VP Engineering. However, I have been having lots of ongoing conflict with the CEO (a former engineer) around whether or not I am allowed to have or hire any dedicated engineering managers. Right now, the engineers are clustered into small teams of 3-4, each of which has a lead engineer — someone who leads the group, but whose primary responsibility is still writing code and shipping product.

I have headcount to hire more engineers in the coming year, but no managers. My boss says we are a startup and can’t afford such luxuries. It seems obvious to me that we need engineering managers, but to him, it seems just as obvious that managers are unnecessary overhead and that all hands should be on deck writing code at our stage.

I don’t know how to make that argument. It seems so obvious to me that I actually struggle to put it into words or make the case for why we should hire EMs. Help?

— Unnecessary Overhead(?!?)

Oh boy, there’s a lot to unpack here.

It is unsurprising to me that your CEO does not understand why managers exist, given that he does not seem to understand why organizational structures exist. 🙈 Why is he micromanaging how you are structuring your org or what roles you are allowed to fill? He hired you to do a job, and he’s not letting you do it. He can’t even explain why he isn’t letting you do it. This does not bode well.

But I do think it’s an interesting question. So let’s pretend he isn’t holding your ability to do your damn job hostage until you defend yourself to his satisfaction. 😒

I can think of two ways to make the case for engineering managers: one is rather complicated, arguing from first principles; the other is very simple, but perhaps unsatisfying.

I personally have a … vigorous … knee-jerk response to authority; I hate being told what to do. It’s only recently that I’ve found my way to an understanding of hierarchy that feels healthy and practical, and that was by looking at it through the lens of systems theory.

Why does hierarchy exist in organizations?

It makes sense that hierarchy comes with a lot of baggage. Many of us have had bad experiences with managers — indeed, with entire organizations — where hierarchy was used as a tool of oppression, where people rose up the leadership ranks by hoarding information and playing dominance games, and where decisions got made by pulling rank.

Working at a place like that fucking sucks. Who wants to invest their creativity and life force into a place that feels like a Dilbert cartoon, knowing how little it will be valued or reciprocated, and that it will slowly but surely get crushed out of you?

But hierarchy is not intrinsically authoritarian. Hierarchy did not originate as a political structure that humans invented for controlling and dominating one another; it is in fact a property of self-organizing systems, one that emerges for the benefit of the subsystems. Hierarchy is absolutely critical to the adaptability, resiliency, and scalability of complex systems.

Let’s start with a few basic facts about systems, for anyone who may be unfamiliar.

Hierarchy is a property of self-organizing systems

A system is “a network of interdependent components that work together to try to accomplish a common aim” (W. Edwards Deming). A pile of sand is not a system, but a car is a system; if you take out its gas tank, the car cannot achieve its aim.

A subsystem is a collection of elements with a smaller aim inside a larger system. There can be many levels of subsystems that operate interdependently. The subsystems always work to support the needs of the larger system; if the subsystem instead optimizes for its own best interests, the whole system can fail (this is where the term “suboptimize” comes from 😄).

A system is self-organizing if it has the ability to make itself more complex, by diversifying, adapting, and improving itself. As systems self-organize and their complexity increases, they tend to generate hierarchy — an arrangement of systems and subsystems. In a stable, resilient and efficient system, subsystems can largely take care of themselves, regulate themselves, and serve the needs of the larger system, while the larger system coordinates between subsystems and helps them perform better.

Hierarchy minimizes the costs of coordination and reduces the amount of information that any given part of the system has to keep track of, preventing information overload. Information transfer and relationships within a subsystem are much more dense and have fewer delays than information transfer or relationships between subsystems.

(This should all sound pretty familiar to any software engineer. Modularization, amirite?? 😍)

Applying this definition, we can say that a manager’s job is to coordinate between teams and help their team perform better.

The false binary of sociotechnical systems

You’ve probably heard this canard: “Engineers do the technical work, managers do the people work.” I hate it. ☺️ I think it misconstrues the fundamental nature of sociotechnical systems. The “socio” and “technical” of sociotechnical systems are not neatly separable; they are interwoven and interdependent. There is actually precious little that is purely technical work or purely people work; there is a metric shitload of glue work that draws upon both skill sets.

Consider a very partial list of tasks done by any functional engineering org, besides writing code:

  • Recruiting, networking, interviewing, training interviewers, synthesizing feedback, writing job descriptions and career ladders
  • Project management for each project or commitment, prioritizing backlog, managing stakeholders and resolving conflicts, estimating size and scope, running retrospectives
  • Running team meetings, having 1x1s, giving continuous growth feedback, writing reviews, representing the team’s needs
  • Architecture, code review, refactoring; capturing DORA and productivity metrics, managing alert volume to prevent burnout

A lot of this work can be done by engineers, and often is. Every company distributes the load somewhat differently. This is a good thing! You don’t WANT an org where this work is only done by managers. You want individual contributors to help co-create the org and have a stake in how it gets run. Almost all of this work would be done more effectively by someone with an engineering background.

So you can understand why someone might hesitate to spend valuable headcount on engineering managers. Why wouldn’t you want everyone in engineering to be writing and shipping code as their primary job? Isn’t that by definition the best way to maximize productivity?

Ehhh… 😉

Engineering managers are a useful abstraction

In theory, you could make a list of all the tasks that need to be done to coordinate with other teams and have each item picked up by a different person. In practice, this falls apart, because then everybody would need to know about everything. One of the primary benefits of hierarchy, remember, is reducing information overload. Intra-team communication should be high-bandwidth and fast; inter-team communication should be sparser.

As the company scales, you can’t expect everybody to know everyone else; we need abstractions in order to function. A manager is the point of contact and representative for their team, and they serve as routers for important information.

I sometimes imagine managers as the nervous system of the company body, carrying around messages from one limb to another to coordinate actions. Centralizing many or most of these functions into one person lets you take advantage of specialization, as a manager builds relationships and context and improves at their role, and this massively reduces context switching for everyone else.

Manager calendars vs maker calendars

Engineering labor takes concentration and focus. Context switching is expensive, and too many interrupts can be fatal. Management labor consists of context switching every hour or so and being available for interruptions throughout the day. These are two very different modes of being, with different headspaces and calendar schedules, and they do not coexist well.

In general, you want people to be able to spend most of their time working on things that contribute to the success of the outcomes they are directly responsible for. Engineers can only do so much glue work before their calendar turns into Swiss cheese and they can no longer deliver on their commitments. Since managers’ calendars are already Swiss cheese, it’s typically less disruptive for them to take on a larger share of glue labor.

It isn’t up to managers to do all the glue work, but it is a manager’s job to make sure that everything that needs to get done, does get done. It is a manager’s job to try to line up every engineer with work that is interesting and challenging, but not overwhelming, and to ensure that unpleasant labor gets equitably distributed. It’s also a manager’s job to make sure that if we are asking someone to do a job, they are equipped with the resources they need to succeed at that job. Including time to focus.

Management is a tool for accountability

When you’re an engineer, you are responsible for the software you develop, deploy, and maintain. When you’re a manager, you are responsible for your team and the organization as a whole.

Management is one way of holding people accountable for specific outcomes (building teams with the right skills, relationships, and processes to make good decisions and build value for the company), and equipping them with the resources (budget, tools, headcount) to achieve those outcomes. If building the organization isn’t someone’s number one job, it won’t be anyone’s number one job, which means it probably won’t get done very well. And whose responsibility will that be, Mr. CEO?

There’s a real upper limit to what you can reasonably expect tech leads, or engineers, or anyone whose actual job is shipping software to do in their “spare time”. If you’re trying to hold your tech leads responsible for building healthy engineering teams, tools, and processes, you are asking them to do two calendarily incompatible jobs with only one calendar. The likeliest scenario is that they will focus on the outcomes they feel comfortable owning (the technical ones), while you pile up organizational debt in the background.

In natural hierarchies, we look up for purpose and down for function. That, in a nutshell, is the more complicated argument for why we need engineering managers.

Choose boring technology culture

The simpler argument is this: most engineering orgs have engineering managers. That’s the default. Lots of people much smarter than you or me have spent years thinking about and tinkering with org structures, and this is what we’ve got.

As Dan McKinley famously said, we should “choose boring technology”. Boring doesn’t mean bad; it means the capabilities and failure conditions are well understood. You only ever get a few innovation tokens, so you should spend them wisely, on core differentiators that could make or break your business. The same goes for culture. Do you really want to spend one of your tokens on org structure? Why??

For better or for worse, the hierarchical org structure is well understood. There are plenty of people on the job market who are proficient at managing or working with managers, and you can hire them. You can get training, coaching, or read a lot of self-help books. There are various management philosophies you can coalesce around or use to rule people out. On the other hand, the manager-free experiments I’m aware of (e.g. holacracy at Medium and GitHub, or “Choose Your Own Work” at Linden Lab) have all been quietly abandoned or outgrown. Not, in my experience, because leaders went mad for power, but due to chaos, lack of focus, and poor execution.

When there is no explicit structure or hierarchy, the result is not freedom and egalitarianism; it’s “informal, unacknowledged, and unaccountable leadership”, as famously detailed in “The Tyranny of Structurelessness”. In reality, sadly, these teams tend to be chaotic, fragile, and frustrating. I know! I’m pissed too! 😭

This argument doesn’t necessarily prove your CEO is wrong, but I should think the bar for proof is much higher for him than for you. “I don’t want any of my engineers to stop writing code” is not an argument. But I’m also feeling like I haven’t quite addressed the core question of productivity, so let’s pick that up once more.

More lines of code != more productivity

To briefly recap: we were talking about an org with ~40 engineers, broken up into 10 small clusters of 3-4 engineers, each with a tech lead. Your CEO is arguing that you can’t afford to lose any velocity, which he thinks is what would happen if anyone stops writing code full time.

Maybe. But everything I have ever experienced leads me to believe that a smaller number of larger teams, each helmed by an experienced engineering manager, should way outperform this gaggle of tiny groups. It’s not even close. And they can do so in a way that’s more efficient, sustainable, and humane than this scrappy death march.

And systems thinking shows us why! With fewer, larger groups, you have less overall management overhead and much less of the slow and costly inter-group coordination. You unlock rich, dense knowledge transfer within groups, which gives you more shared coverage of the surface area. With 7-9 engineers per group, you can build a real on-call rotation, which means fewer heroics and less burnout. The coordination you do need can be more strategic, less tactical, and much more forward-looking.

Would five big teams ship as many lines of code as 10 small teams, even if five engineers become managers and stop writing code? Probably, but who cares? Your customers give zero fucks how many lines of code you write. They care about whether you are building the right things and solving problems that matter to them. What matters is moving the business forward, not churning out code. Don’t forget, the act of churning out code creates costs and externalities in and of itself.

What defines your velocity is that you spend your time on the right things. Learning to make good decisions about what to build is something every organization has to work out for itself, and it is always an ongoing work in progress. Engineering managers don’t do all the work or make all the decisions, but they are absolutely fucking vital, in my experience, to ensuring that work happens and is done well. As I wrote in my last piece, engineering managers are the embodiment of the feedback loops that systems use to learn and improve.

Are managers ever unnecessary overhead?

Sure, absolutely. Management is about coordinating between teams and helping teams run more optimally, so anything that decreases your need for coordination also decreases your need for management. If you are a small company, or if you have really senior folks who are used to working together, you need a lot less coordination. The next most relevant factor is probably the rate of change; if you’re growing fast or have a lot of turnover, or if there’s a lot of time pressure or frequent shifts in strategy, your need for managers goes up. But there are plenty of smaller orgs out there that are doing just fine without a lot of formal management.

Look, I’m not a fan of the word “overhead”, because a) it’s kind of rude and b) people who call managers “overhead” are typically people who disrespect or do not value the craft of management.

But management is, in fact, overhead. 😅 So is a lot of other glue work! By which I mean the work is important, but does not itself move the business forward; we should do as much of it as absolutely necessary and no more. The nature of glue work is such that it too easily expands to consume all available time and space (and then some). Constraints are good. Feeling a bit under-resourced is good, and should be the norm. It is incredibly easy for management to get a bit bloated, and managers can be very loath to acknowledge this, because it’s not like they ever feel any less stressed or stretched.[*]

Management is also very much like operations work in that when it’s being done well, it’s invisible. Evaluating managers can be very hard, especially in the near term, and making decisions about when it’s time to create or pay down organizational debt is a whole nother ball of wax, and way outside the scope of this post.

But yes, managers can absolutely be unnecessary overhead.

However, if you have 40 engineers all reporting to one VP, and nobody else whose number one job is the outcomes related to people, teams and org, I feel pretty safe in saying this is not a risk for you at this time.

<3 charity

[*] In fact, the reverse can be true; bloated management can create MORE work for managers, and they may counterintuitively feel LESS stretched or stressed with a leaner org chart. Bureaucracies do tend to pick up a momentum all their own. Especially when management gets all wrapped up in promotions and egos. Which is yet another good reason to ensure that management is not a promotion or a form of domination.
