How Much Should My Observability Stack Cost?

First posted on 2021-08-18 at https://www.honeycomb.io/blog/how-much-should-my-observability-stack-cost

What should one pay for observability? What should your observability stack cost? What should be in your observability stack?

How much observability is enough? How much is too much, or is there such a thing?

Is it better to pay for one product that claims (dubiously) to do everything, or twenty products that are each optimized to do a different part of the problem super well?

It’s almost enough to make a busy engineer say “Screw it, I’m spinning up Nagios”.

(Hey, I said almost.)

All of these service providers can give you sticker shock when you begin investigating them. The biggest reason is always that we aren’t used to considering the price of our own time. We act like it’s “free” to just take an hour and spin something up, and we don’t count the cost of maintenance, context switching, and the opportunity cost of not using that time to build something of business value. Which is both understandable and forgivable, as a starting point.

Considerably less forgivable is the vagueness–and sometimes outright misdirection and scare tactics–some vendors offer around pricing. It’s not ok for a business to optimize for revenue at the expense of user experience. As users, we have the right to demand transparency and accurate information.  As vendors, we have the responsibility to provide it.  Any pricing scheme that doesn’t align with best practices and users’ interests will be a drag on reputation and growth.

The core question, rarely addressed outright, is: how much should you pay? In this post I’ll talk about what your observability costs include, and in the next post, what you should consider including in your “observability stack”.

But I’ll give you the answer to your question right off the bat: you should probably spend 20-30% of infra costs on observability.

O11y spend should be 20-30% of infra spend

Rule of thumb: your observability spend should come to 20-30% of your infra spend. (I’ve seen 10% a few times from reasonable-seeming shops, but they have been edge cases and outliers. I have also seen 50% or more, but again, outliers.)

Full disclosure: this isn’t based on any particular science.  It’s just based on my experience of 15+ years working in operations engineering, talking to other engineers and managers, and a couple of informal Twitter polls to satisfy my own curiosity.

Nevertheless, it’s a pretty solid rule. There are exceptions, but in general, if you’re spending less than 20%, you’re “saving money” at the expense of engineering time, or being silently dragged underwater by a million little time leaks and quality of service issues — which you could eliminate completely with a bit of investment.
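To put numbers on it, here's a quick sketch in Python (the infra bill is made up; plug in your own):

```python
def o11y_budget_range(monthly_infra_spend, low_pct=0.20, high_pct=0.30):
    """Monthly observability budget implied by the 20-30% rule of thumb."""
    return monthly_infra_spend * low_pct, monthly_infra_spend * high_pct

# Hypothetical shop spending $150k/month on cloud infrastructure.
low, high = o11y_budget_range(150_000)
print(f"observability budget: ${low:,.0f}-${high:,.0f}/month")
# -> observability budget: $30,000-$45,000/month
```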

Consider the person who told me proudly that his o11y spend was just 1-3%. (He meant the PagerDuty bill and Pingdom checks, actually.) He wasn’t counting the dedicated hardware for their ELK cluster ($80k/month), or the 2-3 extra engineers they had to recruit, hire, and train ($250-300k/year apiece) to run the many open source tools they got for “free”.

And ultimately, it didn’t meet their needs very well. Few people knew how to use it, so they leaned on the “observability team” to craft custom views, write scripts and ETL one-offs, and serve as the institutional hive mind and software usability tutors.  They could have used better tools, ones under active development by large product teams.  They could have used that headcount to create core business value instead.

Engineers cost money

Engineers are expensive. Recruiting them is hard. The good ones are increasingly unwilling to waste time on unnecessary labor. This manager was “saving” maybe a million dollars a year (he mentioned a vendor quote of less than $100k/month), but spending a couple million more than that in less-visible ways.
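For the curious, here's that anecdote as back-of-the-envelope math in Python, using only the rough figures quoted above (illustrative, not an invoice):

```python
# Rough annualized figures from the anecdote above; every number is approximate.
elk_hardware = 80_000 * 12       # dedicated ELK cluster hardware, per year
extra_engineers = 2.5 * 275_000  # 2-3 engineers at roughly $250-300k/year apiece
self_hosted_total = elk_hardware + extra_engineers

vendor_quote = 100_000 * 12      # "less than 100k/month", taken at face value

print(f"self-hosted 'free' stack: ~${self_hosted_total:,.0f}/year")
print(f"vendor quote:             ~${vendor_quote:,.0f}/year")
# self-hosted 'free' stack: ~$1,647,500/year
# vendor quote:             ~$1,200,000/year
```

And that only counts the visible line items. The invisible costs (everyone else's lost engineering cycles, the usability hand-holding, the opportunity cost of that headcount) are what push the real gap to "a couple million more."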

Worse, he was driving his engineering org into the ground by wasting so much of their time and energy on non-mission-critical work, inferior tooling, one-offs, frustrating maintenance work, etc., all of which had nothing to do with their core business value.

If you want to know if an org hires and retains good engineers, you could do worse than to ask the question: “What tools do you use, and why?”

  • Good orgs use good tools. They know engineering cycles are their scarcest and most valuable resource, and they want to train maximum firepower on their core business problems.
  • Mediocre orgs use mediocre tools, have no discipline or consistency around adoption and deprecation, and leak lost engineering cycles everywhere.

So back to our rule of thumb: observability amounting to 20-30% of infra spend is where most shops should fall. This refers to cloud-native infrastructure, using third-party services to instrument and monitor code, with the basics covered — resource utilization graphs, end-to-end checks, paging, etc.

So, what do I need in my “observability stack”?

What are the basics? Well, obviously “it depends”. It depends on your requirements, your components, your commitments, your budget, sunk costs and skill sets, your teams, and most expensive of all — customer expectations and the cost of violating them. You should think carefully about these things and try to draw a straight line from the business case to the money you spend (or don’t spend). And don’t forget to factor in those invisible human costs.

How much is your fear of continuous deployment costing you?

Most people aren’t doing true CI/CD. Most teams wait far too long to get their code into prod after writing it. Most painful of all are the teams who have done all the hard parts — wired up continuous integration, achieved test coverage, etc. — but still deploy by hand, thus depriving themselves of the payoff for their hard work.

Any time an engineer merges a diff back to main, this ought to trigger a full run of your CI/CD pipeline, culminating with an automatic deploy to production. This should happen once per mergeset, never batching multiple engineers’ diffs in a run, and it should be over and done with in 15 minutes or less with no human intervention.
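If you want to know where you stand against that bar today, pull merge and deploy timestamps out of whatever CI/CD system you use and do the subtraction. A minimal sketch, with hypothetical timestamps:

```python
from datetime import datetime, timedelta

# Hypothetical (merged, deployed) timestamp pairs pulled from your CI/CD system.
deploys = [
    (datetime(2021, 8, 18, 10, 2), datetime(2021, 8, 18, 10, 14)),  # 12 minutes
    (datetime(2021, 8, 18, 11, 30), datetime(2021, 8, 18, 13, 5)),  # 95 minutes
]

TARGET = timedelta(minutes=15)

for merged, deployed in deploys:
    lead_time = deployed - merged
    verdict = "ok" if lead_time <= TARGET else "too slow"
    print(f"merge -> prod: {lead_time} ({verdict})")
```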

It’s 2021, and everyone should know this by now.

✨✨15 minutes or bust✨✨

Okay, but what if you don’t? How costly can it be, really?

Let’s do some back-of-the-envelope math. First, you’ll need to answer a couple of questions about your org and your deploy pipeline.

  • How many engineers do you have? ____________
  • How long typically elapses between when someone writes code and that code is live in production? _____________

Let (n) be the number of engineers it takes to efficiently build and run your product, assuming each set of changes will autodeploy individually in <15 min.

  • If changes typically ship on the order of hours, you need 2(n).
  • If changes ship on the order of days, you need 2(2(n)).
  • If changes ship on the order of weeks, you need 2(2(2(n))).
  • If changes ship on the order of months, you need 2(2(2(2(n)))).

Your 6-person team with a consistent autodeploy loop would take 24 people to do the same amount of work if it took days to deploy their changes. Your 10-person team that ships in weeks would need 80 people.

At a cost to the company of approximately $200k per engineer, that’s an extra $3.6 million per year in the first example and $14 million in the second. That’s how much your neglect of internal tools and knee-jerk fear of autodeploy might be costing you.
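Here's that back-of-the-envelope math as code, reproducing both examples (a sketch of the rule of thumb, not a scientific model):

```python
# Each step of deploy latency roughly doubles the headcount you need:
# <15 minutes -> hours -> days -> weeks -> months.
MULTIPLIER = {"15min": 1, "hours": 2, "days": 4, "weeks": 8, "months": 16}
COST_PER_ENGINEER = 200_000  # rough fully loaded annual cost, per the estimate above

def team_size_and_extra_cost(n, latency):
    """Engineers needed at a given deploy latency, plus what the extra ones cost per year."""
    needed = n * MULTIPLIER[latency]
    return needed, (needed - n) * COST_PER_ENGINEER

print(team_size_and_extra_cost(6, "days"))    # -> (24, 3600000): $3.6M/year extra
print(team_size_and_extra_cost(10, "weeks"))  # -> (80, 14000000): $14M/year extra
```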

It’s not just about engineers. The more delay you add into the process of building and shipping code, the more pathologies multiply, and you find yourselves needing to spend more and more time addressing those pathologies instead of making forward progress for the business. Longer diffs. Manual deploy processes. Bunching up multiple engineers’ diffs in a single deploy, then spending the rest of the day trying to figure out which one was at fault for the error.

Soon you need an SRE team to handle your reliability issues, build engineering specialists to build internal tools for all these engineers, managers to manage the teams, product folks to own the roadmap and project managers to coordinate all this blocking and waiting on each other…

You could have just fixed your build process. You could have just committed to continuous delivery. You would be moving more swiftly and confidently as a small, killer team than you ever could at your lumbering size.

✨✨15 minutes or bust✨✨

In 2021, how will *you* achieve the dream of CI/CD, and liberate your engineers from the shackles of pointless toil?

P.S. if you want to know my methodology for coming up with this equation, it’s called “pulled out of my ass because it sounded about right, then checked with about a dozen other technical folks to see if it aligned with their experience.”