The Truth About “MEH-TRICS”

First published on 2022-04-13 at https://www.honeycomb.io/blog/truth-about-meh-trics-metrics

A long time ago, in a galaxy far, far away, I said a lot of inflammatory things about metrics.

“Metrics are shit salad.”

“Metrics are simply nerfed dimensions.”

“Metrics suck,” “metrics are legacy,” “metrics and time series aggregates will fucking kneecap you.”

I cannot tell a lie; Twitter will testify that I’ve spent the past six years ragging on metrics. So much so that ever since we launched Honeycomb Metrics last year, our poor solution architects have been encountering skeptics in the field who repeat my quotes back to them and ask, dubiously, whether Honeycomb Metrics are any good or not, and whether we genuinely plan on investing in it or not, given our known anti-metrics sympathies.

That’s a great question. 😊

Metrics aren’t worthless; they’re just limited.

Metrics are a mature technology that’s been around for over 30 years, and they have some real advantages. They’re tiny, fast, and cheap; you can hold a bunch of them in memory as counters, summaries, and gauges. They aggregate well and take up a fixed amount of storage space. The entire monitoring industry is built on top of metrics.
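
To make “tiny, fast, and cheap” concrete, here’s a rough sketch (in Python, not any particular metrics library’s API) of what a counter and a gauge boil down to: a handful of numbers of fixed size, no matter how much traffic passes through them.

    # A minimal sketch of in-memory metrics (illustrative, not a real library's API).
    # A counter and a gauge each occupy a fixed, tiny amount of memory no matter
    # how many observations flow through them, which is why metrics are so cheap.

    class Counter:
        def __init__(self):
            self.value = 0              # one number, forever

        def inc(self, n=1):
            self.value += n

    class Gauge:
        def __init__(self):
            self.value = 0.0            # only the most recent observation

        def set(self, v):
            self.value = v

    http_requests_total = Counter()
    chassis_fan_rpm = Gauge()

    for _ in range(1_000_000):          # a million requests later...
        http_requests_total.inc()
    chassis_fan_rpm.set(2200.0)

    # ...the storage footprint is still just two numbers.
    print(http_requests_total.value, chassis_fan_rpm.value)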

When it comes to workloads like, “How heavy is the write load on my hard drive?” or “What is the temperature or fan status inside my chassis?” or “What is the traffic rate in and out of this interface on my switch?”  metrics are what you should use. In fact, pretty much any time you want to know the health of a system or component in toto, metrics are the right tool.

Because that’s what metrics do best—report statistics in aggregate, from the perspective of any system or component. They can tell you that your Ruby HTTP worker pool is 70% utilized or that your nginx webserver is returning 502s 1% of the time. What they can’t tell you is what this means for any one of your users, applications, delivery vehicles, and so forth.

Until recently, metrics-based tools or logs were the only game in town. People were trying to sell us metrics tools for observability use cases, and that’s what got my goat so badly. If you simply append “… for observability” to each of my inflammatory statements, then I stand by them completely.

“Metrics are shit salad … for observability.”

Yup, rings true.

You’re never going to make a metrics tool like Prometheus or Datadog into an observability tool. You’re just not. Observability is about unknown-unknowns, while metrics are a tool for known-unknowns.

If you need a refresher on the differences between observability and monitoring, I’ll refer you to pieces like this, this, and this. What I want to talk about here is slightly different. In a post-observability world, what is the true and proper place for metrics tooling?

Metrics and observability have different use cases.

Metrics aren’t completely useless, even if you have a robust observability presence. We still use metrics at Honeycomb to this day for certain workloads—and always will because they’re the right tool for the job.

There are two kinds of workloads, roughly speaking: your code—the code you write, review, ship, debug and maintain on a daily basis. And other people’s code—the code you have to run and use in order to support your code. Some examples of the latter might be: Linux, Docker, MySQL, Amazon RDS, Kafka, AWS Lambda, GCP gateways, memcache, CI/CD pipelines, Kubernetes, etc.

Your code is your crown jewels, the code you need to survive and succeed as a business. It changes constantly—many times per week, if not per day. You are expected to understand its inner workings intimately, and spend lots of time chasing down bugs or understanding and reproducing behavior. You care about the way it performs and interacts with each and every individual user, with changing infrastructure state, and under a variety of different load conditions.

That is why your code demands observability. In order to understand your software, you must first instrument it, in a way that collects lots of rich context and bundles it up around each event end-to-end. Then you need to stream those events into a tool that lets you slice and dice and trace and explore with support for high-cardinality and high-dimensionality data. That’s the only way you’re going to be able to correlate errors, track down outliers, and reflect each user’s experience.
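
As a rough sketch of what that can look like in practice (field names here are illustrative, not any particular SDK’s schema): build one arbitrarily wide, structured event per request, stuff it with context as the request executes, and emit it at the end.

    import json, time, uuid

    # One wide, structured event per request (a sketch; field names are illustrative,
    # not any particular SDK's schema).
    def handle_request(request, user):
        start = time.time()
        event = {
            "timestamp": start,
            "trace_id": str(uuid.uuid4()),
            "service": "shipping-api",           # hypothetical service name
            "endpoint": request["path"],
            "user_id": user["id"],               # high-cardinality fields are welcome here
            "user_plan": user["plan"],
            "build_id": "abc123",                # which deploy produced this behavior?
            "status": 200,
        }
        try:
            pass                                 # ...actual request handling goes here...
        except Exception as exc:
            event["status"] = 500
            event["error"] = repr(exc)
            raise
        finally:
            event["duration_ms"] = (time.time() - start) * 1000
            print(json.dumps(event))             # ship it to your event store / o11y tool

    handle_request({"path": "/api/ship"}, {"id": "user-42", "plan": "pro"})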

But what about the rest of the software? You can’t instrument Amazon RDS, and only crazy people would instrument, rebuild, and repackage things like Kafka or Docker or nginx. The whole point of third-party software is that you DON’T USE IT until it’s stable enough to be taken more or less for granted. Sure, you roll updates, but usually on the order of months or years—not every day. You don’t need to be intimately familiar with its inner workings because you aren’t changing it every day. Those aren’t your crown jewels.

You do care about their health though, only differently. You care about whether you need to provision more capacity or not. You care about knowing how hard you’re hammering on the underlying hardware or hypervisor. That’s why metrics and monitoring are the right tools to use for third-party code. They don’t let you peer under the hood in the same way, or slice and dice in the same way, but that’s okay. You shouldn’t have to.

With third-party stuff, you don’t care about the code, you care about the health of the service. In aggregate.

(There are some kinds of in-between software, like databases, where event-level information is super useful for debugging things like slow queries and lock percentages, and you can use various black box techniques to approximate observability without instrumentation. But in general this model holds up quite well.)

In a post-observability world, what are metrics for?

I’ve often pointed out that observability is built on top of arbitrarily wide structured data blobs, and that metrics, logs, and traces can be derived from those blobs while the reverse is not true—you can’t take a bunch of metrics and reformulate a rich event.

And yes, people who have observability typically find themselves using metrics and dashboards less and less. They’re simply not as versatile or useful as events that you can slice and dice and manipulate in infinite ways. And you can derive aggregates and trends from the events you have stored.
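
For example, here is a sketch (assuming events shaped like the one in the instrumentation example above) of how the usual aggregates fall right out of the raw events, while the raw events can still answer follow-up questions that a pre-aggregated counter never could.

    import statistics

    # Deriving the usual aggregates from stored wide events (a sketch; "events" is
    # a list of dicts shaped like the instrumentation example above).
    def derive_aggregates(events):
        durations = [e["duration_ms"] for e in events]
        errors = [e for e in events if e["status"] >= 500]
        return {
            "request_count": len(events),
            "error_rate": len(errors) / len(events),
            "p99_duration_ms": statistics.quantiles(durations, n=100)[98],
        }

    # Because the raw events are still there, you can keep slicing:
    def slowest_users(events, threshold_ms=1000):
        return sorted({e["user_id"] for e in events if e["duration_ms"] > threshold_ms})

    sample = [
        {"user_id": "user-42", "status": 200, "duration_ms": 31.0},
        {"user_id": "user-7",  "status": 500, "duration_ms": 1430.0},
        {"user_id": "user-13", "status": 200, "duration_ms": 57.0},
    ]
    print(derive_aggregates(sample))
    print(slowest_users(sample))
    # A counter that only ever stored "error rate: 1%" can answer neither question.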

But metrics will always be useful for understanding third-party software, from the perspective of the service, cluster, or node. They will always be the right tool for the job when it comes to software interfacing with hardware. And they can be super complementary when you are investigating your code using events and instrumentation.

If you’re an engineer writing and shipping code, you’re never not going to want to know if your change caused memory usage to triple, or CPU utilization to skyrocket, or disk usage or network throughput to saturate. That’s why we built Honeycomb Metrics as an overlay, a way to enhance or validate your understanding of the impact your code changes have had on the underlying system.

Metrics are also valuable as a bridge to the past. People have been instrumenting software for metrics for 30 years—they’re never going away completely, and not everything can or should be reinstrumented with events. Lots of people already have robust monitoring systems that slurp in millions of metrics. Nobody wants to have to redo all that work just because they’re moving to a different tool, so people tend to point their metrics firehose at Honeycomb as a way of getting started as they roll observability out into their code.

Twin Anxieties of the Engineer/Manager Pendulum

I have written a lot about the pendulum swing between engineering and management, so I often hear from people who are angsting about the transition.

A quick recap of the relevant posts:

There are two anxieties I hear people express above all the rest.

The first one is something I hear over and over again, particularly from first-time managers as they contemplate the possibility of leaving management and returning to IC (individual contributor) work as an engineer:

“What if I never get another shot at people management?”
“Maybe this is the only chance I’ll ever get … and I’m about to give it up??”
“Am I going to regret this?”

“Will I ever get another shot at management?”

People decide to go back to engineering for lots of reasons. Maybe they’re burned out, or they work someplace with a poisonous management culture, or they’re having a kid and want to return to a role that feels more comfortable for a while. Or maybe they’ve been managing teams for a few years now, and have decided it’s time to go back to the well and refresh their technical skills in the interest of their long-term employability.

Regardless, these are not typically people who disliked being a manager. Rather they tend to be engineers who really enjoyed people management, and find it bittersweet to give up. Maybe they will miss the strategic elements and roadmap work, but they’re excited to clear their calendar and spend time in flow again, or they will miss having 1:1s but can’t wait to have time to mentor people. Whatever. They want to manage teams again someday, and worry they won’t get another chance.

Their anxiety is understandable! Lots of people feel like they waited a long time to be tapped for management, or like they were passed over again and again. Our cultural scripts about management definitely contribute to this sense of scarcity and diminution of agency (i.e. that management is a promotion, it is bestowed on you by your “superiors” as a reward for your performance, and it is pushy or improper to openly seek the role for yourself).

This anxiety is also, in my experience, ridiculously misplaced. ☺️

Once a manager, marked for life as a manager

You may have struggled to get your first opportunity to manage a team. But it’s a whole different story once you’ve done the job. Now you have the skills and the experience, and people can smell it on you.

I’m not joking. If you’re a good manager it’s actually nearly impossible to hide that you have the skills, because of the way it infuses your work and everything that you do as an IC. You get better at prioritization, more attuned to the needs of the business, and restless about work that doesn’t materially move the business forward. You get better at asking questions about why things need to be done and at communicating with stakeholders. You get better at motivating the people you work with, understanding their motivations and your own, and mediating conflicts or putting a damper on drama between peers. People come to you for advice and may seem to just do what you say, or go where you point.

Senior engineers with management experience are worth their weight in gold. They are valuable contributors and influential teammates. It’s a palpable shift! And every experienced manager in their vicinity will sense it.

So yes, you will be tapped for management again. And again and again and again. You are more likely to spend the rest of your career fending off management “opportunities” with a baseball bat than you are to wither away, pining for another shot.

There is a chronic shortage of good engineering managers, just like there is a chronic shortage of good, empathetic managers in every line of work. The challenge you will face from now on will not be about getting the chance to manage a team, but about being intentional and firm in carving out the time you need to recover and recharge your skills as an engineer.

“Am I too rusty to go back to engineering?”

The second anxiety is in some ways a mirror of the first:

“Can I still perform as an engineer?”
“Will anyone hire me for an engineering role?”
“Has it been too long, am I too rusty, will I be able to pull my weight?”

This is a more materially valid concern than the first one, in my opinion. Your engineering skills do wither and erode as time goes on. It will take longer and longer to refresh your skills the longer you go without using them. Management skills don’t decay in the same way that technical ones do, nor do they go out of date every few years as languages, frameworks and technologies tend to do.

If you aren’t interested in climbing the ladder and becoming a director or VP — or rather, if you aren’t actively, successfully climbing the ladder — you should have a strategy for keeping your hands-on skills sharp, because your ability to be a strong line manager is grounded in your own engineering skills.

Never, ever accept a managerial role until you are already solidly senior as an engineer. To me this means seven years or more writing and shipping code; definitely, absolutely no less than five. It may feel like a compliment when someone offers you the job of manager — hell, take the compliment 🙃 — but they are not doing you any favors when it comes to your career or your ability to be effective.

When you accept your first manager job, I think you should make a commitment to yourself to stick it out for two years. That’s how long it takes to rewire your instincts and synapses, to learn enough that you can tell whether you’re doing a good job or not.

After two or three years of management, it’s still pretty easy to go back to engineering. After five years, it gets progressively harder. But it can be done. And it should be worth it to your employer to invest in keeping you while you refresh your skills over the six months or whatever it may take. Insist on it, if you must. It’s better to refresh your skills while employed, on a system and codebase you’re familiar with, than to find yourself struggling to brush up enough to pass a coding interview.

Engineering fluency == job security

There is one more reason to refresh your engineering skills from time to time, one I don’t often see mentioned, and that is job security and optionality.

The higher you go up the ladder, the more money you will get paid … but the fewer jobs there are, and the fewer still that match your profile.

As a senior software engineer, there are fifteen bajillion job openings for you. Everyone wants to hire you. You can get a new job in a matter of days, no matter how picky you want to be about location, flexibility, technologies, product types, whatever. You’ve reached Peak Hire.

If you are looking for management roles, there will be an order of magnitude fewer opportunities (and more idiosyncratic hiring criteria), but still plenty for the most part. But for every step up the ladder you go, the opportunities drop by another order of magnitude, and the scrutiny becomes much more intense and particular. If you’re looking for VP roles, it may take months to find a place you want to work at, and then they might not choose you. ¯\_(ツ)_/¯

Maintaining your technical chops is a stellar way to hedge against uncertainties and maintain your optionality.

 

How can you tell if the company you’re interviewing with is rotten on the inside?

How can you tell the companies who are earnestly trying to improve apart from the ones who sound all polished and healthy from the outside, whilst rotting on the inside?

This seems to be on a lot of minds right now, what with the Great Resignation and all. There are no perfect companies, just like there are no perfect relationships; but there are many questions and techniques you can use to increase your confidence that a particular company is decent and self-aware, one whose quirks and foibles you are compatible with.

Interviews are designed to make you feel like you are under inspection, like the interviewer holds all the power. This is an illusion. Your labor is valuable — it is vital — and you should be scrutinizing them every bit as closely as they you. In fact, here is Tip #1:

  • If they allow you plenty of time to converse with your interviewers throughout the process, great. If they tack on a cursory “any questions for us?” while wrapping up, they don’t think it matters what you think of them. Pull the ripcord.

Collect and practice good interview questions for you to ask potential employers. Write them down — your mind is likely to go blank under stress, and you don’t want to let them off the hook. There is a LOT of signal to be gained by probing down below the surface answers.

Backchanneling

  1. Whisper networks and backchannels are incredibly important. It can be especially valuable to talk to someone who has recently left the company: why did they leave? Would they go there again?
  2. Alternately, do you know anyone who has worked for or with their leadership, even if not at that company?
  3. If you know any women or under-represented minorities (URM) who work there, buy them lunch and ask for the unvarnished truth. That’s where you usually turn up the real dirt. 🥂

Diversity, equity and inclusion

Just because a company has a diverse workforce doesn’t necessarily mean it is a healthy place to work. (But it’s fair to give some points up front, because that doesn’t usually happen by accident.)

  1. Do they have a diverse leadership team? A diverse board?
  2. Is their company diverse overall, or are minorities concentrated in a few (lower-paying, high-turnover) departments?
  3. You might not want to write off all the companies that don’t meet points one and two, if for no other reason than it dramatically shrinks your available option pool. If they don’t have a particularly diverse team, and this is something that matters to you, that’s your cue to dig deeper:
    • Are they bothered by their lack of diversity? What’s the plan? Do they just feel generically sad about it, or have they set specific goals to improve by specific dates? What investments are they making?
    • Who works on DEI stuff currently? (Answers like “HR and recruiting”, or “we have a woman who’s really good at it” are bad answers.)
    • Who is accountable for making sure those goals are hit? (The only right answer is “our execs”. Having a “chief diversity officer” is an anti-pattern in my book.)
    • If the team is all guys, for example, ask if they’ve ever had any women on the team in the past. Did she/they leave? Do they know why?
  4. This is a GREAT one: “As a white man, I’d ask what they’ve done to find qualified women and minorities for the role I’m interviewing for.” (via David Daly) 🔥🔥

Company stuff

  1. What are their values? Do they feel bloodless and ripped from the pages of HBR, or are they unique, lived-in, and give you a glimpse of what the people there care about? Are they mentioned over the course of your interview?
  2. Ask tough questions about the business and try to ascertain whether they are hitting their quarterly goals, how much funding they have in the bank, what the growth curve looks like, what users really think about their product, and what the biggest obstacles to success are.
    • Companies that are floundering are going to be really stressful places to work, and even if the leadership is decent, they may find themselves backed into making some really tough decisions.
    • You want to work at a company on a strong growth trajectory for lots of reasons, but a big one is your own growth potential. You will learn the most the fastest at places that are growing fast, and have way more openings for promotions and leadership roles than a slower-growing company.
  3. Are people willing to speak freely about things they’ve tried that have failed, and things that don’t work well currently? Being self-aware and comfortable with visible failure are two of the most important self-correcting mechanisms a company can cultivate.
  4. EVERYBODY thinks they value transparency, so I wouldn’t even bother asking. Instead, ask for specific examples of leadership being forthcoming with bad news to the team, and team members delivering hard feedback or bad news to upper leadership. Transparency shouldn’t be something they’re especially proud of, so much as it is taken for granted. It’s in the air that you breathe.

Planning and the unplanned

  1. Ask about how decisions get made. A chestnut is, “how does work end up on my plate?” — meaning is there a business strategy (owned by whom?), a technical strategy, a product strategy, quarterly KPIs, customer requests, manager delegation, JIRA tickets…? (The most important part may be how similar the answers you get are. 🙃)
  2. How often does work get pre-empted and why? It’s a good thing if product development has to get put on ice once in a while so the team can focus on reliability and maintenance work. It’s a bad thing if they’re expected to stuff reliability work in the cracks around their product development, or if they’re incapable of sticking to a plan.
  3. What does “crunch time” look like? Nearly every company has one from time to time (it might even be a bad sign if it never happens), but this is when you find out your leadership’s true colors.
    • Do they praise people or call them out to thank them for pulling all-nighters and other extremist behaviors? 🚨BZZT🚨
    • Is it voluntary? Are you trusted to set your own pace, your own limits, or are you pressured to do more? Are people expected to participate to the extent that they are able, and not expected to justify how hard or how much (so long as they communicate their capacity, of course)?
    • How long did it last, and how often does it happen, and why? It should be rare (1-2x/year at most), involve the whole company, and move the business forward meaningfully.
    • Did they follow through by making sure people took time off afterwards to recover? Not just give permission, but actually make sure the human beings had a chance to refresh themselves? Did leaders set a good example by taking a breather themselves?

Believe it or not, crunch time done correctly can be an enormously exciting, intense, bonding time for a group of people who love what they do, culminating in a surge of collective triumph and celebration, followed by recovery time. If it was done correctly, and you ask about it afterwards, people will 💡light up💡.

Team stuff

  1. Unfortunately, culture can vary widely from function to function, even from manager to manager. Make sure you get a real interview slot with your actual manager — not just a screener or wrap-up call — and as much of the team as possible, too.
  2. Ask your potential manager about the last person they had to let go. Why? What was the process? What was the impact on the team? How did the person feel afterwards?
  3. Who is on call? How often do people get paged outside of hours, and how frequently do they work an incident? (Do managers track this?) Are you expected to keep shipping product during on call weeks, or devote your time to making the system better?
  4. If you had to ship a single line of code to production using the deploy pipeline, how long would that take? Remember, the lower the deploy interval, the happier and more productive you are likely to be as an engineer there. Under 15 minutes is great. Under an hour is tolerable. More than that, proceed with great caution.

The interview itself

  1. Was your interview well-organized and conducted in a timely fashion? Were you given detailed information about what to expect, and were your interviewers well-prepared, and conversational? Were the questions fair, open-ended, and relevant to the job in question?
  2. If they asked you to perform any kind of take-home labor of more than an hour or so, did they compensate you for your time?
  3. Did they get back to you swiftly at each step of the way to let you know where you stand and what comes next?
  4. Did you find the questions interesting and challenging? Do they have a clear idea of what success looks like for you in this role? Did you leave excited and buzzing with ideas about the work you could do together?
    • This is 👆 definitely more of a “how good is this job” question than “is this a shithole” question, but one of our honeycombers brought it up as an example of how a great interview can make you decide to leave a job, even one you’re perfectly happy with.
  5. The questions they ask you while interviewing you are the questions they ask everyone else. So…did they ask you about your views on diversity and team dynamics while interviewing you? Or is that not part of their filters, only their advertised persona?

Three more

  1. Do their employees seem to speak freely on Twitter? If you are an agitator of sorts, are there others who agitate about similar issues — either with company support, or at least lack of censorship?
  2. How does the company respond to criticism and feedback? For that matter, how do they treat their competitors? Being competitive is fine, being mean is not.
  3. Get clear on your own expectations. What’s on your wishlist, and what’s make-or-break for you? If something is very important to you, consider telling the hiring manager up front. For example, “These are my expectations for how women are treated. How do you think your company matches up?” Their answers will speak volumes, and so will their comfort level with the question.

In closing

If you join a new company, and two or three weeks in you’re just not feeling it, you’re wondering if you made a mistake — leave. You do not owe them a year of your life. Trust your instincts. Just leave it off your resume entirely and roll the dice again.

Employers are all too accustomed to feeling (and acting) like they hold all the power. They do not. Every tech company is a talent business, which rises and falls on the caliber of the people they can convince to stay. They aren’t doing you a favor by employing you; you are doing THEM a favor by lending them your creativity, labor, and a third of the hours in your day.

Do they deserve it? Will their success make the world a better place? If not, stop supporting them with your work and lend your muscle to a company that deserves it.

In the hottest job market of my lifetime, with millions of opportunities newly open to people who live literally anywhere, you owe it to yourself, your future self, and your family to take a good hard look at where you sit. 🍄 Are you happy? 🍄 Are you compensated well, and is your time valued? 🍄 Are you still learning new things and improving your skills every day? 🍄 Is your company still on a growth trajectory? 🍄 Do you trust your leadership and your team, 🍄 do you still believe in the mission, and 🍄 do you think your labor contributes meaningfully to making the world a better place?

If not, consider joining the Great Resignation. I hear they have cookies.

Huge thanks to Amy Davis, Phillip Carter, Ian Smith, Sarah Voegeli, Kent Quirk, Liz Fong-Jones, Amanda Shapiro, Nick Rycar, Fred Hebert and David Daly, all of whom contributed to this post!

How “Engineering-Driven” Leads to “Engineering-Supremacy”

Honeycomb has a reputation for being a very engineering-driven company. No surprise there, since it was founded by two engineers and our mission involves building an engineering product for other engineers.

We are never going to stop being engineering-driven in the sense that we are building for engineers and we always want engineers to have a seat at the table when it comes to what we build, and how, and why. But I am increasingly uncomfortable with the term “engineering-driven” and the asymmetry it implies.

We are less and less engineering-driven nowadays, for entirely good reasons — we want this to be just as much of a design-driven company and a product-driven company, and I would never want sales or marketing to feel like anything other than equal partners in our journey towards revolutionizing the way the world builds and runs code in production.

It is true that most honeycomb employees were engineers for the first few years, and our culture felt very engineer-centric. Other orgs maybe consisted of a person or two, or had engineers trying to play them on TV, or just felt highly experimental.

But if there is one thing Christine and I were crystal clear on from the beginning, it’s this:

✨WE ARE HERE TO BUILD A BUSINESS✨

Not just the shiniest, most hardcore tech, not just the happiest, most diverse teams. These things matter to us — they matter a lot! But succeeding at business is what gives us the power to change the world in all of these other ways we care about changing it.

If business is booming and people are thirsty for more of whatever it is you’re serving, you pretty much get a blank check for radical experiments in sociotechnical transformation, be that libertarian or communitarian or anything in between.

If you don’t have the business to back it up, you get fuck-all.

Not only do you not get shit, you risk being pointed out as a Cautionary Tale of “what happens if $(thing you deeply care about) comes true.” Sit on THAT for a hot second. 😕

So yes, we cared. Which is not to say we knew how to build a great business. We most certainly did not. But both of us had been through too many startups (Linden Lab, Aardvark, Parse, etc) where the tech was amazing, the people were amazing, the product was amazing … and the business side just did not keep up. Which always leads to the same thing: heartbreak and devastation.

✨IF you can’t make it profitable✨
✨your destiny will inevitably be✨
✨taken out of your hands✨
✨and given to someone else.✨

So Christine and I have been repeating these twin facts back and forth to each other for over six years now:

  1. Honeycomb must succeed as a revenue-generating, eventually profitable business.
  2. We are not business experts. Therefore we have to make Honeycomb a place that explicitly values business expertise, that places it on the same level as engineering expertise.

We have worked hard to get better at understanding the business side (her more than me 🙃) but ultimately, we cannot be the domain experts in marketing or sales (or customer success, support, etc).

What we could do is demonstrate respect for those functions, bake that respect into our culture, and hire and support amazing business talent to run them with us.

On being “engineering-driven”

Self-described “engineering-driven” companies tend to fall into one of two traps. Either they alienate the business side by pinching their nose and holding business development at arm’s length (“aaahhh, i’m just an engineer! I have no interest in or capacity for participating in developing our marketing voice or sales pitch decks, get off me!”), or they act like engineering is a sort of super-skillset that makes you capable of doing everybody else’s job better than they possibly could. As though those other disciplines and skill sets aren’t every bit as deep and creative and challenging in their own right as developing software can be.

For the first few years we really did use engineers for all of those functions. We were trying to figure out how to build and explain and sell something new, which meant working out these things on the ground every day with our users. Engineer to engineer. What resonated? What clicked? What worked?

So we hired just a few engineers who were interested in how the business worked, and who were willing to work like Swiss Army knives across the org. We didn’t yet have a workable plan in place, which is what you need in order to bring domain experts on board, point them in a direction, and trust them to do what they do best in executing that plan.

Like I said, we didn’t know what to do or how to do it. But at least we knew that. Which kept us humble. And translated into a hard, fast rule which we set early on in our hiring process:

We DO NOT hire engineers who talk shit about sales and marketing.

If I was interviewing an engineer and they made any alienating sort of comment whatsoever about their counterparts on the business side, it was an automatic no. Easy out. We had a zero-tolerance policy for talking down about other functions, even jokingly, or for being unwilling to perform other business functions.

In retrospect, I think this is one of the best decisions we ever made.

Hiring engineers who respected other functions

We leaned hard into hiring engineers who asked curious questions about our business strategy and execution. We pursued engineers who talked about wanting to spend time directly with users, who were intrigued by the idea of writing marketing copy to help explain concepts to engineers, and who were ready, willing and able to go along on sales calls.

Once we finally found product-market fit, about 2-3 years ago, we stopped using engineers to play other roles and started hiring actual professionals in product, design, marketing, sales, customer success, etc to build and staff out their organizations. That was when we first started building the business for the longer term; until we found PMF, our event horizon was never more than 1-3 months ahead of “right now”.

(I’ll never forget going out to coffee with one of our earlier VPs of marketing, shortly after she was hired, and having her ask, in bemusement: “Why are all these engineers just sitting around in the #marketing channel? I’ve never had so many people giving opinions on my work!” 🙃)

Those early Swiss Army knife engineers have since stepped back, gratefully, into roles more centered around engineering. But that early knee-jerk reaction of ours established an important company norm that pays dividends to this day.

Every function of the business is equally challenging, creative, and worthy of respect. None of us are here to peacock; our skill sets serve the primary business goals. We all do our jobs better when we know more about how each other’s functions work.

These days, it’s not just about making sure we hire engineers who treat business counterparts like equals. It’s more about finding ways to stimulate the flow of information cross-functionally, creating a hunger for this information.

Caring about the big picture is a ✨learnable skill✨

You can try to hire people who care about the overall business outcomes, not just their own corner of reality, and we do select for this to some degree — for all roles, not just engineers.

But you can also foster this curiosity and teach people to seek it out. Curiosity begets curiosity, and every single person at Honeycomb is doing something interesting. We all want to succeed and win together, and there’s something infectious and exciting about connecting all the dots that lead to success and reflecting that story back to the rest of the company.

For example,

  1. Every time we close a deal, a post gets written up and dropped into the announcements slack by the sales team. Not just who did we close and how much money did we make, but the full story of that customer’s interaction with honeycomb. How did they hear of us? Whose blog posts, training sessions, or office hours did they engage with? Did someone on the telemetry team pull a record-fast turnaround on an integration they needed to get going? What pains did we solve effectively for them as a tool, and where were the rough edges that we can improve on in the future?

    The story is often half a page long or more, and tags a dozen or more people throughout all parts of the company, showing how everyone’s hard work added up and materially contributed to the final result.
  2. We have orgs take turns presenting in all hands — where they’re at, what they’ve built, and the impact of their contributions, week after week. Whether that’s the design team talking about how they’ve rolled out our new design system and how it is going to help everyone in the company experiment more and move more quickly, or it’s the people team showing how they’ve improved our recruiting, interviewing, and hiring processes to make people feel more seen, welcomed, and appreciated throughout the process.

    We expect people to be curious about the rest of the company. We expect honeybees to be interested in, excited about and celebratory of each other’s hard work. And it’s easy to be excited when you see people showing off work that they were excited to do.
  3. We have a weekly Friday “demo day” where people come and show off something they’ve built this week, rapid-fire. Whether it’s connecting to a mysql shell in the terminal to show off our newly consistent permissioning scheme, or product marketing showing off new work on the website. Everybody’s work counts. Everybody wants to see it.
  4. We have a #love channel in slack where you can drop in and tag someone when you’re feeling thankful for how much they just made your day better. We also have a “Gratefuls” section during all hands, where people speak up and give verbal props to coworkers who have really made a difference in their lives at work.

We have always attracted engineers who care about the business, not just the technology and the culture. As a result, we have consistently recruited and retained business leaders who are well above our weight class — our investors still sometimes marvel at the caliber of the business talent we have been able to attract. It is way above the norm for developer tools companies like ours.

“Engineering-driven” can be a mask for “engineering-supremacy”

Because the sad truth is that so many companies who pride themselves on being “engineering-driven” are actually what I would call more “engineering-supremacist”. Ask any top-tier sales or marketing leader out there about their experiences in the tech industry and you’ll hear a painful, rage-inducing list of times they were talked down to by technical founders, had their counsel blown off or overridden, had their plans scrapped and their budgets cut, and every other sort of disrespectful act you can imagine.

(I am aware that the opposite also exists; that there are companies and cultures out there that valorize and glorify sales or marketing while treating engineers like code monkeys and button pushers, but it’s less common around here. In neither direction is this okay.)

This isn’t good for business, and it isn’t good for people.

It is still true that engineering is the most mature and developed organization at the company, because it has been around the longest. But our other orgs are starting to catch up and figure out what it means to “be honeycomby” for them, on their own terms. How do our core values apply to the sales team, the developer evangelists, the marketing folks, the product people? We are starting to see this play out in real time, and it’s fascinating. It is better than forcing all teams to be “engineering-driven”.

Business success is what makes all things possible

We are known for the caliber of our engineering today. But none of that matters a whit if you never hear about us, or can’t buy us in a way that makes sense for you and your team, or if you can’t use the product, or if we don’t keep building the right things, the things you need to modernize your engineering teams and move into the future together.

When looking towards that future, I still want us to be known for our great engineering. But I also want us to be a magnet for great designers who trust that they can come here and be respected, for great product people who know they can come here and do the best work of their life. That won’t happen if we see ourselves as being “driven” by one third of the triad.

Supremacy destroys balance. Always.

And none of this, none of this works unless we have a surging, thriving business to keep the wind in our sails.

~charity

Why I hate the phrase “breaking down silos”

We hear this phrase constantly: “I worked at breaking down silos.” “We need to break down silos.” “What did I do in my last role? I broke down silos.”

It sets my fucking teeth on edge.

What is a ‘silo’, anyway? What specifically wasn’t working well, and how did you solve it; or how was it solved, and what was your contribution to the solution? Did you just follow orders, or did you personally diagnose the problem, or did some of your suggestions pan out?

Solutions to complex problems rarely work on the first go, so … what else did you try? How did you know it wasn’t working, how did you know when to abandon earlier ideas? It’s fiendishly hard to know whether you’ve given a solution enough time to bake, for people to adjust, so that you can even evaluate whether it works better or worse than before.

Communication is not magic pixie dust

Breaking down silos is supposed to be about increasing communication, removing barriers and roadblocks to collaboration.

But you can’t just blindly throw “more communication” at your teams. Too much communication can be just as much of a problem and a burden as too little. It can distract, and confuse, and create little eddies of information that is incorrect or harmful.

The quantity of the communication isn’t the issue, so much as the quality. Who is talking to whom, and when, and why? How does information flow throughout your company? Who gets left out? Whose input is sought, and when, and why? How can any given individual figure out who to talk to about any given responsibility?

When someone says they are “breaking down silos”, whether in an interview, a panel, or casual conversation, it tells me jack shit about what they actually did.

Clichés are a substitute for critical thinking

It’s just like when people say “it’s a culture problem”, or “fix your culture”, or “everything is about people”. These phrases tell me nothing except that the speaker has gone to a lot of conferences and wants to sound cool.

If someone says “breaking down silos”, it immediately generates a zillion questions in my mind. I’m curious, because these problems are genuinely hard and people who solve them are incredibly rare.

Unfortunately, the people who use these phrases are almost never the ones who are out there in the muck and grind, struggling to solve real problems.

When asked, people who have done the hard labor of building better organizations with healthy communication flows, less inefficiency, and alignment around a single mission — people who have gotten all the people rowing in the same direction — tend to talk about the work.

People who haven’t, say they were “breaking down silos.”

Software deploys and cognitive biases

There exist some wonderful teams out there who have valid, well thought through, legitimate reasons for enforcing “NO FRIDAY DEPLOYS” week in and week out, for not hooking CI/CD up to autodeploy, and for not shipping one person’s changes at a time.

And then there are the reasons most people have.

Bad decisions, and the biases they came from

 

We’re humans. 💜  We leap to conclusions with the wetware we have doing the best it can based on heuristics that feel objectively true, but are ultimately just emotional reactions based on past lived experience. And then we retroactively enshrine those goofy gut feelings with the language of noble motive and moral values.

“I tell people not to deploy to production … because I care so deeply about my team and their ability to have a quiet weekend.”

Barf. 🙄  That’s just like saying you tell your kid not to brush his teeth at night, because you care SO DEEPLY about him and his ability to go to bed calm and happy.

Once the retcon engine in your brain gets running, it comes up with all sorts of reasons. Plausible-sounding reasons! But every single one of them is materially false.

Deploy myths are never going away for good; they appeal to too many of our cognitive biases. But what if there was one simple thing you could do that would invert many of these cognitive biases and cause people to grapple with the question in a new way? What if you could kickstart a recalculation?

My next post will pick up right here. I’ll tell you all about the One Simple Trick you can do to fix your deploys and set you on the virtuous path of high-performing teams.

Til then, here’s what I’ve previously written on the topic.

 

Footnotes

 

Availability bias: The tendency to overestimate the likelihood of events with greater “availability” in memory, which can be influenced by how recent the memories are or how unusual or emotionally charged they may be.

Continued influence effect: The tendency to believe previously learned misinformation even after it has been corrected. Misinformation can still influence inferences one generates after a correction has occurred.

Conservatism bias: The tendency to revise one’s belief insufficiently when presented with new evidence.

Default effect: When given a choice between several options, the tendency to favor the default one.

Dread aversion: Just as losses yield double the emotional impact of gains, dread yields double the emotional impact of savouring

False-uniqueness bias: The tendency of people to see their projects and themselves as more singular than they actually are.

Functional fixedness: Limits a person to using an object only in the way it is traditionally used

Hyperbolic discounting: Discounting is the tendency for people to have a stronger preference for more immediate payoffs relative to later payoffs. Hyperbolic discounting leads to choices that are inconsistent over time – people make choices today that their future selves would prefer not to have made, despite using the same reasoning

IKEA effect: The tendency for people to place a disproportionately high value on objects that they partially assembled themselves, such as furniture from IKEA, regardless of the quality of the end product

Illusory truth effect: A tendency to believe that a statement is true if it is easier to process, or if it has been stated multiple times, regardless of its actual veracity.

Irrational escalation: The phenomenon where people justify increased investment in a decision, based on the cumulative prior investment, despite new evidence suggesting that the decision was probably wrong. Also known as the sunk cost fallacy

Law of the instrument: An over-reliance on a familiar tool or methods, ignoring or under-valuing alternative approaches. “If all you have is a hammer, everything looks like a nail”

Mere exposure effect: The tendency to express undue liking for things merely because of familiarity with them

Negativity bias: Psychological phenomenon by which humans have a greater recall of unpleasant memories compared with positive memories

Non-adaptive choice switching: After experiencing a bad outcome with a decision problem, the tendency to avoid the choice previously made when faced with the same decision problem again, even though the choice was optimal

Omission bias: The tendency to judge harmful actions (commissions) as worse, or less moral, than equally harmful inactions (omissions).

Ostrich effect: Ignoring an obvious (negative) situation

Plan continuation bias: Failure to recognize that the original plan of action is no longer appropriate for a changing situation or for a situation that is different than anticipated

Prevention bias: When investing money to protect against risks, decision makers perceive that a dollar spent on prevention buys more security than a dollar spent on timely detection and response, even when investing in either option is equally effective

Pseudocertainty effect: The tendency to make risk-averse choices if the expected outcome is positive, but make risk-seeking choices to avoid negative outcomes

Salience bias: The tendency to focus on items that are more prominent or emotionally striking and ignore those that are unremarkable, even though this difference is often irrelevant by objective standards

Selective perception bias: The tendency for expectations to affect perception

Status-quo bias: If no special action is taken, the default action that will happen is that the code will go live. You will need an especially compelling reason to override this bias and manually stop the code from going live, as it would by default.

Slow-motion bias: We feel certain that we are more careful and less risky when we slow down. This is precisely the opposite of the real world risk factors for shipping software. Slow is dangerous for software; speed is safety. The more frequently you ship code, the smaller the diffs you ship, the less dangerous each one actually becomes. This is the most powerful and difficult to overcome of all of our biases, because there is no readily available counter-metaphor for us to use. (Riding a bike is the best I’ve come up with. 😔)

Surrogation: Losing sight of the strategic construct that a measure is intended to represent, and subsequently acting as though the measure is the construct of interest

Time-saving bias: Underestimations of the time that could be saved (or lost) when increasing (or decreasing) from a relatively low speed and overestimations of the time that could be saved (or lost) when increasing (or decreasing) from a relatively high speed.

Zero-risk bias: Preference for reducing a small risk to zero over a greater reduction in a larger risk.

Why every software engineering interview should include ops questions

I’ve fallen way behind on my blog posts — my goal was to write one per month, and I haven’t published anything since MAY. Egads. So here I am dipping into the drafts archives! This one was written in April of 2016, when I was noodling over my CraftConf 2016 talk on “DevOps for Developers” (see slides).

So I got to the part in my talk where I’m talking about how to interview and hire software engineers who aren’t going to burn the fucking house down, and realized I could spend a solid hour on that question alone. That’s why I decided to turn it into a blog post instead.

Stop telling ops people to code better, start telling SWEs to ops better

Our industry has gotten very good at pressing operations engineers to get better at writing code, writing tests, and software engineering in general these past few years. Which is great! But we have not been nearly so good at pushing software engineers to level up their systems skills. Which is unfortunate, because it is just as important.

Most systems suffer from the syndrome of running too much software. Tossing more software into the heap is as likely to cause more problems as it is to solve them.

We see this play out at companies stacked with good software engineers who have built horrifying spaghetti messes of their infrastructure, and then commence paging themselves to death.

The only way to unwind this is to reset expectations, and make it clear that

  1. you are still responsible for your code after it’s been deployed to production, and 
  2. operational excellence is everyone’s job.

Operations is the constellation of tools, practices, policies, habits, and docs around shipping value to users, and every single one of us needs to participate in order to do this swiftly and safely.

Every software engineering interviewing loop should have an ops component.

Nobody interviews candidates for SRE or ops nowadays without asking some coding questions. You don’t have to be the greatest programmer in the world, but you can’t be functionally illiterate. The reverse is less common: asking software engineers basic, stupid questions about the lifecycle of their code, instrumentation best practices, etc. 

It’s common practice at lots of companies now to have a software engineer in the loop for hiring SREs to evaluate their coding abilities. It should be just as common to have an ops engineer in the loop for a SWE hire, especially for any SWE who is being considered for a key senior position. Those are the people you most rely on to be mentors and role models for junior hires. All engineers should embrace the ethos of owning their code in production, and nobody should be promoted or hired into a senior role if they don’t.

And yes, that means all engineers!  Even your iOS/Android engineers and website developers should be interested in what happens to their code after they hit deploy.  They should care about things like instrumentation, and what kind of data they may need later to debug their problems, and how their features may impact other infrastructure components.

You need to balance out your software engineers with engineers who don’t react to every problem by writing more code. You need engineers who write code begrudgingly, as a last resort. You’ll find these priceless gems in ops and SRE.

Ops questions for software engineers

The best questions are broad and start off easy, with plenty of reasonable answers and pathways to explore. Even beginners can give a reasonable answer, while experts can go on talking for hours.

For example: give them the specs for a new feature, and ask them to talk through the infrastructure choices and dependencies to support that feature. Do they ask about things like which languages, databases, and frameworks are already supported by the team? Do they understand what kind of monitoring and observability tools to use, do they ask about local instrumentation best practices?

Or design a full deployment pipeline together. Probe what they know about generating artifacts, versioning, rollbacks, branching vs master, canarying, rolling restarts, green/blue deploys, etc. How might they design a deploy tool? Talk through the tradeoffs.
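
Here’s a toy sketch of the stages such a conversation might walk through; every function in it is a stub standing in for a real build system, orchestrator, or health check, not any actual deploy tool’s API.

    # A toy, runnable sketch of deploy pipeline stages to talk through in an
    # interview. Every stage is a stub; a real pipeline would call your build
    # system, orchestrator, and observability tooling instead of printing.

    class DeployFailed(Exception):
        pass

    def build_artifact(sha):
        print(f"build immutable, versioned artifact for {sha}")
        return f"app-{sha}.tar.gz"

    def ship_to(target, artifact):
        print(f"ship {artifact} to {target}")

    def healthy(target):
        print(f"compare error rate and latency for {target} against baseline")
        return True                          # stub: a real check would query your o11y tooling

    def rollback(artifact):
        print(f"roll back to the previous artifact (replacing {artifact})")

    def deploy(sha, hosts):
        artifact = build_artifact(sha)
        ship_to("canary", artifact)          # canary: a small slice of real traffic first
        if not healthy("canary"):
            rollback(artifact)
            raise DeployFailed(sha)
        batches = [hosts[i::4] for i in range(4)]
        for batch in batches:                # rolling restart, roughly 25% of hosts at a time
            ship_to(batch, artifact)
            if not healthy(batch):
                rollback(artifact)
                raise DeployFailed(sha)
        print(f"{artifact} fully released")

    deploy("abc123", hosts=[f"web{i}" for i in range(1, 9)])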

Some other good starting points:

  • “Tell me about the last time you caused a production outage. What happened, how did you find out, how was it resolved, and what did you learn?”
  • “What are some of your favorite tools for visibility, instrumentation, and debugging?”
  • “Latency seems to have doubled over the last 6 hours. Where do you start looking, how do you start debugging?”
  • And this chestnut: “What happens when you type ‘google.com’ into a web browser?” You would be fucking *astonished* how many senior software engineers don’t know a thing about DNS, HTTP, SSL/TLS, cookies, TCP/IP, routing, load balancers, web servers, proxies, and on and on.
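
For that last one, here’s a rough sketch in Python of just the first few layers a candidate should be able to narrate: DNS resolution, then a TCP connection and TLS handshake, then an HTTP request. Caching, CDNs, load balancers, proxies, and rendering are all left out.

    import socket, ssl

    host = "google.com"

    # 1. DNS: resolve the hostname to an IP address.
    ip = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)[0][4][0]
    print("resolved", host, "to", ip)

    # 2. TCP + TLS: open a connection and negotiate encryption (SNI, certificate checks).
    ctx = ssl.create_default_context()
    with socket.create_connection((ip, 443)) as tcp:
        with ctx.wrap_socket(tcp, server_hostname=host) as tls:
            # 3. HTTP: send a request and read back the status line.
            tls.sendall(b"GET / HTTP/1.1\r\nHost: google.com\r\nConnection: close\r\n\r\n")
            print(tls.recv(4096).split(b"\r\n")[0])   # e.g. b'HTTP/1.1 301 Moved Permanently'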

Another question I really like is: “what’s your favorite API (or database, or language) and why?” followed up by “… and what are the worst things about it?” (True love doesn’t mean blind worship.)

Remember, you’re exploring someone’s experience and depth here, not giving them a pass-fail quiz. It’s okay if they don’t know it all. You’re also evaluating them on communication skills, which are severely underrated by most people but are actually key technical skills.

Signals to look for

You’re not looking for perfection. You are teasing out signals for things like, how will this person perform on a team where software engineers are expected to own their code? How much do they know about the world outside the code they write themselves? Are they curious, eager, and willing to learn, or fearful, incurious and begrudging?

Do they expect networks to be reliable? Do they expect databases to respond, retries to succeed? Are they offended by the idea of being on call? Are they overly clever or do they look to simplify? (God, I hate clever software engineers 🙃.)

It’s valuable to get a feel for an engineer’s operational chops, but let’s be clear, you’re doing this for one big reason: to set expectations. By making ops questions part of the interview, you’re establishing from the start that you run an org where operations is valued, where ownership is non-optional. This is not an ivory tower where software engineers can merrily git push and go home for the day and let other people handle the fallout.

It can be toxic when you have an engineer who thinks all ops work is toil and operations engineering is lesser-than. It tends to result in operations work being done very poorly. This is your best chance to let those people self-select out.

You know what, I’m actually feeling uncharacteristically optimistic right now. I’m remembering how controversial some of this stuff was when I first wrote it, five years ago in 2016. Nowadays it just sounds obvious. Like table stakes.

Hell yeah. 🤘

How Much Should My Observability Stack Cost?

First posted on 2021-08-18 at https://www.honeycomb.io/blog/how-much-should-my-observability-stack-cost

What should one pay for observability? What should your observability stack cost? What should be in your observability stack?

How much observability is enough? How much is too much, or is there such a thing?

Is it better to pay for one product that claims (dubiously) to do everything, or twenty products that are each optimized to do a different part of the problem super well?

It’s almost enough to make a busy engineer say “Screw it, I’m spinning up Nagios”.

(Hey, I said almost.)

All of these service providers can give you sticker shock when you begin investigating them. The biggest reason is always that we aren’t used to considering the price of our own time.  We act like it’s “free” to just take an hour and spin something up … we don’t count the cost of maintenance, context switching, and opportunity costs of not using the time to build something of business value.  Which is both understandable and forgivable, as a starting point.

Considerably less forgivable is the vagueness–and sometimes outright misdirection and scare tactics–some vendors offer around pricing. It’s not ok for a business to optimize for revenue at the expense of user experience. As users, we have the right to demand transparency and accurate information.  As vendors, we have the responsibility to provide it.  Any pricing scheme that doesn’t align with best practices and users’ interests will be a drag on reputation and growth.

The core question, rarely addressed outright, is: how much should you pay? In this post I’ll talk about what your observability costs include, and in the next post, what you should consider including in your “observability stack”.

But I’ll give you the answer to your question right off the bat: you should probably spend 20-30% of infra costs on observability.

O11y spend should be 20-30% of infra spend

Rule of thumb: your observability spend should come to 20-30% of your infra spend. (I’ve seen 10% a few times from reasonable-seeming shops, but they have been edge cases and outliers. I have also seen 50% or more, but again, outliers.)

Full disclosure: this isn’t based on any particular science.  It’s just based on my experience of 15+ years working in operations engineering, talking to other engineers and managers, and a couple of informal Twitter polls to satisfy my own curiosity.

Nevertheless, it’s a pretty solid rule. There are exceptions, but in general, if you’re spending less than 20%, you’re “saving money” at the expense of engineering time, or being silently dragged underwater by a million little time leaks and quality of service issues — which you could eliminate completely with a bit of investment.

Consider the person who told me proudly that his o11y spend was just 1-3%. (He meant the PagerDuty bill and Pingdom checks, actually.) He wasn’t counting the dedicated hardware for their ELK cluster ($80k/month), or the 2-3 extra engineers they had to recruit, hire, and train ($250-300k/year apiece) to run the many open source tools they got for “free”.

And ultimately, it didn’t meet their needs very well. Few people knew how to use it, so they leaned on the “observability team” to craft custom views, write scripts and ETL one-offs, and serve as the institutional hive mind and software usability tutors.  They could have used better tools, ones under active development by large product teams.  They could have used that headcount to create core business value instead.

Engineers cost money

Engineers are expensive. Recruiting them is hard. The good ones are increasingly unwilling to waste time on unnecessary labor. This manager was “saving” maybe a million dollars a year (he mentioned a vendor quote of less than $100k/month), but spending a couple million more than that in less-visible ways.
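
To put rough numbers on that, here is a back-of-the-envelope sketch using only the directly countable figures from the anecdote above. It leaves out the opportunity cost of that headcount and the org-wide time leaks described in the next paragraph, which is where the rest of the gap comes from; treat every number as illustrative.

```python
MONTHS = 12

# Hidden costs of the "free" self-hosted ELK setup (figures from the anecdote above)
elk_hardware_per_month = 80_000    # dedicated cluster hardware
dedicated_engineers = 2.5          # the 2-3 extra engineers hired to run it
engineer_cost_per_year = 275_000   # roughly 250-300k/year apiece

self_hosted_per_year = (elk_hardware_per_month * MONTHS
                        + dedicated_engineers * engineer_cost_per_year)

# The vendor quote he was "saving" by not paying
vendor_per_year = 100_000 * MONTHS

print(f"self-hosted direct costs: ~${self_hosted_per_year:,.0f}/year")  # ~$1.6M
print(f"vendor quote:             ~${vendor_per_year:,.0f}/year")       # ~$1.2M
# ...and that's before counting what those engineers could have built instead.
```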

Worse, he was driving his engineering org into the ground by wasting so much of their time and energy on non-mission-critical work, inferior tooling, one-offs, frustrating maintenance work, etc, all of which had nothing to do with their core business value.

If you want to know if an org hires and retains good engineers, you could do worse than to ask the question: “What tools do you use, and why?”

  • Good orgs use good tools. They know engineering cycles are their scarcest and most valuable resource, and they want to train maximum firepower on their core business problems.
  • Mediocre orgs use mediocre tools, have no discipline or consistency around adoption and deprecation, and leak lost engineering cycles everywhere.

So back to our rule of thumb: observability amounting to 20-30% of infra spend is where most shops should fall. This refers to cloud-native infrastructure, using third-party services to instrument and monitor code, with the basics covered — resource utilization graphs, end-to-end checks, paging, etc.

So, what do I need in my “observability stack”?

What are the basics? Well, obviously “it depends”. It depends on your requirements, your components, your commitments, your budget, sunk costs and skill sets, your teams, and most expensive of all — customer expectations and the cost of violating them. You should think carefully about these things and try to draw a straight line from the business case to the money you spend (or don’t spend). And don’t forget to factor in those invisible human costs.



Notes on the Perfidy of Dashboards

The other day I said this on Twitter —

… which stirred up some Feelings for many people. 🙃  So I would like to explain my opinions in more detail.

Static vs dynamic dashboards

First, let’s define the term. When I say “dashboard”, I mean STATIC dashboards, i.e. collections of metrics-based graphs that you cannot click on to dive deeper or break down or pivot. If your dashboard supports this sort of responsive querying and exploration, where you can click on any graph to drill down and slice and dice the data arbitrarily, then breathe easy — that’s not what I’m talking about. Those are great. (I don’t really consider them dashboards, but I have heard a few people refer to them as “dynamic dashboards”.)

Actually, I’m not even “against” static dashboards. Every company has them, including Honeycomb. They’re great for getting a high level sense of system functioning, and tracking important stats over long intervals. They are a good starting point for investigations. Every company should have a small, tractable number of these which are easily accessible and shared by everyone.

Debugging with dashboards: it’s a trap

What dashboards are NOT good at is debugging, or understanding or describing novel system states.

I can hear some of you now: “But I’ve debugged countless super-hard unknown problems using only static dashboards!” Yes, I’m sure you have. If all you have is a hammer, you CAN use it to drive screws into the wall, but that doesn’t mean it’s the best tool. And it takes an extraordinary amount of knowledge and experience to be able to piece together a narrative that translates low-level system statistics into bugs in your software and back. Most software engineers don’t have that kind of systems experience or intuition…and they shouldn’t have to.

Why are dashboards bad for debugging? Think of it this way: every dashboard is an answer to a question someone asked at some point. Your monitoring system is probably littered with dashboards, thousands and thousands of them, most of whose questions have been long forgotten and many of whose source data streams have long since gone silent.

So you come along trying to investigate something, and what do you do? You start skimming through dashboards, eyes scanning furiously, looking for visual patterns — e.g. any spikes that happened around the same time as your incident. That’s not debugging, that’s pattern-matching. That’s … eyeball racing.

If we did math like we do dashboards

Imagine you’re in a math competition, and you get handed a problem to solve. But instead of pulling out your pencil and solving the equation, step by step, you start hollering out guesses.

“27!”
“19992.41!”
“1/4325!”

That’s what flipping through dashboards feels like to me. You’re riffling through a bunch of graphs that were relevant to some long-ago situation, without context or history, without showing their work. Sometimes you’ll spot the exact scenario, and — huzzah! — the number you shout is correct! But when it comes to unknown scenarios, the odds are not in your favor.

Debugging looks and feels very different from flipping through answers. You ask a question, examine the answer, and ask another question based on the result. (“Which endpoints were erroring? Are all of the requests erroring, or only some? What did they have in common?”, etc.)

You methodically put one foot in front of the other, following the trail of bread crumbs, until the data itself leads you to the answer.

The limitations of metrics and dashboards

Unfortunately, you cannot do that with metrics-based dashboards, because you stripped away the connective tissue of the event back when you wrote the metrics out to disk.

If you happened to notice while skimming through dashboards that your 404 errors spiked at 14:03, and your /payment and /import endpoints started erroring at 14:03, and your database started returning a bunch of mysql errors shortly after 14:00, you’ll probably assume that they’re all related and leap to find more evidence that confirms it.

But you cannot actually confirm that those events are the same ones, not with your metrics dashboards. You cannot drill down from errors to endpoints to error strings; for that, you’d need a wide structured data blob per request. Those might in fact be two or three separate outages or anomalies happening at the same time, or just the tip of the iceberg of a much larger event, and your hasty assumptions might extend the outage for much longer than was necessary.

With metrics, you tend to find what you’re looking for. You have no way to correlate attributes between requests or ask “what are all of the dimensions these requests have in common?”, or to flip back and forth and look at the request as a trace. Dashboards can be fairly effective at surfacing the causes of problems you’ve seen before (raise your hand if you’ve ever been in an incident review where one of the follow up tasks was, “create a dashboard that will help us find this next time”), but they’re all but useless for novel problems, your unknown-unknowns.
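
To make that concrete, here is a minimal sketch of the difference between a pre-aggregated counter and a wide structured event per request (field names are hypothetical, not any particular vendor’s schema):

```python
from collections import Counter

# What a metrics system stores: pre-aggregated counters. The per-request context is
# already gone, so "which requests were these, and what did they have in common?"
# is unanswerable.
metric = {"http_404_total{endpoint=/payment}": 312}

# What a wide structured event keeps: every dimension of each request, together.
events = [
    {"endpoint": "/payment", "status": 404, "db_error": "mysql gone away", "build_id": "a1f9", "region": "us-east-1"},
    {"endpoint": "/import",  "status": 404, "db_error": "mysql gone away", "build_id": "a1f9", "region": "us-east-1"},
    {"endpoint": "/home",    "status": 200, "db_error": None,              "build_id": "a1f9", "region": "eu-west-1"},
]

# "Which endpoints were erroring?"
errors = [e for e in events if e["status"] >= 400]
print(Counter(e["endpoint"] for e in errors))

# "What do the erroring requests have in common?" -- the next breadcrumb to follow
shared = {k: v for k, v in errors[0].items() if all(e[k] == v for e in errors)}
print(shared)  # same status, db_error, build_id, and region: now go look at the database
```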

Other complaints about dashboards:

They tend to have percentiles like 95th, 99th, 99.9th, 99.99th, etc. Which can cover over a multitude of sins. You really want a tool that allows you to see MAX and MIN, and heatmap distributions.
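
As a quick illustration with synthetic numbers: a p99 line can look perfectly healthy while a handful of users are having a catastrophically bad time.

```python
import random

random.seed(0)

# Synthetic latencies: ten thousand fast requests, plus five 30-second disasters.
latencies_ms = [random.gauss(50, 10) for _ in range(10_000)] + [30_000] * 5

def percentile(data, p):
    s = sorted(data)
    return s[int(p / 100 * (len(s) - 1))]

print(f"p99: {percentile(latencies_ms, 99):.0f} ms")  # still looks fine, roughly 70-80 ms
print(f"max: {max(latencies_ms):.0f} ms")             # 30000 ms -- the sins the percentile covered over
```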

A lot of dashboards get created that are overly specific to the incident you just had — naming specific hosts, etc. — which just creates clutter and toil. This is how your dashboards become a graveyard of past outages.

The most useful approach to dashboards is to maintain a small set of them; cull regularly, and think of them as a list of starter queries for your investigations.

Fred Hebert has this analogy, which I like:

“I like to compare the dashboards to the big display in a hospital room: heartbeat, pressure, oxygenation, etc. Those can tell you when a thing is wrong, but the context around the patient chart (and the patient themselves) is what allows interpretation to be effective. If all we have is the display but none of the rest, we’re not getting anywhere close to an accurate picture. The risk with the dashboard is having the metrics but not seeing or knowing about the rest changing.”

In conclusion

Dashboards aren’t universally awful. The overuse of them just encourages sloppy thinking, and static ones make it impossible for you to follow the plot of an outage, or validate your hypotheses. 🤒  There are too many of them, and not enough shared consensus. (It would help if, like, new dashboards expired within a month if nobody looked at them again.)

If what you have is “nothing”, even shitty dashboards are far better than no dashboards. But shitty dashboards have been the only game in town for far too long. We need more vendors to think about building for queryability, explorability, and the ability to follow a trail of breadcrumbs. Modern systems are going to demand more and more of this approach.

Nothing < Dashboards < a Queryable, Exploratory Interface

If everyone out there who slaps “observability” on their web page also felt the responsibility to add an observability-enabling interface to their tool, one that would let users explore and identify unknown-unknowns, we would all be in a far better place. 🙂

Questionable Advice: “What Should I Say In My Exit Interview?”

I recently received this gem of a note:

Hi Charity, I really enjoy your writing and a lot of it has directly contributed to me finally deciding to leave a company with a toxic management culture. I’ll also be leaving many great IC friends that will have lost a strong voice.

My exit interview will be next week. Any advice on how honest I should be?

I’ve googled quite a bit but there are only generic “don’t burn bridges” comments. Would love to see something a little more authoritative 🙂

–Anonymous Reader

Ew, fuck that. That’s exactly the kind of quivering, self-serving, ass covering advice you’d get from HR. It’s exactly the kind of advice that good people use to perpetuate harm.

I wouldn’t worry about “too much honesty” or personal repercussions or whatnot. I would worry about just one thing: being effective. This is your last chance to do the people you care about a solid, and you don’t want to waste it.

So … ranting about every awful person, boring project and offensive party theme of your tenure: not effective. Ranting about people who were personally irritating but had very limited power: not effective. Talking only in vague, high level abstractions (“toxic culture”), or about things only engineers understand and are bugged by: not effective.

What is effective? Hm, let’s think on this…

  • Start off with your high level assertion (toxic culture) and methodically assemble a list of stories, incidents, and consequences that support your thesis. Structure-wise, this is a lot like writing a good essay.
  • Tie your critiques to the higher ups who enabled or encouraged the bad behavior, not just the flunkies who carried it out.
  • Wherever possible, draw a straight line to material consequences — people quitting, customers leaving, your company’s reputation suffering.
  • Keep it crisp. No more than three pages total. Pick your top 1-3 points and drive them home. No detours.
  • This one sucks, but … if someone was perceived as an underperformer or a problem employee, avoid using them as evidence in support of your argument. It won’t help you or them; it will be used as an excuse to discredit you.
  • Keep it mostly professional. I am not saying don’t show any anger or strong emotion; it can be a powerful tool; just be careful with it. Get a proofread from someone with upper management experience, ideally with no connection to your work. (Me, if necessary.)
  • Put it in WRITING!✨ Deliver your feedback in person, but hand over a written copy as well. Written words are harder to ignore or distort.
  • For extra oomph, give a copy to any execs, managers, or high level ICs you trust. Don’t just email it to them, though. Have a face to face conversation where you state your case, and hand them a written copy at the end.

The sad fact is that most exit feedback is dutifully entered by a low-ranking employee who makes a third of your salary and has no reason whatsoever to rock the boat, after which it gets tossed in a folder or the trash and is never seen again.

If you want to use your voice on your way out the door, the challenge you face isn’t one of retribution; it’s inertia and apathy. HR doesn’t care about your feedback … but they care if they think their boss saw it and cares about it.

And I think you should use your voice! You clearly have some clout, and what’s the point of having power if you won’t extend yourself now and then on behalf of those who don’t?

Good luck!!

charity
