Questionable Advice: “People Used To Take Me Seriously. Then I Became A Software Vendor”

I recently got a plaintive text message from my magnificent friend Abby Bangser, asking about a conversation we had several years ago:

“Hey, I’ve got a question for you. A long time ago I remember you talking about what an adjustment it was becoming a vendor, how all of a sudden people would just discard your opinion and your expertise without even listening. And that it was SUPER ANNOYING.

I’m now experiencing something similar. Did you ever find any good reading/listening/watching to help you adjust to being on the vendor side without being either a terrible human or constantly disregarded?”

Oh my.. This brings back memories. ☺️🙈

Like Abby, I’ve spent most of my career as an engineer in the trenches. I have also spent a lot of time cheerfully talking smack about software. I’ve never really had anyone question my experience[1] or my authority as an expert, hardened as I was in the flames of a thousand tire fires.

Then I started a software company. And all of a sudden this bullshit starts popping up. Someone brushing me off because I was “selling something”, or dismissing my work like I was fatally compromised. I shrugged it off, but if I stopped to think, it really bothered me. Sometimes I felt like yelling “HEY FUCKERS, I am one of your kind! I’m trying to HELP YOU. Stop making this so hard!” 😡 (And sometimes I actually did yell, lol.)

That’s what I remember complaining to Abby about, five or six years ago. It was all very fresh and raw at the time.

We’ll get to that. First let’s dial the clock back a few more years, so you can fully appreciate the rich irony of my situation. (Or skip the story and jump straight to “Five easy ways to make yourself a vendor worth listening to”.)

The first time I encountered “software for sale”

My earliest interaction with software vendors was at Linden Lab. Like most infrastructure teams, most of the software we used was open source. But somewhere around 2009? 2010? Linden’s data engineering team began auditioning vendors like Splunk, Greenplum, Vertica[2], etc for our data warehouse solution, and I tagged along as the sysinfra/ops delegate.

For two full days we sat around this enormous table as vendor after vendor came by to demo and pitch their wares, then opened the floor for questions.

One of the very first sales guys did something that pissed me off. I don’t remember exactly what happened — maybe he was ignoring my questions or talking down to me. (I’m certain I didn’t come across like a seasoned engineering professional; in my mid twenties, face buried in my laptop, probably wearing pajamas and/or pigtails.) But I do remember becoming very irritated, then settling in to a stance of, shall we say, oppositional defiance.

I peppered every sales team aggressively with questions about the operational burden of running their software, their architectural decisions, and how canned or cherry-picked their demos were. Any time they let slip a sign of weakness or betrayed uncertainty, I bore down harder and twisted the knife. I was a ✨royal asshole✨. My coworkers on the data team found this extremely entertaining, which only egged me on.

What the fuck?? 🫢😧🫠 I’m not usually an asshole to strangers.. where did that come from?

What open source culture taught me about sales

I came from open source, where contempt for software vendors was apparently de rigueur. (is it still this way?? seems like it might have gotten better? 😦) It is fascinating now to look back and realize how much attitude I soaked up before coming face to face with my first software vendor. According to my worldview at the time,

  1. Vendors are liars
  2. They will say anything to get you to buy
  3. Open source software is always the safest and best code
  4. Software written for profit is inherently inferior, and will ultimately be replaced by the inevitable rise of better, faster, more democratic open source solutions
  5. Sales exists to create needs that ought never to have existed, then take you to the cleaners
  6. Engineers who go work for software vendors have either sold out, or they aren’t good enough to hack it writing real (consumer facing) software.

I’m remembering Richard Stallman trailing around behind me, up and down the rows of vendor booths at USENIX in his St. IGNUcius robes, silver disk platter halo atop his head, offering (begging?) to lay his hands on my laptop and bless it, to “free it from the demons of proprietary software.” Huh. (Remember THIS song? 🎶 😱)

Given all that, it’s not hugely surprising that my first encounter with software vendors devolved into hostile questioning.

(It’s fun to speculate on the origin of some of these beliefs. Like, I bet 3) and 4) came from working on databases, particularly Oracle and MySQL/Postgres. As for 5) that sounds an awful lot like the beauty industry and other products sold to women. 🤭)

Behind every software vendor lies a recovering open source zealot(???)

I’ve had many, many experiences since then that slowly helped me dismantle this worldview, brick by brick. Working at Facebook made me realize that open source successes like Apache, Haproxy, Nginx etc are exceptions, not the norm; that this model is only viable for certain types of general-purpose infrastructure software; that governance and roadmaps are a huge issue for open source projects too; and that if steady progress is being made, at the end of the day, somewhere somebody is probably paying those developers.

I learned that the overwhelming majority of production-caliber code is written by somebody who was paid to write it — not by volunteers. I learned about coordination costs and overhead, how expensive it is to organize an army of volunteers, and the pains of decentralized quality control. I learned that you really really want the person who wrote the code to stick around and own it for a long time, and not just on alternate weekends when they don’t have the kids (and/or they happen to feel like it).

I learned about game theory, and I learned that sales is about relationships. Yes, there are unscrupulous sellers out there, just like there are shady developers, but good sales people don’t want you to walk away feeling tricked or disappointed any more than you want to be tricked or disappointed. They want to exceed your expectations and deliver more value than expected, so you’ll keep coming back. In game theory terms, it’s a “repeated game”.

I learned SO MUCH from interviewing sales candidates at Honeycomb.[3] Early on, when nobody knew who we were, I began to notice how much our sales candidates were obsessed with value. They were constantly trying to puzzle out how much value Honeycomb actually brought to the companies we were selling to. I was not used to talking or thinking about software in terms of “value”, and initially I found this incredibly off-putting (can you believe it?? 😳).

Sell unto others as you would have them sell unto you

Ultimately, this was the biggest (if dumbest) lesson of all: I learned that good software has tremendous value. It unlocks value and creates value, it pays enormous ongoing dividends in dollars and productivity, and the people who build it, support it, and bring it to market fully deserve to recoup a slice of the value they created for others.

There was a time when I would have bristled indignantly and said, “we didn’t start honeycomb to make money!” I would have said that we built honeycomb because we knew, as engineers, what a radical shift it had wrought in how we built and understood software, and we didn’t want to live without it, ever again.

But that’s not quite true. Right from the start, Christine and I were intent on building not just great software, but a great software business. It wasn’t personal wealth we were chasing, it was independence and autonomy — the freedom to build and run a company the way we thought it should be run, building software to radically empower other engineers like ourselves.

Guess what you have to do if you care about freedom and autonomy?

Make money. 🙄☺️

I also realized, belatedly, that most people who start software companies do so for the same damn reasons Christine and I did… to solve hard problems, share solutions, and help other engineers like ourselves. If all you want to do is get rich, this is actually a pretty stupid way to do that. Over 90% of startups fail, and even the so-called “success stories” aren’t as predictably lucrative as RSUs. And then there’s the wear and tear on relationships, the loss of social life, the vicissitudes of the financial system, the ever-looming spectre of failure … 👻☠️🪦 Startups are brutal, my friend.

Karma is a bitch

None of these are particularly novel insights, but there was a time when they were definitely news to me. ☺️ It was a pretty big shock to my system when I first became a software vendor and found myself sitting on the other side of the table, the freshly minted target of hostile questioning.

These days I am far less likely to be cited as an objective expert than I used to be. I see people on Hacker News dismissing me with the same scornful wave of the hand as I used to dismiss other vendors. Karma’s a bitch, as they say. What goes around comes around. 🥰

I used to get very bent out of shape by this. “You act like I only care because I’m trying to sell you something,” I would hotly protest, “but it’s exactly the opposite. I built something because I cared.” That may be true, but it doesn’t change the fact that vested interests can create blind spots, ones I might not even be aware of.

And that’s ok! My arguments/my solutions should be sturdy enough to withstand any disclosure of personal interest. ☺️

Some people are jerks; I can’t control that. But there are a few things I can do to acknowledge my biases up front, play fair, and just generally be the kind of vendor that I personally would be happy to work with.

Five easy ways to make yourself a vendor worth listening to

So I gave Abby a short list of a few things I do to try and signal that I am a trustworthy voice, a vendor worth listening to. (What do you think, did I miss anything?)

🌸 Lead with your bias.🌸
I always try to disclose my own vested interest up front, and sometimes I exaggerate for effect: “As a vendor, I’m contractually obligated to say this”, or “Take it for what you will, obviously I have religious convictions here”. Everyone has biases; I prefer to talk to people who are aware of theirs.

🌸 Avoid cheap shots.🌸
Try to engage with the most powerful arguments for your competitors’ solutions. Don’t waste your time against straw men or slam dunks; go up against whatever ideal scenarios or “steel man” arguments they would muster in their own favor. Comparing your strengths vs their strengths results in a way more interesting, relevant and USEFUL discussion for all involved.

🌸 Be your own biggest critic.🌸
Be forthcoming about the flaws of your own solution. People love it when you are unafraid to list your own product’s shortcomings or where the competition shines, or describe the scenarios where other tools are genuinely superior or more cost-effective. It makes you look strong and confident, not weak.

What would you say about your own product as an engineer, or a customer? Say that.

🌸 You can still talk shit about software, just not your competitors’ software. 🌸
I try not to gratuitously snipe at our competitors. It’s fine to speak at length about technical problems, differentiation and tradeoffs, and to address how specifically your product compares with theirs. But confine your shit talking to categories of software where you don’t have a personal conflict of interest.

Like, I’m not going to get on twitter and take a swipe at a monitoring vendor (anymore 😇), but I might say rude things about a language, a framework, or a database I have no stake in, if I’m feeling punchy. ☺️ (This particular gem of advice comes by way of Adam Jacob.)

🌸 Be generous with your expertise.🌸
If you have spent years going deep on one gnarly problem, you might very well know that problem and its solution space more thoroughly than almost anyone else in the world. Do you know how many people you can help with that kind of mastery?! A few minutes from you could potentially spare someone days or weeks of floundering. This is a gift few can give.

It feels good, and it’s a nice break from battering your head against unsolvable problems. Don’t restrict your help to paying customers, and, obviously, don’t give self-serving advice. Maybe they can’t buy/don’t need your solution today, but maybe someday they will.

In conclusion

There’s a time and place for being oppositional. Sometimes a vendor gets all high on their own supply, or starts making claims that aren’t just an “optimistic” spin on the facts but are provably untrue. If any vendor is operating in poor faith they deserve to be corrected.

But it’s a shitty, self-limiting stance to take as a default. We are all here to build things, not tear things down. No one builds software alone. The code you write that defines your business is just the wee tippy top of a colossal iceberg of code written by other people — device drivers, libraries, databases, graphics cards, routers, emacs. All of this value was created by other people, yet we collectively benefit.

Think of how many gazillion lines of code are required for you to run just one AWS Lambda function! Think of how much cooperation and trust that represents. And think of all the deals that brokered that trust and established that value, compensating the makers and allowing them to keep building and improving the software we all rely on.

We build software together. Vendors exist to help you. We do what we do best, so you can spend your engineering cycles doing what you do best, working on your core product. Good sales deals don’t leave anyone feeling robbed or cheated, they leave both sides feeling happy and excited to collaborate.[4]

🐝💜Charity.

[1] Yes, I know this experience is far from universal; LOTS of people in tech have not felt like their voices are heard or their expertise acknowledged. This happens disproportionately to women and other under-represented groups, but it also happens to plenty of members of the dominant groups. It’s just a really common thing! However that has not really been my experience — or if it has, I haven’t noticed — nor Abby’s, as far as I’m aware.

[2] My first brush with columnar storage systems! Which is what makes Honeycomb possible today.

[3] I have learned SO MUCH from watching the world class sales professionals we have at Honeycomb. Sales is a tough gig, and doing it well involves many disciplines — empathy, creativity, business acumen, technical expertise, and so much more. Selling to software engineers in particular means you are often dealing with cocky little shits who think they could do your job with a few lines of code. On behalf of my fellow little-shit engineers, I am sorry. 🙈

[4] Like our sales team says: “Never do a deal unless you’d do both sides of the deal.” I fucking love that.


Deploys Are The ✨WRONG✨ Way To Change User Experience

This piece was first published on the honeycomb.io blog on 2023-03-08.

….

I’m no stranger to ranting about deploys. But there’s one thing I haven’t sufficiently ranted about yet, which is this: Deploying software is a terrible, horrible, no good, very bad way to go about the process of changing user-facing code.

It sucks even if you have excellent, fast, fully automated deploys (which most of you do not). Relying on deploys to change user experience is a problem because it fundamentally confuses and scrambles up two very different actions: Deploys and releases.

Deploy

“Deploying” refers to the process of building, testing, and rolling out changes to your production software. Deploying should happen very often, ideally several times a day. Perhaps even triggered every time an engineer lands a change.

Everything we know about building and changing software safely points to the fact that speed is safety and smaller changes make for safer deploys. Every deploy should apply a small diff to your software, and deploys should generally be invisible to users (other than minor bug fixes).

Release

“Releasing” refers to the process of changing user experience in a meaningful way. This might mean anything from adding functionality to adding entire product lines. Most orgs have some concept of above or below the fold where this matters. For example, bug fixes and small requests can ship continuously, but larger changes call for a more involved process that could mean anything from release notes to coordinating a major press release.

A tale of cascading failures

Have you ever experienced anything like this?

Your company has been working on a major new piece of functionality for six months now. You have tested it extensively in staging and dev environments, even running load tests to simulate production. You have a marketing site ready to go live, and embargoed articles on TechCrunch and The New Stack that will be published at 10:00 a.m. PST. All you need to do now is time your deploy so the new product goes live at the same time.

It takes about three hours to do a full build, test, and deploy of your entire system. You’ve deployed as much as possible in advance, and you’ve already built and tested the artifacts, so all you have to do is a streamlined subset of the deploy process in the morning. You’ve gotten it down to just about an hour. You are paranoid, so you decide to start an hour early. So you kick off the deploy script at 8:00 a.m. PST… and sit there biting your nails, waiting for it to finish.

SHIT! Twenty minutes into the deploy, there’s a random flaky SSH timeout that causes the whole thing to cancel and roll back. You realize that by running a non-standard subset of the deploy process, some of your error handling got bypassed. You frantically fix it and restart the whole process.

Your software finishes deploying at 9:30 a.m., 30 minutes before the embargoed articles go live. Visitors to your website might be confused in the meantime, but better to finish early than to finish late, right? 😬

Except… as 10:00 a.m. rolls around, and new users excitedly begin hitting your new service, you suddenly find that a path got mistyped, and many requests are returning 500. You hurriedly merge a fix and begin the whole 3-hour long build/test/deploy process from scratch. How embarrassing! 🙈

Deploys are a terrible way to change user experience

The build/release/deploy process generally has a lot of safeguards and checks baked in to make sure it completes correctly. But as a result…

  • It’s slow
  • It’s often flaky
  • It’s unreliable
  • It’s staggered
  • The process itself is untestable
  • It can be nearly impossible to time it right
  • It’s very all or nothing—the norm is to roll back completely upon any error
  • Fixing a single character mistake takes the same amount of time as doubling the feature set!

Changing user-visible behaviors and feature sets using the deploy process is a great way to get egg on your face. The process is built for safely shipping large batches of code and artifacts; user experience gets changed only as a side effect.

So how should you change user experience?

By using feature flags.

Feature flags: the solution to many of life’s software problems

You should deploy your code continuously throughout the day or week. But you should wrap any large, user-visible behavior changes behind a feature flag, so you can release that code by flipping a flag.

This enables you to develop safely without worrying about what your users see. It also means that turning a feature on and off no longer requires a diff, a code review, or a deploy. Changing user experience is no longer an engineering task at all.
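
Concretely, the pattern is just a conditional around the new code path. Here is a minimal sketch in Go, where flagEnabled is a hypothetical helper standing in for whatever flag store you actually use (LaunchDarkly, a config table, an env var):

    package main

    import (
        "fmt"
        "net/http"
    )

    // flagEnabled is a stand-in for your real flag store lookup; here it is
    // hardcoded, real implementations consult a service or a cached config.
    func flagEnabled(flag, userID string) bool {
        return flag == "new-search" && userID == "internal-tester"
    }

    func searchHandler(w http.ResponseWriter, r *http.Request) {
        if flagEnabled("new-search", r.Header.Get("X-User-ID")) {
            fmt.Fprintln(w, "shiny new search experience") // released by a flag flip
        } else {
            fmt.Fprintln(w, "old search experience") // deployed code, not yet released
        }
    }

    func main() {
        http.HandleFunc("/search", searchHandler)
        http.ListenAndServe(":8080", nil)
    }

The new code ships to production whenever it is ready; flipping the flag is the release.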

Deploys are an engineering task, but releases can be done by product managers—even marketing teams. Instead of trying to calculate when to begin deploying by working backwards from 10:00 a.m., you simply flip the switch at 10:00 a.m.

Testing in production, progressive delivery

The benefits of decoupling deploys and releases extend far beyond timely launches. Feature flags are a critical tool for apostles of testing in production (spoiler alert: everybody tests in production, whether they admit it or not; good teams are aware of this and build tools to do it safely). You can use feature flags to do things like:

  • Enable the code for internal users only
  • Show it to a defined subset of alpha testers, or a randomized few
  • Slowly ramp up the percentage of users who see the new code. This is super helpful when you aren’t sure how much load it will place on a backend component (see the sketch after this list)
  • Build a new feature, but only turn it on for a couple of “early access” customers who are willing to deal with bugs
  • Ship a perf improvement that should be logically bulletproof (and invisible to the end user), but do it safely: roll it out flagged off, then do progressive delivery starting with the users/customers/segments that are lowest risk if something’s fucked up
  • Test something timezone-related in a batch process on New Zealand first (small audience, timezone far away from your engineers in PST)
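
Here is what that sticky percentage ramp might look like in code: a minimal sketch in Go, not any particular flagging library’s API (inRollout and the flag name are invented for illustration). The trick is hashing the user ID instead of rolling dice per request, so each user’s experience stays stable as you ramp:

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // inRollout returns true for a stable slice of users. Hashing the user ID
    // keeps a given user in (or out of) the rollout as you ramp percent upward.
    func inRollout(flag, userID string, percent uint32) bool {
        h := fnv.New32a()
        h.Write([]byte(flag + ":" + userID)) // salt with the flag name so rollouts don't correlate
        return h.Sum32()%100 < percent
    }

    func main() {
        for _, user := range []string{"alice", "bob", "carol"} {
            fmt.Println(user, inRollout("use-caching", user, 10)) // a 10% ramp
        }
    }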

Allowing beta testing, early adoption, etc. is a terrific way to prove out concepts, involve development partners, and have some customers feel special and extra engaged. And feature flags are a veritable Swiss Army Knife for practicing progressive delivery.

It becomes a downright superpower when combined with an observability tool (a real one that supports high cardinality, etc.), because you can:

  • Break down and group by flag name plus build id, user id, app id, etc.
  • Compare performance, behavior, or return code between identical requests with different flags enabled
  • For example, “requests to /export with the flag ‘USE_CACHING’ enabled are 3x slower than requests to /export without that flag, and 10% of them now return ‘402’”

It’s hard to emphasize enough just how powerful it is when you have the ability to break down by build ID and feature flag value and see exactly what the difference is between requests where a given flag is enabled vs. requests where it is not.

It’s very challenging to test in production safely without feature flags; with them, the possibilities are endless. Feature flags are a scalpel, where deploys are a chainsaw. The two complement each other, and both have their place.

“But what about long-lived feature branches?”

Long-lived branches are the traditional way that teams develop features, and do so without deploying or releasing code to users. This is a familiar workflow to most developers.

But there is much to be said for continuously deploying code to production, even if you aren’t exposing new surface area to the world. There are lots of subterranean dependencies and interactions that you can test and validate all along.

There’s also something very different, psychologically, about working on long-lived branches versus deploying continuously. As one of our engineering directors, Jess Mink, says:

There’s something very different, stress and motivation-wise. It’s either, ‘my code is in a branch, or staging env. We’re releasing, I really hope it works, I’ll be up and watching the graphs and ready to respond,’ or ‘oh look! A development customer started using my code. This is so cool! Now we know what to fix, and oh look at the observability. I’ll fix that latency issue now and by the time we scale it up to everyone it’s a super quiet deploy.’

Which brings me to another related point. I know I just said that you should use feature flags for shipping user-facing stuff, but being able to fix things quickly makes you much more willing to ship smaller user-facing fixes. As our designer, Sarah Voegeli, said:

With frequent deploys, we feel a lot better about shipping user-facing changes via deploy (without necessarily needing a feature flag), because we know we can fix small issues and bugs easily in the next one. We’re much more willing to push something out with a deploy if we know we can fix it an hour or two later if there’s an issue.

Everything gets faster, which instills more confidence, which means everything gets faster. It’s an accelerating feedback loop at the heart of your sociotechnical system.

“Great idea, but this sounds like a huge project. Maybe next year.”

I think some people have the idea that this has to be a huge, heavyweight project that involves signing up for a SaaS, forking over a small fortune, and changing everything about the way they build software. While you can do that—and we’re big fans/users of LaunchDarkly in particular—you don’t have to, and you certainly don’t have to start there.

As Mike Terhar from our customer success team says, “When I build them in my own apps, it’s usually just something in a ‘configuration’ database table. You can make a config that can enable/disable, or set a scope by team, user, region, etc.”
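
A sketch of what Mike is describing, assuming a Postgres-style database and an invented feature_flags table (schema and names are illustrative, not any particular product’s API):

    package flags

    import (
        "context"
        "database/sql"
    )

    // Assumed schema (invented for this sketch):
    //
    //   CREATE TABLE feature_flags (
    //       name    TEXT NOT NULL,
    //       scope   TEXT NOT NULL DEFAULT '*',   -- '*', or e.g. 'team:payments', 'region:eu'
    //       enabled BOOLEAN NOT NULL DEFAULT false,
    //       PRIMARY KEY (name, scope)
    //   );
    //
    // Enabled checks a flag for a scope; a scoped row overrides the '*' default.
    func Enabled(ctx context.Context, db *sql.DB, name, scope string) (bool, error) {
        var enabled bool
        err := db.QueryRowContext(ctx,
            `SELECT enabled FROM feature_flags
             WHERE name = $1 AND scope IN ($2, '*')
             ORDER BY CASE scope WHEN '*' THEN 1 ELSE 0 END
             LIMIT 1`, name, scope).Scan(&enabled)
        if err == sql.ErrNoRows {
            return false, nil // unknown flags default to off
        }
        return enabled, err
    }

Flip the row with an UPDATE and the feature flips for that team, user, or region, no deploy required.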

You don’t have to get super fancy to decouple deploys from releases. You can start small. Eliminate some pain today.

In conclusion

Decoupling your deploys and releases frees your engineering teams to ship small changes continuously, instead of sitting on branches for a dangerous length of time. It empowers other teams to own their own roadmaps and move at their own pace. It is better, faster, safer, and more reliable than trying to use deploys to manage user-facing changes.

If you don’t have feature flags, you should embrace them. Do it today! 🌈


Live Your Best Life With Structured Events

If you’re like most of us, you learned to debug as a baby engineer by way of printf(3). By the time you were shipping code to production you had probably learned to instrument your code with a real metrics library. Maybe a tenth of us learned to use gdb and still step through functions on the regular. (I said maybe.)

Printing stuff to stdout is still the Swiss Army knife of tools. Always there when you reach for it, usually helps more than it does harm. (I said usually.)

And then! In case you’ve been living under a rock, we recently went and blew up ye aulde monolythe, and in the process we … lost most of our single-process tools and techniques for debugging. Forget gdb; even printf doesn’t work when you’re hopping the network between functions.

If your tool set no longer works for you, friend, it’s time to go all in. Maybe what you wanted was a faster horse, but it’s time for a car, and the sooner you turn in your oats for gas cans and a spare tire, the better.

Exercising Good Technical Judgment (When You Don’t Have Any)

If you’re stuck trying to debug modern problems with pre-modern tooling, the first thing to do is stop digging the hole. Stop pushing good data after bad into formats and stores that aren’t going to help you answer the right questions.

In brief: if you aren’t rolling out a solution based on arbitrarily wide, structured raw events that are unique and ordered and trace-aware and without any aggregation at write time, you are going to regret it. (If you aren’t using OpenTelemetry, you are going to regret that, too.)

So just make the leap as soon as possible.

But let’s rewind a bit.  Let’s start with observability.

Observability: an introduction

Observability is not a new word or concept, but the definition of observability as a specific technical term applied to software engineering is relatively new — about four years old. Before that, if you heard the term in softwareland it was only as a generic synonym for telemetry (“there are three pillars of observability”, in one annoying formulation) or team names (twitter, for example, has long had an “observability team”).

The term itself originates with control theory:

“In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals. The concept of observability was introduced by Hungarian-American engineer Rudolf E. Kálmán for linear dynamic systems.”

But when applied to a software context, observability refers to how well you can understand and reason about your systems, just by interrogating them and inspecting their outputs with your tools. How well can you understand the inside of the system from the outside?

Achieving this relies on your ability to ask brand new questions, questions you have never encountered and never anticipated — without shipping new code. Shipping new code is cheating, because it means that you knew in advance what the problem was in order to instrument it.

But what about monitoring?

Monitoring has a long and robust history, but it has always been about watching your systems for failures you can define and expect. Monitoring is for known-unknowns, for setting thresholds and running checks against the system. Observability is about the unknown-unknowns, which requires a fundamentally different mindset and toolchain.

“Monitoring is the action of observing and checking the behavior and outputs of a system and its components over time.” — @grepory, in his talk “Monitoring is Dead”.

Monitoring is a third-person perspective on your software. It’s not software explaining itself from the inside out, it’s one piece of software checking up on another.

Observability is for understanding complex, ephemeral, dynamic systems (not for debugging code)

You don’t use observability for stepping through functions; it’s not a debugger.  Observability is for swiftly identifying where in your system the error or problem is coming from, so you can debug it — by reproducing it, or seeing what it has in common with other erroring requests.  You can think of observability as being like B.I. (business intelligence) tooling for software applications, in the way you engage in a lot of exploratory, open-ended data sifting to detect novel patterns and behaviors.

Observability is often about swiftly isolating or tracking down the problem in your large, sprawling, far-flung, dynamic system. Because the hard part of distributed systems is rarely debugging the code, it’s figuring out where the code you need to debug is.

The need for observability is often associated with microservices adoption, because they are prohibitively difficult to debug without service-level event oriented tooling — the kind you can get from Honeycomb and Lightstep.. and soon, I hope, many other vendors.

Events are the building blocks of observability

Ergh, another overloaded data term. What even is an “event”?

An observability “event” is a hop in the lifecycle of an end-to-end request. If a request executes code on three services separated by network hops before returning to the user, that request generated three observability “events”, each packed with context and details about that code running in that environment. These are also sometimes called “canonical log lines”. If you implemented tracing, each event may be a span in your trace.

If request ID #A897BEDC hits your edge, then your API service, then four more internal services, and twice connects to a db and runs a query, then request ID #A897BEDC generated 8 observability events … assuming you are in fact gathering observability data from the edge, the API, the internal services and the databases.

This is an important caveat. We only gather observability events from services that we can and do introspect. If it’s a black box to us, that hop cannot generate an observability event. So if request ID #A897BEDC also performed 20 cache lookups and called out to 8 external HTTP services and 2 managed databases, those 30 hops do not generate observability events (assuming you haven’t instrumented the memcache service and have no instrumentation from those external services/dbs). Each request generates one event per request per service hop.**

(I also wrote about logs vs structured events here.)

Observability is a first-person narrative.

We care primarily about self-reported status from the code as it executes the request path.

Instrumentation is your eyes and ears, explaining the software and its environment from the perspective of your code. Monitoring, on the other hand, is traditionally a third-person narrative — it’s one piece of software checking up on another piece of software, with no internal knowledge of its hopes and dreams.

First-person narrative reports have the best potential for telling a reliable narrative.  And more importantly, they map directly to user experience in a way that third-party monitoring does not and cannot.

Events … must be structured.

First, structure your goddamn data.  You’re a computer scientist, you’ve got no business using text search to plow through terabytes of text.

Events …  are not just structured logs.

Now, part of the reason people seem to think structured data is cost-prohibitive is that they’re doing it wrong.  They’re still thinking about these like log lines.  And while you can look at events like they’re just really wide structured log lines that aren’t flushed to disk, here’s why you shouldn’t: logs have decades of abhorrent associations and absolutely ghastly practices.

Instead of bundling up and passing along one neat little pile of context, they’re spewing log lines inside loops in their code and DDoS’ing their own logging clusters. They’re shitting out “log lines” with hardly any dimensions so they’re information-sparse and just straight up wasting the writes. And then to compensate for the sparseness and repetitiveness they just start logging the same exact nouns tens or hundreds of times over the course of the request, just so they can correlate or reconstruct some lousy request that they never should have blown up in the first place!

But they keep hearing they should be structuring their logs, so they pile structure on to their horrendous little strings, which pads every log line by a few bytes, so their bill goes up but they aren’t getting any benefit! Just paying more! What the hell, structuring is bullshit!

Kittens. You need a fundamentally different approach to reap the considerable benefits of structuring your data.

But the difference between strings and structured data is ~basically the difference between grep and all of computer science. 😛

Events … must be arbitrarily wide and dense with context.

So the most effective way to structure your instrumentation, to get the absolute most bang for your buck, is to emit a single arbitrarily wide event per request per service hop. At Honeycomb, the maturely instrumented datasets that we see are often 200-500 dimensions wide.  Here’s an event that’s just 20 dimensions wide:

{
   "timestamp":"2018-11-20 19:11:56.910",
   "az":"us-west-1",
   "build_id":"3150",
   "customer_id":"2310",
   "durationMs":167,
   "endpoint":"/api/v2/search",
   "endpoint_shape":"/api/v2/search",
   "fraud_dur":131,
   "hostname":"app14",
   "id":"f46691dfeda9ede4",
   "mysql_dur":"",
   "name":"/api/v2/search",
   "parent_id":"",
   "platform":"android",
   "query":"",
   "serviceName":"api",
   "status_code":200,
   "traceId":"f46691dfeda9ede4",
   "user_id":"344310",
   "error_rate":0,
   "is_root":"true"
}

So a well-instrumented service should have hundreds of these dimensions, all bundled around the context of each request. And yet — and here’s why events blow the pants off of metrics — even with hundreds of dimensions, it’s still just one write. Adding more dimensions to your event is effectively free, it’s still one write plus a few more bits.

Compare this to metrics-based systems, where you are often in the position of trying to predict whether a metric will be valuable enough to justify the extra write, because every single metric or tag you add contributes linearly to write amplification. Ever gotten billed tens of thousands of dollars for your custom metrics, or had to prune your list of useful custom metrics down to something affordable? (“BUT THOSE ARE THE ONLY USEFUL ONES!”, as every ops team wails)

Events … must pass along the blob of context as the request executes

As you can imagine, it can be a pain in the ass to keep passing this blob of information along the life of the request as it hits many services and databases. So at Honeycomb we do all the annoying parts for you with our integrations. You just install the go pkg or ruby gem or whatever, and under the hood we:

  1. initialize an empty debug event when the request enters that service
  2. prepopulate the empty debug event with any and all interesting information that we already know or can guess: language type, version, environment, etc.
  3. create a framework so you can just stuff any other details in there as easily as if you were printing it out to stdout
  4. pass the event along and maintain its state until you are ready to error or exit
  5. write the extremely wide event out to honeycomb

Easy!
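
For the curious, here is a toy version of that flow in Go. It is not the actual beeline code, just the shape of those five steps, with the “write it out to honeycomb” part faked as a single JSON line on stderr:

    package main

    import (
        "encoding/json"
        "net/http"
        "os"
        "runtime"
        "time"
    )

    type event map[string]interface{}

    // wideEvent sketches what the integrations do for you: create one event at
    // request entry, prepopulate it, let the handler enrich it along the way,
    // and flush one arbitrarily wide write at exit.
    func wideEvent(next func(ev event, w http.ResponseWriter, r *http.Request)) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            start := time.Now()
            hostname, _ := os.Hostname()
            ev := event{ // 1. initialize an empty debug event on entry
                "go_version": runtime.Version(), // 2. prepopulate what we already know
                "hostname":   hostname,
                "endpoint":   r.URL.Path,
                "method":     r.Method,
            }
            next(ev, w, r) // 3 & 4. the handler stuffs in details; state rides along
            ev["duration_ms"] = time.Since(start).Milliseconds()
            json.NewEncoder(os.Stderr).Encode(ev) // 5. one wide write at exit
        }
    }

    func main() {
        http.HandleFunc("/api/v2/search", wideEvent(func(ev event, w http.ResponseWriter, r *http.Request) {
            ev["user_id"] = r.Header.Get("X-User-ID") // add anything; it is still one write
            w.Write([]byte("ok"))
        }))
        http.ListenAndServe(":8080", nil)
    }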

(Check out this killer talk from @lyddonb on … well everything you need to know about life, love and distributed systems is in here, but around the 12:00 mark he describes why this approach is mandatory. WATCH IT. https://www.youtube.com/watch?v=xy3w2hGijhE&feature=youtu.be)

Events … should collect context like sticky buns collect dust

Other stuff you’ll want to track in these structured blobs includes:

  1. Metadata like src, dst headers
  2. The timing stats and contents of every network call (our beelines wrap all outgoing http calls and db queries automatically)
  3. Every raw db query, normalized query family, execution time etc
  4. Infra details like AZ, instance type, provider
  5. Language/environment context like $lang version, build flags, $ENV variables
  6. Any and all unique identifying bits you can get your grubby little paws on — request ID, shopping cart ID, user ID, transaction ID, any other ID … these are always the highest value data for debugging.
  7. Any other useful application context.  Service name, build id, ordering info, error rates, cache hit rate, counters, whatever.
  8. Possibly the system resource state at this point in time.  e.g. values from /proc/net/ipv4 stats

Capture all of it. Anything that ever occurs to you (“this MIGHT be handy someday”) — don’t even hesitate, just throw it on the pile. Collect it up in one rich fat structured blob.

Events … must be unique, ordered, and traceable

You need a unique request ID, and you need to propagate it through your stack in some way that preserves sequence. Once you have that, traces are just a beautiful visualization layer on top of your shiny event data.
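
The minimum viable version of this is a header you honor and forward. The header name below is illustrative; real tracing systems standardize this (B3, W3C traceparent) and add span/parent IDs to preserve sequence. But the mechanics are this simple, sketched in Go:

    package propagation

    import (
        "crypto/rand"
        "encoding/hex"
        "net/http"
    )

    // requestID returns the inbound request ID, or mints one if we are the edge.
    func requestID(r *http.Request) string {
        if id := r.Header.Get("X-Request-ID"); id != "" {
            return id // propagate the existing ID so every hop shares it
        }
        b := make([]byte, 8)
        rand.Read(b) // sketch-grade; a real implementation would check the error
        return hex.EncodeToString(b)
    }

    // callDownstream forwards the ID on an outbound call (the URL is made up).
    func callDownstream(id string) (*http.Response, error) {
        req, err := http.NewRequest("GET", "http://inventory.internal/check", nil)
        if err != nil {
            return nil, err
        }
        req.Header.Set("X-Request-ID", id)
        return http.DefaultClient.Do(req)
    }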

Events … must be stored raw.

Because observability means you need to be able to ask any arbitrary new question of your system without shipping new code, and aggregation is a one-way trip. Once you have aggregated your data and discarded the raw requests, you have destroyed your ability to ask new questions of that data forever. For Ever.

Aggregation is a one-way trip.  You can always, always derive your pretty metrics and dashboards and aggregates from structured events, and you can never go in reverse. Same for traces, same for logs. The structured event is the gold standard. Invest in it now, save your ass in the future.

It’s only observability if you can ask new questions. And that means storing raw events.

Events … are richer than metrics

There’s always tradeoffs when it comes to data. Metrics choose to sacrifice context and connective tissue, and sometimes high cardinality support, which you need to correlate anomalies or track down outliers. They have a very small, efficient data format, but they sacrifice everything else by discarding all but the counter, gauge, etc.

A metric looks like this, by the way.

{ metric: "db.query.time", value: 0.502, tags: Array(), type: set }

That’s it. It’s just a name, a number and maybe some tags. You can’t dig into the event and see what else was happening when that query was strangely slow. You can never get that information back after discarding it at write time.

But because they’re so cheap, you can keep every metric for every request! Maybe. (Sometimes.) More often, what happens is they aggregate at write time. So you never actually get a value written out for an individual event, it smushes everything together that happens in the 1 second interval and calculates some aggregate values to write out. And that’s all you can ever get back to.

With events, and their relative explosion of richness, we sacrifice our ability to store every single observability event about every request. At FB, every request generated hundreds of observability events as it made its way through the stack. Nobody, NOBODY is going to pay for an o11y stack that is hundreds of times as large as production. The solution to that problem is sampling.

Events … should be sampled.

But not with dumb, blunt sampling on the server side. Control it on the client side.

Then sample heavily for events that are known to be common and useless, but keep the events that have interesting signal. For example: health checks that return 200 OK usually represent a significant chunk of your traffic and are basically useless, while 500s are almost always interesting. So are all requests to /login or /payment endpoints, so keep all of them. For database traffic: SELECTs for health checks are useless, DELETEs and all other mutations are rare but you should keep all of them. Etc.
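
In code, dynamic sampling can be as simple as a function that maps each event to a sample rate, where a rate of N means “keep 1 in N”. A sketch of the idea in Go, not any particular library’s API (the endpoints and rules are illustrative):

    package sampling

    import "math/rand"

    // sampleRate maps an event to N, meaning "keep 1 in N" (1 = keep everything).
    // Endpoints and rules here are illustrative; tune them to your own traffic.
    func sampleRate(endpoint string, statusCode int, isMutation bool) int {
        switch {
        case statusCode >= 500:
            return 1 // errors are almost always interesting
        case isMutation:
            return 1 // DELETEs and other mutations are rare; keep them all
        case endpoint == "/login" || endpoint == "/payment":
            return 1 // high-value endpoints; keep them all
        case endpoint == "/healthz" && statusCode == 200:
            return 1000 // healthy health checks are useless; keep 1 in 1000
        default:
            return 10
        }
    }

    // keep decides whether to send the event. Store the rate on the event itself
    // so the backend can multiply the math back out at query time.
    func keep(rate int) bool {
        return rand.Intn(rate) == 0
    }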

You don’t need to treat your observability metadata with the same care as you treat your billing data. That’s just dumb.

… To be continued.

I hope it’s now blazingly obvious why observability requires — REQUIRES — that you have access to raw structured events with no pre-aggregation or write-time rollups. Metrics don’t count. Traces alone don’t count. Unstructured logs sure as fuck don’t count.

Structured, arbitrarily wide events, with dynamic sampling of the boring parts to control costs. There is no substitute.

For more about the technical requirements for observability, read this, this, or this.

**The deep fine print: it’s one observability event per request per service hop … because we gather observability detail organized by request id.  Databases may be different.  For example, with MongoDB or MySQL, we can’t instrument them to talk to honeycomb directly, so we gather information about its internal perspective by 1) tailing the slow query log (and turning it up to log all queries if perf allows), 2) streaming tcp over the wire and reconstructing transactions, 3) connecting to the mysql port as root every couple seconds from cron, then dumping all mysql stats and streaming them in to honeycomb as an event.  SO.  Database traffic is not organized around connection length or unique request id, it is organized around transaction id or query id.  Therefore it generates one observability event per query or transaction. 
In other words: if your request hit the edge, API, four internal services, two databases … but ran 1 query on one db and 10 queries on the second db … you would generate a total of *17 observability events* for this request: six service-hop events (edge, API, four internal services) plus eleven query events.
For more on observability for databases and other black boxes, try this blog post.

How “Engineering-Driven” Leads to “Engineering-Supremacy”

Honeycomb has a reputation for being a very engineering-driven company. No surprise there, since it was founded by two engineers and our mission involves building an engineering product for other engineers.

We are never going to stop being engineering-driven in the sense that we are building for engineers and we always want engineers to have a seat at the table when it comes to what we build, and how, and why. But I am increasingly uncomfortable with the term “engineering-driven” and the asymmetry it implies.

We are less and less engineering-driven nowadays, for entirely good reasons — we want this to be just as much of a design-driven company and a product-driven company, and I would never want sales or marketing to feel like anything other than equal partners in our journey towards revolutionizing the way the world builds and runs code in production.

It is true that most honeycomb employees were engineers for the first few years, and our culture felt very engineer-centric. Other orgs consisted of maybe a person or two, or had engineers trying to play those roles on TV, or just felt highly experimental.

But if there is one thing Christine and I were crystal clear on from the beginning, it’s this:

✨WE ARE HERE TO BUILD A BUSINESS✨

Not just the shiniest, most hardcore tech, not just the happiest, most diverse teams. These things matter to us — they matter a lot! But succeeding at business is what gives us the power to change the world in all of these other ways we care about changing it.

If business is booming and people are thirsty for more of whatever it is you’re serving, you pretty much get a blank check for radical experiments in sociotechnical transformation, be that libertarian or communitarian or anything in between.

If you don’t have the business to back it up, you get fuck-all.

Not only do you not get shit, you risk being pointed out as a Cautionary Tale of “what happens if $(thing you deeply care about) comes true.” Sit on THAT for a hot second. 😕

So yes, we cared. Which is not to say we knew how to build a great business. We most certainly did not. But both of us had been through too many startups (Linden Lab, Aardvark, Parse, etc) where the tech was amazing, the people were amazing, the product was amazing … and the business side just did not keep up. Which always leads to the same thing: heartbreak and devastation.

✨IF you can’t make it profitable✨
✨your destiny will inevitably be✨
✨taken out of your hands✨
✨and given to someone else.✨

So Christine and I have been repeating these twin facts back and forth to each other for over six years now:

  1. Honeycomb must succeed as a revenue-generating, eventually profitable business.
  2. We are not business experts. Therefore we have to make Honeycomb a place that explicitly values business expertise, that places it on the same level as engineering expertise.

We have worked hard to get better at understanding the business side (her more than me 🙃) but ultimately, we cannot be the domain experts in marketing or sales (or customer success, support, etc).

What we could do is demonstrate respect for those functions, bake that respect into our culture, and hire and support amazing business talent to run them with us.

On being “engineering-driven”

Self-described “engineering-driven” companies tend to fall into one of two traps. Either they alienate the business side by pinching their nose and holding business development at arm’s length (“aaahhh, i’m just an engineer! I have no interest in or capacity for participating in developing our marketing voice or sales pitch decks, get off me!”), or they act like engineering is a sort of super-skillset that makes you capable of doing everybody else’s job better than they possibly could. As though those other disciplines and skill sets aren’t every bit as deep and creative and challenging in their own right as developing software can be.

For the first few years we really did use engineers for all of those functions. We were trying to figure out how to build and explain and sell something new, which meant working out these things on the ground every day with our users. Engineer to engineer. What resonated? What clicked? What worked?

So we hired just a few engineers who were interested in how the business worked, and who were willing to work like Swiss Army knives across the org. We didn’t yet have a workable plan in place, which is what you need in order to bring domain experts on board, point them in a direction, and trust them to do what they do best in executing that plan.

Like I said, we didn’t know what to do or how to do it. But at least we knew that. Which kept us humble. And translated into a hard, fast rule which we set early on in our hiring process:

We DO NOT hire engineers who talk shit about sales and marketing.

If I was interviewing an engineer and they made any alienating sort of comment whatsoever about their counterparts on the business side, it was an automatic no. Easy out. We had a zero-tolerance policy for talking down about other functions, even jokingly, and for being unwilling to perform other business functions.

In retrospect, I think this is one of the best decisions we ever made.

Hiring engineers who respected other functions

We leaned hard into hiring engineers who asked curious questions about our business strategy and execution. We pursued engineers who talked about wanting to spend time directly with users, who were intrigued by the idea of writing marketing copy to help explain concepts to engineers, and who were ready, willing and able to go along on sales calls.

Once we finally found product-market fit, about 2-3 years ago, we stopped using engineers to play other roles and started hiring actual professionals in product, design, marketing, sales, customer success, etc to build and staff out their organizations. That was when we first started building the business for the longer term; until we found PMF, our event horizon was never more than 1-3 months ahead of “right now”.

(I’ll never forget going out to coffee with one of our earlier VPs of marketing, shortly after she was hired, and having her ask, in bemusement: “Why are all these engineers just sitting around in the #marketing channel? I’ve never had so many people giving opinions on my work!” 🙃)

Those early Swiss Army knife engineers have since stepped back, gratefully, into roles more centered around engineering. But that early knee-jerk reaction of ours established an important company norm that pays dividends to this day.

Every function of the business is equally challenging, creative, and worthy of respect. None of us are here to peacock; our skill sets serve the primary business goals. We all do our jobs better when we know more about how each other’s functions work.

These days, it’s not just about making sure we hire engineers who treat business counterparts like equals. It’s more about finding ways to stimulate the flow of information cross-functionally, creating a hunger for this information.

Caring about the big picture is a ✨learnable skill✨

You can try to hire people who care about the overall business outcomes, not just their own corner of reality, and we do select for this to some degree — for all roles, not just engineers.

But you can also foster this curiosity and teach people to seek it out. Curiosity begets curiosity, and every single person at Honeycomb is doing something interesting. We all want to succeed and win together, and there’s something infectious and exciting about connecting all the dots that lead to success and reflecting that story back to the rest of the company.

For example,

    1. Every time we close a deal, a post gets written up and dropped into the announcements slack by the sales team. Not just who we closed and how much money we made, but the full story of that customer’s interaction with honeycomb. How did they hear of us? Whose blog posts, training sessions, or office hours did they engage with? Did someone on the telemetry team pull off a record-fast turnaround on an integration they needed to get going? What pains did we solve effectively for them as a tool, and where were the rough edges that we can improve on in the future?

      The story is often half a page long or more, and tags a dozen or more people throughout all parts of the company, showing how everyone’s hard work added up and materially contributed to the final result.

    2. We have orgs take turns presenting at all hands — where they’re at, what they’ve built, and the impact of their contributions, week after week. Whether that’s the design team talking about how they’ve rolled out our new design system and how it is going to help everyone in the company experiment more and move more quickly, or it’s the people team showing how they’ve improved our recruiting, interviewing, and hiring processes to make people feel more seen, welcomed, and appreciated throughout the process.

      We expect people to be curious about the rest of the company. We expect honeybees to be interested in, excited about and celebratory of each other’s hard work. And it’s easy to be excited when you see people showing off work that they were excited to do.

    3. We have a weekly Friday “demo day” where people come and show off something they’ve built this week, rapid-fire. Whether it’s connecting to a mysql shell in the terminal to show off our newly consistent permissioning scheme, or product marketing showing off new work on the website. Everybody’s work counts. Everybody wants to see it.

    4. We have a #love channel in slack where you can drop in and tag someone when you’re feeling thankful for how much they just made your day better. We also have a “Gratefuls” section during all hands, where people speak up and give verbal props to coworkers who have really made a difference in their lives at work.

We have always attracted engineers who care about the business, not just the technology and the culture. As a result, we have consistently recruited and retained business leaders who are well above our weight class — our investors still sometimes marvel at the caliber of the business talent we have been able to attract. It is way above the norm for developer tools companies like ours.

“Engineering-driven” can be a mask for “engineering-supremacy”

Because the sad truth is that so many companies who pride themselves on being “engineering-driven” are actually what I would call more “engineering-supremacist”. Ask any top-tier sales or marketing leader out there about their experiences in the tech industry and you’ll hear a painful, rage-inducing list of times they were talked down to by technical founders, had their counsel blown off or overridden, had their plans scrapped and their budgets cut, and every other sort of disrespectful act you can imagine.

(I am aware that the opposite also exists; that there are companies and cultures out there that valorize and glorify sales or marketing while treating engineers like code monkeys and button pushers, but it’s less common around here. In neither direction is this okay.)

This isn’t good for business, and it isn’t good for people.

It is still true that engineering is the most mature and developed organization at the company, because it has been around the longest. But our other orgs are starting to catch up and figure out what it means to “be honeycomby” for them, on their own terms. How do our core values apply to the sales team, the developer evangelists, the marketing folks, the product people? We are starting to see this play out in real time, and it’s fascinating. It is better than forcing all teams to be “engineering-driven”.

Business success is what makes all things possible

We are known for the caliber of our engineering today. But none of that matters a whit if you never hear about us, or can’t buy us in a way that makes sense for you and your team, or if you can’t use the product, or if we don’t keep building the right things, the things you need to modernize your engineering teams and move into the future together.

When looking towards that future, I still want us to be known for our great engineering. But I also want us to be a magnet for great designers who trust that they can come here and be respected, for great product people who know they can come here and do the best work of their life. That won’t happen if we see ourselves as being “driven” by one third of the triad.

Supremacy destroys balance. Always.

And none of this, none of this works unless we have a surging, thriving business to keep the wind in our sails.

~charity


How Much Should My Observability Stack Cost?

First posted on 2021-08-18 at https://www.honeycomb.io/blog/how-much-should-my-observability-stack-cost

What should one pay for observability? What should your observability stack cost? What should be in your observability stack?

How much observability is enough? How much is too much, or is there such a thing?

Is it better to pay for one product that claims (dubiously) to do everything, or twenty products that are each optimized to do a different part of the problem super well?

It’s almost enough to make a busy engineer say “Screw it, I’m spinning up Nagios”.

(Hey, I said almost.)

All of these service providers can give you sticker shock when you begin investigating them. The biggest reason is always that we aren’t used to considering the price of our own time.  We act like it’s “free” to just take an hour and spin something up … we don’t count the cost of maintenance, context switching, and opportunity costs of not using the time to build something of business value.  Which is both understandable and forgivable, as a starting point.

Considerably less forgivable is the vagueness–and sometimes outright misdirection and scare tactics–some vendors offer around pricing. It’s not ok for a business to optimize for revenue at the expense of user experience. As users, we have the right to demand transparency and accurate information.  As vendors, we have the responsibility to provide it.  Any pricing scheme that doesn’t align with best practices and users’ interests will be a drag on reputation and growth.

The core question, rarely addressed outright, is: how much should you pay? In this post I’ll talk about what your observability costs include, and in the next post, what you should consider including in your “observability stack”.

But I’ll give you the answer to your question right off the bat: you should probably spend 20-30% of infra costs on observability.

O11y spend should be 20-30% of infra spend

Rule of thumb: your observability spend should come to 20-30% of your infra spend. (I’ve seen 10% a few times from reasonable-seeming shops, but they have been edge cases and outliers. I have also seen 50% or more, but again, outliers.)

Full disclosure: this isn’t based on any particular science.  It’s just based on my experience of 15+ years working in operations engineering, talking to other engineers and managers, and a couple of informal Twitter polls to satisfy my own curiosity.

Nevertheless, it’s a pretty solid rule. There are exceptions, but in general, if you’re spending less than 20%, you’re “saving money” at the expense of engineering time, or being silently dragged underwater by a million little time leaks and quality of service issues — which you could eliminate completely with a bit of investment.

Consider the person who told me proudly that his o11y spend was just 1-3%. (He meant the PagerDuty bill and Pingdom checks, actually.) He wasn’t counting the dedicated hardware for their ELK cluster (80k/month), or the 2-3 extra engineers they had to recruit, hire, and train (250-300k/year apiece) to run the many open source tools they got for “free”.

And ultimately, it didn’t meet their needs very well. Few people knew how to use it, so they leaned on the “observability team” to craft custom views, write scripts and ETL one-offs, and serve as the institutional hive mind and software usability tutors.  They could have used better tools, ones under active development by large product teams.  They could have used that headcount to create core business value instead.

Engineers cost money

Engineers are expensive. Recruiting them is hard. The good ones are increasingly unwilling to waste time on unnecessary labor. This manager was “saving” maybe a million dollars a year (he mentioned a vendor quote of less than 100k/month)–but spending a couple million more than that in less-visible ways.
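
For the record, here is the back-of-envelope math on that “free” setup, in Python, using the numbers above. A rough sketch only: the midpoints are my assumption, not his, and the opportunity cost of all that lost engineering time is not even counted.

```python
# Back-of-envelope for the "free" ELK setup described above.
# Midpoints are assumptions; opportunity cost is excluded entirely.

elk_hardware_monthly = 80_000       # dedicated ELK cluster hardware
extra_engineers = 2.5               # midpoint of the 2-3 extra hires
engineer_cost_yearly = 275_000      # midpoint of 250-300k/year apiece

hidden_yearly = (elk_hardware_monthly * 12
                 + extra_engineers * engineer_cost_yearly)

vendor_quote_yearly = 100_000 * 12  # the quote he balked at

print(f"hidden cost of 'free' tooling: ~${hidden_yearly:,.0f}/year")
print(f"rejected vendor quote:   under ${vendor_quote_yearly:,.0f}/year")
```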

Worse, he was driving his engineering org into the ground by wasting so much of their time and energy on non-mission-critical work, inferior tooling, one-offs, frustrating maintenance work, etc, all of which had nothing to do with their core business value.

If you want to know if an org hires and retains good engineers, you could do worse than to ask the question: “What tools do you use, and why?”

  • Good orgs use good tools. They know engineering cycles are their scarcest and most valuable resource, and they want to train maximum firepower on their core business problems.
  • Mediocre orgs use mediocre tools, have no discipline or consistency around adoption and deprecation, and leak lost engineering cycles everywhere.

So back to our rule of thumb: observability amounting to 20-30% of infra spend is where most shops should fall. This refers to cloud-native infrastructure, using third-party services to instrument and monitor code, with the basics covered — resource utilization graphs, end-to-end checks, paging, etc.

So, what do I need in my “observability stack”?

What are the basics? Well, obviously “it depends”. It depends on your requirements, your components, your commitments, your budget, sunk costs and skill sets, your teams, and most expensive of all — customer expectations and the cost of violating them. You should think carefully about these things and try to draw a straight line from the business case to the money you spend (or don’t spend). And don’t forget to factor in those invisible human costs.


How Much Should My Observability Stack Cost?

Notes on the Perfidy of Dashboards

The other day I said this on twitter —

… which stirred up some Feelings for many people. 🙃  So I would like to explain my opinions in more detail.

Static vs dynamic dashboards

First, let’s define the term. When I say “dashboard”, I mean STATIC dashboards, i.e. collections of metrics-based graphs that you cannot click on to dive deeper or break down or pivot. If your dashboard supports this sort of responsive querying and exploration, where you can click on any graph to drill down and slice and dice the data arbitrarily, then breathe easy — that’s not what I’m talking about. Those are great. (I don’t really consider them dashboards, but I have heard a few people refer to them as “dynamic dashboards”.)

Actually, I’m not even “against” static dashboards. Every company has them, including Honeycomb. They’re great for getting a high level sense of system functioning, and tracking important stats over long intervals. They are a good starting point for investigations. Every company should have a small, tractable number of these which are easily accessible and shared by everyone.

Debugging with dashboards: it’s a trap

What dashboards are NOT good at is debugging, or understanding or describing novel system states.

I can hear some of you now: “But I’ve debugged countless super-hard unknown problems using only static dashboards!” Yes, I’m sure you have. If all you have is a hammer, you CAN use it to drive screws into the wall, but that doesn’t mean it’s the best tool. And it takes an extraordinary amount of knowledge and experience to be able to piece together a narrative that translates low-level system statistics into bugs in your software and back. Most software engineers don’t have that kind of systems experience or intuition…and they shouldn’t have to.

Why are dashboards bad for debugging? Think of it this way: every dashboard is an answer to a question someone asked at some point. Your monitoring system is probably littered with dashboards, thousands and thousands of them, most of whose questions have been long forgotten and many of whose source data streams have long since gone silent.

So you come along trying to investigate something, and what do you do? You start skimming through dashboards, eyes scanning furiously, looking for visual patterns — e.g. any spikes that happened around the same time as your incident. That’s not debugging, that’s pattern-matching. That’s … eyeball racing.

If we did math like we do dashboards

Imagine you’re in a math competition, and you get handed a problem to solve. But instead of pulling out your pencil and solving the equation, step by step, you start hollering out guesses.

“27!”
“19992.41!”
“1/4325!”

That’s what flipping through dashboards feels like to me. You’re riffling through a bunch of graphs that were relevant to some long-ago situation, without context or history, without showing their work. Sometimes you’ll spot the exact scenario, and — huzzah! — the number you shout is correct! But when it comes to unknown scenarios, the odds are not in your favor.

Debugging looks and feels very different from flipping through answers. You ask a question, examine the answer, and ask another question based on the result. (“Which endpoints were erroring? Are all of the requests erroring, or only some? What did they have in common?”, etc.)

You methodically put one foot in front of the other, following the trail of bread crumbs, until the data itself leads you to the answer.

The limitations of metrics and dashboards

Unfortunately, you cannot do that with metrics-based dashboards, because you stripped away the connective tissue of the event back when you wrote the metrics out to disk.

If you happened to notice while skimming through dashboards that your 404 errors spiked at 14:03, and your /payment and /import endpoints started erroring at 14:03, and your database started returning a bunch of mysql errors shortly after 14:00, you’ll probably assume that they’re all related and leap to find more evidence that confirms it.

But you cannot actually confirm that those events are the same ones, not with your metrics dashboards. You cannot drill down from errors to endpoints to error strings; for that, you’d need a wide structured data blob per request. Those might in fact be two or three separate outages or anomalies happening at the same time, or just the tip of the iceberg of a much larger event, and your hasty assumptions might extend the outage for much longer than was necessary.

With metrics, you tend to find what you’re looking for. You have no way to correlate attributes between requests or ask “what are all of the dimensions these requests have in common?”, or to flip back and forth and look at the request as a trace. Dashboards can be fairly effective at surfacing the causes of problems you’ve seen before (raise your hand if you’ve ever been in an incident review where one of the follow up tasks was, “create a dashboard that will help us find this next time”), but they’re all but useless for novel problems, your unknown-unknowns.
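
To make that concrete, here is a toy sketch in Python of the kind of question you can only ask when each request survives as one wide event. The field names are hypothetical; the point is that the answer falls out of the data itself, and that no pile of pre-aggregated counters could give it to you.

```python
# One wide, structured event per request (field names are hypothetical).
events = [
    {"endpoint": "/payment", "status": 404, "build_id": "789782", "db_error": "lock_timeout"},
    {"endpoint": "/import",  "status": 404, "build_id": "789782", "db_error": "lock_timeout"},
    {"endpoint": "/export",  "status": 200, "build_id": "789781", "db_error": None},
]

errors = [e for e in events if e["status"] >= 400]

# Which dimensions do ALL of the erroring requests have in common?
shared = {}
for key in errors[0]:
    values = {e[key] for e in errors}
    if len(values) == 1:
        shared[key] = values.pop()

print(shared)  # {'status': 404, 'build_id': '789782', 'db_error': 'lock_timeout'}
```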

Other complaints about dashboards:

  • They tend to have percentiles like 95th, 99th, 99.9th, 99.99th, etc., which can cover a multitude of sins. You really want a tool that also allows you to see MAX and MIN, and heatmap distributions.

  • A lot of dashboards end up being overly specific to the incident you just had — naming specific hosts, etc. — which just creates clutter and toil. This is how your dashboards become that graveyard of past outages.

The most useful approach to dashboards is to maintain a small set of them; cull regularly, and think of them as a list of starter queries for your investigations.

Fred Hebert has this analogy, which I like:

“I like to compare the dashboards to the big display in a hospital room: heartbeat, pressure, oxygenation, etc. Those can tell you when a thing is wrong, but the context around the patient chart (and the patient themselves) is what allows interpretation to be effective. If all we have is the display but none of the rest, we’re not getting anywhere close to an accurate picture. The risk with the dashboard is having the metrics but not seeing or knowing about the rest changing.”

In conclusion

Dashboards aren’t universally awful. The overuse of them just encourages sloppy thinking, and static ones make it impossible for you to follow the plot of an outage, or validate your hypotheses. 🤒  There’s too many of them, and not enough shared consensus. (It would help if, like, new dashboards expired within a month if nobody looked at them again.)

If what you have is “nothing”, even shitty dashboards are far better than no dashboards. But shitty dashboards have been the only game in town for far too long. We need more vendors to think about building for queryability, explorability, and the ability to follow a trail of breadcrumbs. Modern systems are going to demand more and more of this approach.

Nothing < Dashboards < a Queryable, Exploratory Interface

If everyone out there who slaps “observability” on their web page also felt the responsibility to add an observability-enabling interface to their tool, one that would let users explore and identify unknown-unknowns, we would all be in a far better place. 🙂


Notes on the Perfidy of Dashboards

Observability is a Many-Splendored Definition

Last weekend, @swyx posted a great little primer on instrumentation titled “Observability Tools in JavaScript”.  A friend sent me the link and suggested that I might want to respond and clarify some things about observability, so I did, and we had a great conversation!  Here is a lightly edited transcript of my reply tweet storm.

First of all, confusion over terminology is understandable, because there are some big players out there actively trying to confuse you!  Big Monitoring is indeed actively trying to define observability down to “metrics, logs and traces”.  I guess they have been paying attention to the interest heating up around observability, and well… they have metrics, logs, and tracing tools to sell?  So they have hopped on the bandwagon with some undeniable zeal.

But metrics, logs and traces are just data types.  Which actually has nothing to do with observability.  Let me explain the difference, and why I think you should care about this.

“Observability? I do not think it means what you think it means.”

Observability is a borrowed term from mechanical engineering/control theory.  It means, paraphrasing: “can you understand what is happening inside the system — can you understand ANY internal state the system may get itself into, simply by asking questions from the outside?”  We can apply this concept to software in interesting ways, and we may end up using some data types, but that’s putting the cart before the horse.

It’s a bit like saying that “database replication means structs, longints and elegantly diagrammed English sentences.”  Er, no.. yes.. missing the point much?

This is such a reliable bait and switch that any time you hear someone talking about “metrics, logs and traces”, you can be pretty damn sure there’s no actual observability going on.  If there were, they’d be talking about that instead — it’s far more interesting!  If there isn’t, they fall back to talking about whatever legacy products they do have, and that typically means, you guessed it: metrics, logs and traces.

❌ Metrics

Metrics in particular are actually quite hostile to observability.  They are usually pre-aggregated, which means you are stuck with whatever questions you defined in advance, and even when they aren’t pre-aggregated they permanently discard the connective tissue of the request at write time, which destroys your ability to correlate issues across requests or track down any individual requests or drill down into a set of results — FOREVER.

Which doesn’t mean metrics aren’t useful!  They are useful for many things!  But they are useful for things like static dashboards, trend analysis over time, or monitoring that a dimension stays within defined thresholds.  Not observability.  (Liz would interrupt here and say that Google’s observability story involves metrics, and that is true — metrics with exemplars.  But this type of solution is not available outside Google as far as we know..)

❌ Logs

Ditto logs.  When I say “logs”, you think “unstructured strings, written out to disk haphazardly during execution, “many” log lines per request, probably contains 1-5 dimensions of useful data per log line, probably has a schema and some defined indexes for searching.”  Logs are at their best when you know exactly what to look for, then you can go and find it.

Again, these connotations and assumptions are the opposite of what observability requires, which is highly structured data: usually generated by instrumentation deep within the app, generally not buffered to local disk, issued as a single event per request per service, schemaless and indexless (or with inferred schemas and autoindexing), and typically containing hundreds of dimensions per event.
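
As a sketch of the difference in shape (Python, with illustrative field names rather than any particular product’s format):

```python
import logging

log = logging.getLogger("export")

# Traditional logging: many loosely structured lines per request, each
# carrying a dimension or two, correlated after the fact by grep and hope.
log.info("request started path=/export user=123")
log.info("image fetch took 212ms")
log.error("export failed: shard timeout")

# Observability-style instrumentation: one wide, structured event per
# request per service, carrying the full context of that request.
wide_event = {
    "service": "image-export", "endpoint": "/export", "user_id": 123,
    "build_id": "789782", "platform": "ios11", "language_pack": "fr-CA",
    "image_fetch_ms": 212, "db_shard": "shard-7", "error": "shard timeout",
    # ...and hundreds more dimensions, schemaless, added as you learn.
}
```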

❓ Traces

Traces?  Now we’re getting closer.  Tracing IS a big part of observability, but tracing just means visualizing events ordered by time.  It certainly isn’t and shouldn’t be a standalone product; that just creates unnecessary friction and distance.  Hrmm … so what IS observability again, as applied to the software domain??

As a reminder, observability applied to software systems means having the ability to ask any question of your systems — understand any user’s behavior or subjective experience — without having to predict that question, behavior or experience in advance.

Observability is about unknown-unknowns.

At its core, observability is about these unknown-unknowns.

Plenty of tools are terrific at helping you ask the questions you could predict wanting to ask in advance.  That’s the easy part.  “What’s the error rate?”  “What is the 99th percentile latency for each service?”  “How many READ queries are taking longer than 30 seconds?”

  • Monitoring tools like DataDog do this — you predefine some checks, then set thresholds that mean ERROR/WARN/OK.
  • Logging tools like Splunk will slurp in any stream of log data, then let you index on questions you want to ask efficiently.
  • APM tools auto-instrument your code and generate lots of useful graphs and lists like “10 slowest endpoints”.

But if you *can’t* predict all the questions you’ll need to ask in advance, or if you *don’t* know what you’re looking for, then you’re in o11y territory.

  • This can happen for infrastructure reasons — microservices, containerization, polyglot storage strategies can result in a combinatorial explosion of components all talking to each other, such that you can’t usefully pre-generate graphs for every combination that can possibly degrade.
  • And it can happen — has already happened — to most of us for product reasons, as you’ll know if you’ve ever tried to figure out why a spike of errors was being caused by users on ios11 using a particular language pack but only in three countries, and only when the request hit the image export microservice running build_id 789782 if the user’s last name starts with “MC” and they then try to click on a particular button which then issues a db request using the wrong cache key for that shard.

Gathering the right data, then exploring the data.

Observability starts with gathering the data at the right level of abstraction, organized around the request path, such that you can slice and dice and group and  look for patterns and cross-correlations in the requests.

To do this, we need to stop firing off metrics and log lines willy-nilly and be more disciplined.  We need to issue one single arbitrarily-wide event per service per request, and it must contain the *full context* of that request. EVERYTHING you know about it, anything you did in it, all the parameters passed into it, etc.  Anything that might someday help you find and identify that request.

Then, when the request is about to exit the service or error out, you ship that blob off to your o11y store as one very wide structured event per request per service.
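
Here is a minimal sketch of that lifecycle in Python. Everything here is hypothetical; `send_to_o11y_store` is a stand-in for whatever client or exporter you actually use. The shape is the point: context accumulates on one event, and nothing ships until the request is done.

```python
import time

def send_to_o11y_store(event: dict) -> None:
    """Stand-in for your real exporter or client; here we just print."""
    print(event)

def handle_request(request: dict, route_handler):
    # Start one wide event per request, and enrich it as you go.
    event = {
        "service": "image-export",
        "endpoint": request["path"],
        "user_id": request.get("user_id"),
        "build_id": "789782",          # whatever you know up front
    }
    start = time.monotonic()
    try:
        response = route_handler(request, event)  # handlers add fields too
        event["status"] = response["status"]
        return response
    except Exception as exc:
        event["error"] = repr(exc)
        event["status"] = 500
        raise
    finally:
        # One wide, structured event per request per service, at exit.
        event["duration_ms"] = (time.monotonic() - start) * 1000
        send_to_o11y_store(event)

# Usage:
# handle_request({"path": "/export", "user_id": 123},
#                lambda req, ev: {"status": 200})
```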

In order to deliver observability, your tool also needs to support high cardinality and high dimensionality.  Briefly, cardinality refers to the number of unique items in a set, and dimensionality means how many adjectives can describe your event.  (There are good overviews of the space, and of the more technical requirements for observability, if you want to read more.)
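
In code terms, a toy illustration of the two words:

```python
events = [
    {"user_id": "u-48291", "build_id": "789782", "endpoint": "/export"},
    {"user_id": "u-90312", "build_id": "789782", "endpoint": "/import"},
]

# Dimensionality: how many adjectives (keys) describe each event.
dimensionality = len(events[0])                    # real events: hundreds

# Cardinality: how many distinct values one dimension takes on.
cardinality = len({e["user_id"] for e in events})  # user_id: effectively unbounded
```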

You REQUIRE the ability to chain and filter as many dimensions as you want with infinitely high cardinality for each one if you’re going to be able to ask arbitrary questions about your unknown-unknowns.  This functionality is table stakes.  It is non-negotiable.  And you cannot get it from any metrics or logs tool on the market today.

Why this matters.

Alright, this is getting pretty long. Let me tell you why I care so much, and why I want people like you specifically (referring to frontend engineers and folks earlier in their careers) to grok what’s at stake in the observability term wars.

We are way behind where we ought to be as an industry. We are shipping code we don’t understand, to systems we have never understood. Some poor sap is on call for this mess, and it’s killing them, which makes the software engineers averse to owning their own code in prod.  What a nightmare.

Meanwhile developers readily admit they waste >40% of their day doing bullshit that doesn’t move the business forward.  In large part this is because they are flying blind, just stabbing around in the dark.

We all just accept this.  We shrug and say well that’s just what it’s like, working on software is just a shit salad with a side of frustration, it’s just the way it is.

But it is fucking not.  It is un fucking necessary.  If you instrument your code, watch it deploy, then ask “is it doing what I expect, does anything else look weird” as a habit?  You can build a system that is both understandable and well-understood.  If you can see what you’re doing, and catch errors swiftly, it never has to become a shitty hairball in the first place.  That is a choice.

🌟 But observability in the original technical sense is a necessary prerequisite to this better world. 🌟

If you can’t break down by high cardinality dimensions like build ids, unique ids, requests, and function names and variables, if you cannot explore and swiftly skim through new questions on the fly, then you cannot inspect the intersection of (your code + production + users) with the specificity required to associate specific changes with specific behaviors.  You can’t look where you are going.

Observability as I define it is like taking off the blindfold and turning on the light before you take a swing at the pinata.  It is necessary, although not sufficient alone, to dramatically improve the way you build software.  Observability as they define it gets you to … exactly where you already are.  Which of these is a good use of a new technical term?


Do better.

And honestly, it’s the next generation who are best poised to learn the new ways and take advantage of them. Observability is far, far easier than the old ways and workarounds … but only if you don’t have decades of scar tissue and old habits to unlearn.

The less time you’ve spent using monitoring tools and ops workarounds, the easier it will be to embrace a new and better way of building and shipping well-crafted code.

Observability matters.  You should care about it.  And vendors need to stop trying to confuse people into buying the same old bullshit tools by smooshing them together and slapping on a new label.  Exactly how long do they expect to fool people for, anyway?

Observability is a Many-Splendored Definition

Outsource Your O11y: Now Roll It Out And Keep Them Happy (part 3/3)

This is part three of a three-part series of guest posts:

  1. How To Be A Champion, on how to choose a third-party vendor and champion them successfully to your security team.  (George Chamales)
  2. Get Aligned With Security, how to work with your security team to find the best possible outcome for all sides (Lilly Ryan)
  3. Now Roll It Out And Keep Them Happy, on how to operationalize your service by rolling out the integration and maintaining it — and the relationship with your security team — over the long run (Andy Isaacson)

All this pain will someday be worth it.  🙏❤️  charity + friends


“Now Roll It Out And Keep Them Happy”

This is the third in a series of blog posts; previously we analyzed the security challenges of using a third party service, and we worked together with the security team to build empathy to deliver the project.  You might want to read those first, since we are going to build on a lot of the ideas there to ship and maintain this integration.

Ready for launch

You’ve convinced the security team and other stakeholders, you’ve gotten the integration running, you’re getting promising results from dev-test or staging environments… now it’s time to move from proof-of-concept to full implementation.  Depending on your situation this might be a transition from staging to production, or it might mean increasing a feature flipper flag from 5% to 100%, or it might mean increasing coverage of an integration from one API endpoint to cover your entire developer footprint.

Taking into account Murphy’s Law, we expect that some things will go wrong during the rollout.  Perhaps as coverage expands, a developer realizes that the schema designed to handle the app’s event mechanism can’t represent a scenario, requiring a redesign or a hacky solution.  Or perhaps the metrics dashboard shows elevated error rates from the API frontend, and while there’s no smoking gun, the ops oncall decides to roll back the integration Just In Case it’s causing the incident.

This gives us another chance to practice empathy — while it’s easy, wearing the champion hat, to dismiss any issues found by looking for someone to blame, ultimately this poisons trust within your organization and will hamper success.  It’s more effective, in the long run (and often even in the short run), to find common ground with your peers in other disciplines and teams, and work through to solutions that satisfy everybody.

Keeping the lights on

In all likelihood, as the integration succeeds, the team will rapidly develop experts and expertise, as well as idiomatic ways to use the product.  Let the experts surprise you; folks you might not expect can step up when given a chance.  Expertise flourishes when given guidance and goals; as the team becomes comfortable with the integration, explicitly recognize a leader or point person for each vendor relationship.  Having one person explicitly responsible for a relationship lets them pay attention to those vendor emails and updates, and avoid the tragedy of the “but I thought *you* were” commons.  This Integration Lead is also a center of knowledge transfer for your organization — they won’t know everything or be able to help every user come up to speed, but they can empower the local power users in each team to ramp up their teams on the integration.

As comfort grows you will start to consider ways to change your usage, for example growing into new kinds of data.  This is a good time to revisit that security checklist — does the change increase PII exposure to your vendor?  Would the new data lead to additional requirements such as per-field encryption?  Don’t let these security concerns block you from gaining valuable insight using the new tool, but do take the chance to talk it over with your security experts as appropriate.

Throughout this organic growth, the Integration Lead remains central to managing your changing usage of the vendor they shepherd.  As new categories of data are added to the integration, the Lead is responsible for ensuring that the vendor relationship and risk profile stay matched to the demands that the new usage (and, presumably, business value) places on the relationship.

Documenting the Integration Lead role and responsibilities is critical. The team should know when to check in, and writing it down helps it happen.  When new code has a security implication, or a new use case potentially amplifies the cost of an integration, bringing the domain expert in will avoid unhappy surprises.  Knowing how to find out who to bring in, and when to bring them in, will keep the right eyes on your team’s changes.

Security threats and other challenges change over time, too.  Collaborating with your security team so that they know what systems are in use helps your team take note of new information that is relevant to your business. A simple example is noting when your vendors publish a breach announcement, but more complex examples happen too — your vendor transitions cloud providers from AWS to Azure and the security team gets an alert about unexpected data flows from your production cluster; with transparency and trust such events become part of a routine process rather than an emergency.

It’s all operational

Monitoring and alerting is a fact of operations life, and this has to include vendor integrations (even when the vendor integration is a monitoring product.)  All of your operations best practices are needed here — keep your alerts clean and actionable so that you don’t develop pager fatigue, and monitor performance of the integration so that you don’t get blindsided by a creeping latency monster in your APIs.
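
For instance, here is a minimal sketch of wrapping a vendor-client call with a timing guard, in Python. The names and threshold are hypothetical; tune them to your own SLOs. The idea is simply that the integration itself shows up in your normal monitoring, like any other dependency.

```python
import logging
import time

log = logging.getLogger("vendor-integration")
SLOW_CALL_MS = 500  # hypothetical threshold; tune to your own SLOs

def call_vendor(send_fn, payload):
    """Wrap a vendor-client call so slowness becomes visible and alertable."""
    start = time.monotonic()
    try:
        return send_fn(payload)
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        if elapsed_ms > SLOW_CALL_MS:
            # Route this into the same alerting as everything else,
            # with enough context to be actionable.
            log.warning("vendor call slow: %.0fms (threshold %dms)",
                        elapsed_ms, SLOW_CALL_MS)
```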

Authentication and authorization are changing as the threat landscape evolves and industry moves from SMS verification codes to U2F/WebAuthn.  Does your vendor support your SSO integration?  If they can’t support the same SSO that you use everywhere else and can’t add it — or worse, look confused when you mention SSO — that’s probably a sign you should consider a different vendor.

A beautiful sunset

Have a plan beforehand for what needs to be done should you stop using the service.  Got any mobile apps that depend on APIs that will go away or start returning permission errors?  Be sure to test these scenarios ahead of time.

What happens at contract termination to data stored on the service?  Do you need to explicitly delete data when ceasing use?

Do you need to remove integrations from your systems before ending the commercial relationship, or can the technical shutdown and business shutdown run in parallel?

In all likelihood these are contingency plans that will never be needed, and they don’t need to be fully fleshed out to start, but a little bit of forethought can avoid unpleasant surprises.

Year after year

Industry best practice and common sense dictate that you should revisit the security questionnaire annually (if not more frequently). Use this chance to take stock of the last year and check in — are you getting value from the service?  What has changed in your business needs and the competitive landscape? 

It’s entirely possible that a new year brings new challenges, which could make your current vendor even more valuable (time to negotiate a better contract rate!) or could mean you’d do better with a competing service.  Has the vendor gone through any major changes?  They might have new offerings that suit your needs well, or they may have pivoted away from the features you need. 

Check in with your friends on the security team as well; standards evolve, and last year’s sufficient solution might not be good enough for new requirements.


Andy thinks out loud about security, society, and the problems with computers on Twitter.



❤️ Thanks so much reading, folks.  Please feel free to drop any complaints, comments, or additional tips to us in the comments, or direct them to me on twitter.

Have fun!  Stay (a little bit) Paranoid!!

— charity

Outsource Your O11y: Now Roll It Out And Keep Them Happy (part 3/3)

Outsource Your O11y: Get Aligned With Security (part 2/3)

This is part two of a three-part series of guest posts:

  1. How To Be A Champion, on how to choose a third-party vendor and champion them successfully to your security team.  (George Chamales)
  2. Get Aligned With Security, how to work with your security team to find the best possible outcome for all sides (Lilly Ryan)
  3. Now Roll It Out And Keep Them Happy, on how to operationalize your service by rolling out the integration and maintaining it — and the relationship with your security team — over the long run (Andy Isaacson)

All this pain will someday be worth it.  🙏❤️  charity + friends


“Get Aligned With Security”

by Lilly Ryan

If your team has decided on a third-party service to help you gather data and debug product issues, how do you convince an often overeager internal security team to help you adopt it?

When this service is something that provides a pathway for developers to access production data, as analytics tools often do, making the case for access to that data can screech to a halt at the mention of the word “production”. Progressing past that point will take time, empathy, and consideration.

I have been on both sides of the “adopting a new service” fence: as a developer hoping to introduce something new and useful to our stack, and now as a security professional who spends her days trying to bust holes in other people’s setups. I understand the sometimes-conflicting needs to both ship software and keep systems safe.

This guide has advice to help you solve the immediate problem of choosing and deploying a third-party service with the approval of your security team.  But it also has advice for how to strengthen the working relationship between your security and development teams over the longer term. No two companies are the same, so please adapt these ideas to fit your circumstances.

Understanding the security mindset

The biggest problems in technology are never really about technology, but about people. Seeing your security team as people and understanding where they are coming from will help you to establish empathy with them so that both of you want to help each other get what you want, not block each other.

First, understand where your security team is coming from. Development teams need to build features, improve the product, understand and ship good code. Security teams need to make sure you don’t end up on the cover of the NYT for data breaches, that your business isn’t halted by ransomware, and that you’re not building your product on a vulnerable stack.

This can be an unfamiliar frame of mind for developers.  Software development tends to attract positive-minded people who love creating things and are excited about the possibilities of new technology. Software security tends to attract negative thinkers who are skilled at finding all the flaws in a system.  These are very different mentalities, and the people who occupy them tend to have very different assumptions, vocabularies, and worldviews.   

But if you and your security team can’t share the same worldview, it will be hard to trust each other and come to agreement.  This is where practicing empathy can be helpful.

Before approaching your security team with your request to approve a new vendor, you may want to run some practice exercises: put yourselves in their shoes, and deliberately cultivate a negative thinking mindset so you can anticipate how they may react — not just in terms of the objective risk to the business or the compliance headaches it might cause, but also which arguments might resonate with them and what emotional reactions they might have.

My favourite exercise for getting teams to think negatively is what I call the Land Astronaut approach.

The “Land Astronaut” Game

Imagine you are an astronaut on the International Space Station. Literally everything you do in space has death as a highly possible outcome. So astronauts spend a lot of time analysing, re-enacting, and optimizing their reactions to events, until it becomes muscle memory. By expecting and training for failure, astronauts use negative thinking to anticipate and mitigate flaws before they happen. It makes their chances of survival greater and their people ready for any crisis.

Your project may not be as high-stakes as a space mission, and your feet will most likely remain on the ground for the duration of your work, but you can bet your security team is regularly indulging in worst-case astronaut-type thinking. You and your team should try it, too.

The Game:

Pick a service for you and your team to game out.  Schedule an hour, book a room with a whiteboard, put on your Land Astronaut helmets.  Then tell your team to spend half an hour brainstorming about all the terrible things that can happen to that service, or to the rest of your stack when that service is introduced.  Negative thoughts only!

Brainstorm together, starting out as outlandish as possible (what happens if their data centre is suddenly overrun by a stampede of elephants?). Eventually you will tire of the extreme worst case scenarios and come to consider more realistic outcomes — some of which you may not have thought of outside the structure of the activity.

After half an hour, or whenever you feel like you’re all done brainstorming, take off your Land Astronaut helmets, sift out the most plausible of the worst case scenarios, and try to come up with answers or strategies that will help you counteract them.  Which risks are plausible enough that you should mitigate them?  Which are you prepared to gamble on never happening?  How will this risk calculus change as your company grows and takes on more exposure?

Doing this with your team will allow you all to practice the negative thinking mindset together and get a feel for how your colleagues in the security team might approach this request. (While this may seem similar to threat modelling exercises you might have done in the past, the focus here is on learning to adopt a security mindset and gaining empathy for this thought process, rather than running through a technical checklist of common areas of concern.)

While you still have your helmets within reach, use your negative thinking mindset to fill out the spreadsheet from the first piece in this series.  This will help you anticipate most of the reasonable objections security might raise, and may help you include useful detail the security team might not have known to ask for.

Once you have prepared your list of answers to George’s worksheet and held a team Land Astronaut session together, you will have come most of the way to getting on board with the way your security team thinks.

Preparing for compromise

You’ve considered your options carefully, you’ve learned how to harness negative thinking to your advantage, and you’re ready to talk to your colleagues in security – but sometimes, even with all of these tools at your disposal, you may not walk away with all of the things you are hoping for.

Being willing to compromise and anticipating some of those compromises before you approach the security team will help you negotiate more successfully.

While your Land Astronaut helmets are still within reach, consider using your negative thinking mindset game to identify areas where you may be asked to compromise. If you’re asking for production access to this new service for observability and debugging purposes, think about what kinds of objections may be raised about this and how you might counter them or accommodate them. Consider continuing the activity with half of the team remaining in the Land Astronaut role while the other half advocates from a positive thinking standpoint. This dynamic will get you having conversations about compromise early on, so that when the security team inevitably raises eyebrows, you are ready with answers.

Be prepared to consider compromises you had not anticipated, and enter into discussions with the security team with as open a mind as possible. Remember the team is balancing priorities of not only your team, but other business and development teams as well.  If you and your security colleagues are doing the hard work to meet each other halfway then you are more likely to arrive at a solution that satisfies both parties.

Working together for the long term

While the previous strategies we’ve covered focus on short-term outcomes, in this continuous-deployment, shift-left world we now live in, the best way to convince your security team of the benefits of a third-party service – or any other decision – is to have them along from day one, as part of the team.

Roles and teams are increasingly fluid and boundary-crossing, yet security remains one of the roles least likely to be considered for inclusion on a software development team. Even in 2019, the task of ensuring that your product and stack are secure and well-defended is often left until the end of the development cycle.  This contributes a great deal to the combative atmosphere that is common.

Bringing security people into the development process much earlier builds rapport and prevents these adversarial, territorial dynamics. Consider working together to build Disaster Recovery plans and coordinating for shared production ownership.

If your organisation isn’t ready for that kind of structural shift, there are other ways to work together more closely with your security colleagues.

Try having members of your team spend a week or two embedded with the security team. You may even consider a rolling exchange – a developer for a security team member – so that developers build the security mindset, and the security team is able to understand the problems your team is facing (and why you are looking at introducing this new service).

At the very least, you should make regular time to meet with the security team, get to know them as people, and avoid springing things on them late in the project when change is hardest.

Riding off together into the sunset…?

If you’ve taken the time to get to know your security team and how they think, you’ll hopefully be able to get what you want from them – or perhaps you’ll understand why their objections were valid, and come up with a better solution that works well for both of you.

Investing in a strong relationship between your development and security teams will rarely lead to the apocalypse. Instead, you’ll end up with a better product, probably some new work friends, and maybe an exciting idea for a boundary-crossing new career in tech.

But this story isn’t over! Once you get the green light from security, you’ll need to think about how to roll your new service out safely, maintain it, and consider its full lifespan within your company.  Which leads us to part three of this series, on rolling it out and maintaining it … both your integration and your relationship with the security team.


Lilly Ryan is a pen tester, Python wrangler, and recovering historian from Melbourne. She writes and speaks internationally about ethical software, social identities after death, teamwork, and the telegraph. More recently she has researched the domestic use of arsenic in Victorian England, attempted urban camouflage, reverse engineered APIs, wielded the Oxford comma, and baked a really good lemon shortbread.

Outsource Your O11y: Get Aligned With Security (part 2/3)

Outsource Your O11y: How To Be A Champion (part 1/3)

I hear variations on this question constantly: “I’d really like to use a service like Honeycomb for my observability, but I’m told I can’t ship any data off site.  Do you have any advice on how to convince my security team to let me?”

I’ve given lots of answers, most of them unsatisfactory.  “Strip the PII/PHI from your operational data.”  “Validate server side.”  “Use our secure tenancy proxy.”  (I’m not bad at security from a technical perspective, but I am not fluent with the local lingo, and I’ve never actually worked with an in-house security team — I’ve always *been* the security team, de facto as it may be.)

So I’ve invited three experts to share their wisdom in a three-part series of guest posts:

  1. How To Be A Champion, on how to choose a third-party vendor and champion them successfully to your security team.  (George Chamales)
  2. Get Aligned With Security, how to work with your security team to find the best possible outcome for all sides (Lilly Ryan)
  3. Now Roll It Out And Keep Them Happy, on how to operationalize your service by rolling out the integration and maintaining it — and the relationship with your security team — over the long run (Andy Isaacson)

My ✨first-ever guest posts✨!  Yippee.  I hope these are useful to you, wherever you are in the process of outsourcing your tools.  You are on the right path: outsourcing your observability to a vendor for whom it’s their One Job is almost always the right call, in terms of money and time and focus — and yes, even security. 

All this pain will someday be worth it.  🙏❤️  charity + friends


“How to be a Champion”

by George Chamales

You’ve found a third party service you want to bring into your company, hooray!

To you, it’s an opportunity to deploy new features in a flash, juice your team’s productivity, and save boatloads of money.

To your security and compliance teams, it’s a chance to lose your customers’ data, cause your applications to fall over, and do inordinate damage to your company’s reputation and bottom line.

The good news is, you’re absolutely right.  The bad news is, so are they.

Successfully championing a new service inside your organization will require you to convince people that the rewards of the new service are greater than the risks it will introduce (there’s a guide below to help you).  

You’re convinced the rewards are real. Let’s talk about the risks.

The past year has seen cases of hackers using third party services to target everything from government agencies, to activists, to Target (again).  Not to be outdone, attention-seeking security companies have been actively hunting for companies exposing customer data, then issuing splashy press releases as a means to flog their products and services.

A key feature of these name-and-shame campaigns is making sure the headline features the most recognizable company involved – the clickbait lead “MBM Inc. Loses Customer Data” is nowhere near as catchy as “Walmart Jewelry Partner Exposes Personal Data Of 1.3M Customers.”

While there are scary stories out there, in many, many cases the risks will be outweighed by the rewards. Telling the difference between those innumerable good calls and the one career-limiting move requires thoughtful consideration and some up-front risk mitigation.

When choosing a third party service, keep the following in mind:

    • The security risks of a service are highly dependent on how you use it.  
      You can adjust your usage to decrease your risk.  There’s a big difference between sending a third party your server metrics vs. your customers’ personal information.  Operational metrics, properly scrubbed, are categorically less sensitive than PII or PHI.  (One way to scrub data before it leaves your network is sketched just after this list.)
    • There’s no way to know how good a service’s security really is.  
      History is full of compromised companies who had very pretty security pages and certifications (here’s Equifax circa September 2017).  Security features are a stronger indicator, but there are a lot more moving parts that go into maintaining a service’s security.
    • Always weigh the risks vs. the rewards.
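
Here is the scrubbing sketch promised above: a deliberately allowlist-based approach in Python, with hypothetical field names. (Hashing identifiers is pseudonymization, not bulletproof anonymization, so check the approach with your security team.)

```python
import hashlib

# Hypothetical allowlist: only operationally useful fields leave the network.
ALLOWED_FIELDS = {"endpoint", "status", "duration_ms", "build_id"}
# Useful for correlation, but too sensitive to send in the clear.
HASHED_FIELDS = {"user_id", "email"}

def scrub(event: dict) -> dict:
    """Return a copy of the event that is safer to send to a third party."""
    safe = {k: v for k, v in event.items() if k in ALLOWED_FIELDS}
    for field in HASHED_FIELDS & event.keys():
        digest = hashlib.sha256(str(event[field]).encode()).hexdigest()
        safe[field] = digest[:12]  # still correlatable, no longer readable
    return safe

print(scrub({"endpoint": "/payment", "status": 404, "user_id": "u-48291",
             "email": "jane@example.com", "password": "hunter2"}))
# Everything not explicitly allowed (e.g. the password) is dropped.
```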


There’s risk no matter what you do – bringing in the service is risky, doing nothing is risky.  You can only mitigate risks up to a point. Beyond that point, it’s the rewards that make risks worthwhile.

Context is critical in understanding the risks and rewards of a new service.  

You can use the following guide to put things in context as you champion a new service through the gauntlet of management, security, and compliance teams.  That context becomes even more powerful when you can think about the approval process from the perspective of the folks you’ll need to win over to get the okay to move forward.

In the next part of this series Lilly Ryan shares a variety of techniques to take on the perspective of your management, security and compliance teams, enabling you to constructively work through responses that can include everything from “We have concerns…” to “No” to “Oh Helllllllll No.”

Championing a new service is hard – and it can be equally worthwhile.  Good luck!


George Chamales is a useful person to have around. Please send critiques of this post to george@criticalsec.com

“A Security Guide for Third Party Services” Worksheet

Note to thoughtful service providers:  You may want to fill parts of this out ahead of time and give it to your prospective customers.  It will provide your champion with good fortune in the compliance wars to come.  (Also available as a nicely formatted spreadsheet.)


Our Reasons

  • Why this service?  This is the justification for the service – the compelling rewards that will outweigh the inevitable risks.  What will be true once the service is online?  Good reasons are ones that a fifth grader would understand.

Our Data

  • What data will / won’t it collect?  Describe the classes or types of data the service will access / store, and why that’s necessary for the service to operate.  If there are specific types of sensitive data the service won’t collect (e.g. passwords, Personally Identifiable Information, Protected Health Information), explicitly call them out.

  • How is data accessed?  Describe the process for getting data to the service.  Do you have to run their code on your servers, or on your customers’ computers?

Our Costs

  • Costs of NOT doing it?  These are the financial risks / liabilities of not going with this service.  What are the worst-case and average costs?  Have you had costly problems in the past that could have been avoided if you were using this service?

  • Costs of doing it?  Include the cost of the service and, if possible, the amount of person-time it’s going to take to operate the service.  Ideally this is less than the cost of not doing it.

Our Risk – how mad will important people be…

  • If it’s compromised?  What would happen if hackers or attention-seeking security companies publicly released the data you sent the service?  Is it catastrophic or an annoyance?

  • When it goes down?  When this service goes down (and it will go down), will it be a minor inconvenience, or will it take out your primary application and infuriate your most valuable customers?

Their Security – in order of importance

  • SSO & 2FA support?  This is a security smoke test: if a service doesn’t support SSO or 2FA, it’s safe to assume that they don’t prioritize security.  It’s also a good idea to investigate SSO support up front, since some vendors charge extra for it (which is a shame).

  • Fine-grained permissions?  This is another key indicator of the service’s maturity level, since it takes time and effort to build in.  It’s also something else they might make you pay extra for.

  • Security certifications?  These aren’t guarantees of quality, but they do indicate that the company has put some effort and money into its processes.  Check their website for general security compliance merit badges such as SOC2 and ISO27001, or industry-specific ones like PCI or HIPAA.

  • Security & privacy pages?  If these exist, it means they’re willing to publicly state that they do something about security.  The more specific and detailed, the better.

  • Vendor’s security history?  Have there been any spectacular breaches that demonstrated a callous disregard for security, gross incompetence, or both?

  • BONUS questions: Want to really poke and prod the internal security of your vendor?  Ask if they can answer the following:

      • How many known vulnerabilities (CVEs) exist on your production infrastructure right now?
      • At what time (exactly) was the last successful backup of all your customer data completed?
      • What were the last three secrets accessed in the production environment?

Our Decision

  • Is it worth it?  Look back through the previous sections and ask whether it makes sense to: use the 3rd party service, build it yourself, or not do it at all.  Would a thoughtful person agree with you?


Outsource Your O11y: How To Be A Champion (part 1/3)