From Cloudwashing to O11ywashing

November 24, 2025 · mipsytipsy · monitoring, rageguy, three pillars, unified storage, vendors

I was just watching a panel on observability, with a handful of industry executives and experts who shall remain nameless and hopefully duly obscured—their identities are not the point, the point is that this is a mainstream view among engineering executives and my head is exploding.

Scene: the moderator asked a fairly banal moderator-esque question about how happy and/or disappointed each exec has been with their observability investments.

One executive said that as far as traditional observability tools are concerned (“are there faults in our systems?”), that stuff “generally works well.”

However, what they really care about is observing the quality of their product from the customer’s perspective. EACH customer’s perspective.

Nines don’t matter if users aren’t happy

“Did you know,” he mused, “that there are LOTS of things that can interrupt service or damage customer experience that won’t impact your nines of availability?”

(I begin screaming helplessly into my monitor.)

“You could have a dependency hiccup,” he continued, oblivious to my distress. “There could be an issue with rendering latency in your mobile app. All kinds of things.”

(I look down and realize that I am literally wearing this shirt.)

He finishes with, “And that is why we have invested in our own custom solution to measure key workflows through startup payment and success.”

(I have exploded. Pieces of my head now litter this office while my headless corpse types on and on.)

It’s twenty fucking twenty five. How have we come to this point?

Observability is now a billion dollar market for a meaningless term

My friends, I have failed you.

It is hard not to register this as a colossal fucking failure on a personal level when a group of modern, high performing tech execs and experts can all sit around a table nodding their heads at the idea that “traditional observability” is about whether your systems are UP👆 or DOWN👇, and that the idea of observing the quality of service from each customer’s perspective remains unsolved! unexplored! a problem any modern company needs to write custom tooling from scratch to solve.

This guy is literally describing the original definition of observability, and he doesn’t even know it. He doesn’t know it so hard that he went and built his own thing.

You guys know this, right? When he says “traditional observability tools”, he means monitoring tools. He means the whole three fucking pillars model: metrics, logging, and tracing, all separate things. As he notes, these traditional tools are entirely capable of delivering on basic operational outcomes (are we up, down, happy, sad?). They can DO this. They are VERY GOOD tools if that is your goal.

But they are not capable of solving the problem he wants to solve, because that would require combining app, business, and system telemetry in a unified way. Data that is traceable, but not just tracing. With the ability to slice and dice by any customer ID, site location, device ID, blah blah. Whatever shall we call THAT technological innovation, when someone invents it? Schmobservability, perhaps?
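To make "unified" a little more concrete, here is a minimal sketch, assuming one wide event per request that carries app, business, and system fields together. The field names (`customer_id`, `device_id`, `site`, `workflow`) and the `slice_by` helper are hypothetical, invented for illustration; this is not any vendor's schema or API.

```python
from collections import defaultdict

# Hypothetical wide events: one record per request, with app, business,
# and system context in the same place. All field names are invented.
EVENTS = [
    {"customer_id": "c1", "device_id": "ios-17", "site": "us-east",
     "workflow": "payment", "duration_ms": 120, "ok": True},
    {"customer_id": "c1", "device_id": "ios-17", "site": "us-east",
     "workflow": "payment", "duration_ms": 2400, "ok": False},
    {"customer_id": "c2", "device_id": "android-14", "site": "eu-west",
     "workflow": "startup", "duration_ms": 300, "ok": True},
    {"customer_id": "c2", "device_id": "android-14", "site": "eu-west",
     "workflow": "payment", "duration_ms": 180, "ok": True},
]


def slice_by(events, dimension):
    """Group events by any field and report count, failures, and worst latency."""
    groups = defaultdict(lambda: {"count": 0, "failures": 0, "max_ms": 0})
    for event in events:
        stats = groups[event[dimension]]
        stats["count"] += 1
        stats["failures"] += 0 if event["ok"] else 1
        stats["max_ms"] = max(stats["max_ms"], event["duration_ms"])
    return dict(groups)


if __name__ == "__main__":
    # The same events answer per-customer, per-device, and per-workflow questions.
    for dim in ("customer_id", "device_id", "workflow"):
        print(dim, slice_by(EVENTS, dim))
```

The point of the sketch is that nothing about the grouping is decided ahead of time: the dimension is just a field on the event, so "EACH customer's perspective" is one more way to slice the same data.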

So anyway, “traditional observability” is now part of the mainstream vernacular. Fuck. What are we going to do about it? What CAN be done about it?

From cloudwashing to o11ywashing

I learned a new term yesterday: cloudwashing. I learned this from Rick Clark, who tells a hilarious story about the time IBM got so wound up in the enthusiasm for cloud computing that they reclassified their Z series mainframe as “cloud” back in 2008.

(Even more hilarious: asking Google about the precipitating event, and following the LLM down a decade-long wormhole of incredibly defensive posturing from the IBM marketing department and their paid foot soldiers in tech media about how this always gets held up as an example of peak cloudwashing but it was NOT AT ALL cloudwashing due to being an extension of the Z/Series Mainframe rather than a REPLACEMENT of the Z/Series Mainframe, and did you know that Mainframes are bigger business and more relevant today than ever before?)

(Sorry, but I lost a whole afternoon to this nonsense, I had to bring you along for the ride.)

Rick says the same thing is happening right now with observability. And of course it is. It’s too big of a problem, with too big a budget: an irresistible target. It’s not just the legacy behemoths anymore. Any vendor that does anything remotely connected to telemetry is busy painting on a fresh coat of o11ywashing. From a marketing perspective, it would be irresponsible not to.

How to push back on *-washing

Anyway, here are the key takeaways from my weekend research into cloudwashing.

This o11ywashing problem isn’t going away. It is only going to get bigger, because the problem keeps getting bigger, because the traditional vendors aren’t solving it, because they can’t solve it.
The Gartners of the world will help users sort this out someday, maybe, but only after we win. We can’t expect them to alienate multibillion dollar companies in the pursuit of technical truth, justice and the American Way. If we ever want to see “Industry Experts” pitching in to help users spot o11ywashing, as they eventually did with cloudwashing (see exhibit A), we first need to win in the market.
Exhibit A: “How to Spot Cloudwashing”
And (this is the only one that really matters) we have to do a better job of telling this story to engineering executives, not just engineers. Results and outcomes, not data structures and algorithms.

(I don’t want to make this sound like an epiphany we JUST had…we’ve been working hard on this for a couple years now, and it’s starting to pay off. But it was a powerful confirmation.)

Talking to execs is different than talking to engineers

When Christine and I started Honeycomb, nearly ten years ago, we were innocent, doe-eyed engineers who truly believed on some level that if we just explained the technical details of cardinality and dimensionality clearly and patiently enough to the world, enough times, the consequences to the business would become obvious to everyone involved.

It has now been ten years since I was a hands-on engineer every day (say it again, like pressing on a bruise makes it hurt less), and I would say I’ve been a decently functioning exec for about the last three or four of those years.

What I’ve learned in that time has actually given me a lot of empathy for the different stresses and pressures that execs are under.

I wouldn’t say it’s less or more than the stresses of being an SRE on call for some of the world’s biggest databases, but it is a deeply and utterly different kind of stress, the kind of stress less expiable via fine whiskey and poor life choices. (You just wake up in the morning with a hangover, and the existential awareness of your responsibilities looming larger than ever.)

This is a systems problem, not an operational one

There is a lot of noise in the field, and executives are trying to make good decisions that satisfy all parties and constraints amidst the unprecedented stress-panic-opportunity-terror of AI changing everything. That takes storytelling skills and sales discipline on our part, in addition to technical excellence.

Companies are dumping more and more and more money into their so-called observability tools, and not getting any closer to a solution. Nor will they, so long as they keep thinking about observability in terms of operational outcomes (and buying operational tools). Observability is a systems problem. It’s the most powerful lever in your arsenal when it comes to disrupting software doom spirals and turning them into positive feedback loops. Or it should be.

As Fred Hebert might say, it’s great you’re so good at firefighting, but maybe it’s time to go read the city fire codes.

Execs don’t know what they don’t know, because we haven’t been speaking to them. But we’re starting to.

What will be the next term that gets invented and coopted in the search to solve this problem?

Where to start, with a project so big? Google’s AI says that “experts suggest looking for specific features to identify true ~~cloud~~ observability solutions versus ~~cloudwashed~~ o11ywashed ones.”

I guess this is as good a place to start as any: If your “observability” tooling doesn’t help you understand the quality of your product from the customer’s perspective, EACH customer’s perspective, it isn’t fucking observability.

It’s just monitoring dressed up in marketing dollars.

Call it o11ywashing.

Questionable Advice: “People Used To Take Me Seriously. Then I Became A Software Vendor”

March 29, 2023 · July 14, 2023 · mipsytipsy · advice, culture, startups, vendors

I recently got a plaintive text message from my magnificent friend Abby Bangser, asking about a conversation we had several years ago:

“Hey, I’ve got a question for you. A long time ago I remember you talking about what an adjustment it was becoming a vendor, how all of a sudden people would just discard your opinion and your expertise without even listening. And that it was SUPER ANNOYING.

I’m now experiencing something similar. Did you ever find any good reading/listening/watching to help you adjust to being on the vendor side without being either a terrible human or constantly disregarded?”

Oh my.. This brings back memories. ☺️🙈

Like Abby, I’ve spent most of my career as an engineer in the trenches. I have also spent a lot of time cheerfully talking smack about software. I’ve never really had anyone question my experience[1] or my authority as an expert, hardened as I was in the flames of a thousand tire fires.

Then I started a software company. And all of a sudden this bullshit starts popping up. Someone brushing me off because I was “selling something”, or dismissing my work like I was fatally compromised. I shrugged it off, but if I stopped to think, it really bothered me. Sometimes I felt like yelling “HEY FUCKERS, I am one of your kind! I’m trying to HELP YOU. Stop making this so hard!” 😡 (And sometimes I actually did yell, lol.)

That’s what I remember complaining to Abby about, five or six years ago. It was all very fresh and raw at the time.

We’ll get to that. First let’s dial the clock back a few more years, so you can fully appreciate the rich irony of my situation. (Or skip the story and jump straight to “Five easy ways to make yourself a vendor worth listening to”.)

The first time I encountered “software for sale”

My earliest interaction with software vendors was at Linden Lab. Like most infrastructure teams, most of the software we used was open source. But somewhere around 2009? 2010? Linden’s data engineering team began auditioning vendors like Splunk, Greenplum, Vertica[2], etc for our data warehouse solution, and I tagged along as the sysinfra/ops delegate.

For two full days we sat around this enormous table as vendor after vendor came by to demo and plump their wares, then opened the floor for questions.

One of the very first sales guys did something that pissed me off. I don’t remember exactly what happened — maybe he was ignoring my questions or talking down to me. (I’m certain I didn’t come across like a seasoned engineering professional; in my mid twenties, face buried in my laptop, probably wearing pajamas and/or pigtails.) But I do remember becoming very irritated, then settling in to a stance of, shall we say, oppositional defiance.

I peppered every sales team aggressively with questions about the operational burden of running their software, their architectural decisions, and how canned or cherry-picked their demos were. Any time they let slip a sign of weakness or betrayed uncertainty, I bore down harder and twisted the knife. I was a ✨royal asshole✨. My coworkers on the data team found this extremely entertaining, which only egged me on.

What the fuck?? 🫢😧🫠 I’m not usually an asshole to strangers.. where did that come from?

What open source culture taught me about sales

I came from open source, where contempt for software vendors was apparently de rigueur. (is it still this way?? seems like it might have gotten better? 😦) It is fascinating now to look back and realize how much attitude I soaked up before coming face to face with my first software vendor. According to my worldview at the time,

1) Vendors are liars
2) They will say anything to get you to buy
3) Open source software is always the safest and best code
4) Software written for profit is inherently inferior, and will ultimately be replaced by the inevitable rise of better, faster, more democratic open source solutions
5) Sales exists to create needs that ought never to have existed, then take you to the cleaners
6) Engineers who go work for software vendors have either sold out, or they aren’t good enough to hack it writing real (consumer facing) software.

I’m remembering Richard Stallman trailing around behind me, up and down the rows of vendor booths at USENIX in his St. IGNUcius robes, silver disk platter halo atop his head, offering (begging?) to lay his hands on my laptop and bless it, to “free it from the demons of proprietary software.” Huh. (Remember THIS song? 🎶 😱)

Given all that, it’s not hugely surprising that my first encounter with software vendors devolved into hostile questioning.

(It’s fun to speculate on the origin of some of these beliefs. Like, I bet 3) and 4) came from working on databases, particularly Oracle and MySQL/Postgres. As for 5) that sounds an awful lot like the beauty industry and other products sold to women. 🤭)

Behind every software vendor lies a recovering open source zealot(???)

I’ve had many, many experiences since then that slowly helped me dismantle this worldview, brick by brick. Working at Facebook made me realize that open source successes like Apache, HAProxy, Nginx etc are exceptions, not the norm; that this model is only viable for certain types of general-purpose infrastructure software; that governance and roadmaps are a huge issue for open source projects too; and that if steady progress is being made, at the end of the day, somewhere somebody is probably paying those developers.

I learned that the overwhelming majority of production-caliber code is written by somebody who was paid to write it — not by volunteers. I learned about coordination costs and overhead, how expensive it is to organize an army of volunteers, and the pains of decentralized quality control. I learned that you really really want the person who wrote the code to stick around and own it for a long time, and not just on alternate weekends when they don’t have the kids (and/or they happen to feel like it).

I learned about game theory, and I learned that sales is about relationships. Yes, there are unscrupulous sellers out there, just like there are shady developers, but good sales people don’t want you to walk away feeling tricked or disappointed any more than you want to be tricked or disappointed. They want to exceed your expectations and deliver more value than expected, so you’ll keep coming back. In game theory terms, it’s a “repeated game”.

I learned SO MUCH from interviewing sales candidates at Honeycomb.[3] Early on, when nobody knew who we were, I began to notice how much our sales candidates were obsessed with value. They were constantly trying to puzzle out how much value Honeycomb actually brought to the companies we were selling to. I was not used to talking or thinking about software in terms of “value”, and initially I found this incredibly offputting (can you believe it?? 😳).

Sell unto others as you would have them sell unto you

Ultimately, this was the biggest (if dumbest) lesson of all: I learned that good software has tremendous value. It unlocks value and creates value, it pays enormous ongoing dividends in dollars and productivity, and the people who build it, support it, and bring it to market fully deserve to recoup a slice of the value they created for others.

There was a time when I would have bristled indignantly and said, “we didn’t start Honeycomb to make money!” I would have said that we built Honeycomb because we knew as engineers what a radical shift it had wrought in how we built and understood software, and we didn’t want to live without it, ever again.

But that’s not quite true. Right from the start, Christine and I were intent on building not just great software, but a great software business. It wasn’t personal wealth we were chasing, it was independence and autonomy — the freedom to build and run a company the way we thought it should be run, building software to radically empower other engineers like ourselves.

Guess what you have to do if you care about freedom and autonomy?

Make money. 🙄☺️

I also realized, belatedly, that most people who start software companies do so for the same damn reasons Christine and I did… to solve hard problems, share solutions, and help other engineers like ourselves. If all you want to do is get rich, this is actually a pretty stupid way to do that. Over 90% of startups fail, and even the so-called “success stories” aren’t as predictably lucrative as RSUs. And then there’s the wear and tear on relationships, the loss of social life, the vicissitudes of the financial system, the ever-looming spectre of failure … 👻☠️🪦 Startups are brutal, my friend.

Karma is a bitch

None of these are particularly novel insights, but there was a time when they were definitely news to me. ☺️ It was a pretty big shock to my system when I first became a software vendor and found myself sitting on the other side of the table, the freshly minted target of hostile questioning.

These days I am far less likely to be cited as an objective expert than I used to be. I see people on Hacker News dismissing me with the same scornful wave of the hand as I used to dismiss other vendors. Karma’s a bitch, as they say. What goes around comes around. 🥰

I used to get very bent out of shape by this. “You act like I only care because I’m trying to sell you something,” I would hotly protest, “but it’s exactly the opposite. I built something because I cared.” That may be true, but it doesn’t change the fact that vested interests can create blind spots, ones I might not even be aware of.

And that’s ok! My arguments/my solutions should be sturdy enough to withstand any disclosure of personal interest. ☺️

Some people are jerks; I can’t control that. But there are a few things I can do to acknowledge my biases up front, play fair, and just generally be the kind of vendor that I personally would be happy to work with.

Five easy ways to make yourself a vendor worth listening to

So I gave Abby a short list of a few things I do to try and signal that I am a trustworthy voice, a vendor worth listening to. (What do you think, did I miss anything?)

🌸 Lead with your bias.🌸
I always try to disclose my own vested interest up front, and sometimes I exaggerate for effect: “As a vendor, I’m contractually obligated to say this”, or “Take it for what you will, obviously I have religious convictions here”. Everyone has biases; I prefer to talk to people who are aware of theirs.

🌸 Avoid cheap shots.🌸
Try to engage with the most powerful arguments for your competitors’ solutions. Don’t waste your time against straw men or slam dunks; go up against whatever ideal scenarios or “steel man” arguments they would muster in their own favor. Comparing your strengths vs their strengths results in a way more interesting, relevant and USEFUL discussion for all involved.

🌸 Be your own biggest critic.🌸
Be forthcoming about the flaws of your own solution. People love it when you are unafraid to list your own product’s shortcomings or where the competition shines, or describe the scenarios where other tools are genuinely superior or more cost-effective. It makes you look strong and confident, not weak.

What would you say about your own product as an engineer, or a customer? Say that.

🌸 You can still talk shit about software, just not your competitors’ software. 🌸
I try not to gratuitously snipe at our competitors. It’s fine to speak at length about technical problems, differentiation and tradeoffs, and to address how specifically your product compares with theirs. But confine your shit talking to categories of software where you don’t have a personal conflict of interest.

Like, I’m not going to get on twitter and take a swipe at a monitoring vendor (anymore 😇), but I might say rude things about a language, a framework, or a database I have no stake in, if I’m feeling punchy. ☺️ (This particular gem of advice comes by way of Adam Jacob.)

🌸 Be generous with your expertise.🌸
If you have spent years going deep on one gnarly problem, you might very well know that problem and its solution space more thoroughly than almost anyone else in the world. Do you know how many people you can help with that kind of mastery?! A few minutes from you could potentially spare someone days or weeks of floundering. This is a gift few can give.

It feels good, and it’s a nice break from battering your head against unsolvable problems. Don’t restrict your help to paying customers, and, obviously, don’t give self-serving advice. Maybe they can’t buy/don’t need your solution today, but maybe someday they will.

In conclusion

There’s a time and place for being oppositional. Sometimes a vendor gets all high on their own supply, or starts making claims that aren’t just an “optimistic” spin on the facts but are provably untrue. If any vendor is operating in poor faith they deserve to be corrected.

But it’s a shitty, self-limiting stance to take as a default. We are all here to build things, not tear things down. No one builds software alone. The code you write that defines your business is just the wee tippy top of a colossal iceberg of code written by other people — device drivers, libraries, databases, graphics cards, routers, emacs. All of this value was created by other people, yet we collectively benefit.

Think of how many gazillion lines of code are required for you to run just one AWS Lambda function! Think of how much cooperation and trust that represents. And think of all the deals that brokered that trust and established that value, compensating the makers and allowing them to keep building and improving the software we all rely on.

We build software together. Vendors exist to help you. We do what we do best, so you can spend your engineering cycles doing what you do best, working on your core product. Good sales deals don’t leave anyone feeling robbed or cheated, they leave both sides feeling happy and excited to collaborate.[4]

🐝💜Charity.

[1] Yes, I know this experience is far from universal; LOTS of people in tech have not felt like their voices are heard or their expertise acknowledged. This happens disproportionately to women and other under-represented groups, but it also happens to plenty of members of the dominant groups. It’s just a really common thing! However that has not really been my experience — or if it has, I haven’t noticed — nor Abby’s, as far as I’m aware.

[2] My first brush with columnar storage systems! Which is what makes Honeycomb possible today.

[3] I have learned SO MUCH from watching the world class sales professionals we have at Honeycomb. Sales is a tough gig, and doing it well involves many disciplines — empathy, creativity, business acumen, technical expertise, and so much more. Selling to software engineers in particular means you are often dealing with cocky little shits who think they could do your job with a few lines of code. On behalf of my fellow ~~little shits~~ engineers, I am sorry. 🙈

[4] Like our sales team says: “Never do a deal unless you’d do both sides of the deal.” I fucking love that.

How Much Should My Observability Stack Cost?

August 18, 2021 · July 14, 2023 · mipsytipsy · advice, cost, crossposted, vendors

First posted on 2021-08-18 at https://www.honeycomb.io/blog/how-much-should-my-observability-stack-cost

What should one pay for observability? What should your observability stack cost? What should be in your observability stack?

How much observability is enough? How much is too much, or is there such a thing?

Is it better to pay for one product that claims (dubiously) to do everything, or twenty products that are each optimized to do a different part of the problem super well?

It’s almost enough to make a busy engineer say “Screw it, I’m spinning up Nagios”.

(Hey, I said almost.)

All of these service providers can give you sticker shock when you begin investigating them. The biggest reason is always that we aren’t used to considering the price of our own time. We act like it’s “free” to just take an hour and spin something up … we don’t count the cost of maintenance, context switching, and opportunity costs of not using the time to build something of business value. Which is both understandable and forgivable, as a starting point.

Considerably less forgivable is the vagueness–and sometimes outright misdirection and scare tactics–some vendors offer around pricing. It’s not ok for a business to optimize for revenue at the expense of user experience. As users, we have the right to demand transparency and accurate information. As vendors, we have the responsibility to provide it. Any pricing scheme that doesn’t align with best practices and users’ interests will be a drag on reputation and growth.

The core question, rarely addressed outright, is: how much should you pay? In this post I’ll talk about what your observability costs include, and in the next post, what you should consider including in your “observability stack”.

But I’ll give you the answer to your question right off the bat: you should probably spend 20-30% of infra costs on observability.

O11y spend should be 20-30% of infra spend

Rule of thumb: your observability spend should come to 20-30% of your infra spend. (I’ve seen 10% a few times from reasonable-seeming shops, but they have been edge cases and outliers. I have also seen 50% or more, but again, outliers.)

Full disclosure: this isn’t based on any particular science. It’s just based on my experience of 15+ years working in operations engineering, talking to other engineers and managers, and a couple of informal Twitter polls to satisfy my own curiosity.

Nevertheless, it’s a pretty solid rule. There are exceptions, but in general, if you’re spending less than 20%, you’re “saving money” at the expense of engineering time, or being silently dragged underwater by a million little time leaks and quality of service issues — which you could eliminate completely with a bit of investment.

Consider the person who told me proudly that his o11y spend was just 1-3%. (He meant the PagerDuty bill and Pingdom checks, actually.) He wasn’t counting the dedicated hardware for their ELK cluster (80k/month), or the 2-3 extra engineers they had to recruit, train and hire (250-300k/year apiece) to run the many open source tools they got for “free”.

And ultimately, it didn’t meet their needs very well. Few people knew how to use it, so they leaned on the “observability team” to craft custom views, write scripts and ETL one-offs, and serve as the institutional hive mind and software usability tutors. They could have used better tools, ones under active development by large product teams. They could have used that headcount to create core business value instead.

Engineers cost money

Engineers are expensive. Recruiting them is hard. The good ones are increasingly unwilling to waste time on unnecessary labor. This manager was “saving” maybe a million dollars a year (he mentioned a vendor quote of less than 100k/month)–but spending a couple million more than that in less-visible ways.
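For the skeptics, here is the back-of-the-envelope version of that anecdote as a small sketch. It uses only the figures quoted above (the ELK hardware, the extra headcount, and the vendor quote); the function and constant names are made up for illustration, and everyone else's wasted time is left out because there is no number for it here.

```python
MONTHS_PER_YEAR = 12

# Figures quoted in the anecdote above (US dollars).
ELK_HARDWARE_PER_MONTH = 80_000
EXTRA_ENGINEERS = (2, 3)                      # low and high headcount
ENGINEER_COST_PER_YEAR = (250_000, 300_000)   # low and high cost apiece
VENDOR_QUOTE_PER_MONTH = 100_000              # "less than 100k/month", so an upper bound


def hidden_cost_range():
    """Annual (low, high) cost of the 'free' stack, from the quoted figures only."""
    hardware = ELK_HARDWARE_PER_MONTH * MONTHS_PER_YEAR
    low = hardware + EXTRA_ENGINEERS[0] * ENGINEER_COST_PER_YEAR[0]
    high = hardware + EXTRA_ENGINEERS[1] * ENGINEER_COST_PER_YEAR[1]
    return low, high


def vendor_quote_ceiling():
    """Upper bound on the annual vendor bill that was turned down."""
    return VENDOR_QUOTE_PER_MONTH * MONTHS_PER_YEAR


if __name__ == "__main__":
    low, high = hidden_cost_range()
    print(f"hardware + headcount: ${low:,} to ${high:,} per year")
    print(f"vendor quote ceiling: ${vendor_quote_ceiling():,} per year")
```

Hardware and headcount alone come to between $1,460,000 and $1,860,000 a year, against a vendor quote of at most $1,200,000 a year, and that is before counting any of the time everyone else lost to the tooling.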

Worse, he was driving his engineering org into the ground by wasting so much of their time and energy on non-mission-critical work, inferior tooling, one-offs, frustrating maintenance work, etc, all of which had nothing to do with their core business value.

If you want to know if an org hires and retains good engineers, you could do worse than to ask the question: “What tools do you use, and why?”

Good orgs use good tools. They know engineering cycles are their scarcest and most valuable resource, and they want to train maximum firepower on their core business problems.
Mediocre orgs use mediocre tools, have no discipline or consistency around adoption and deprecation, and leak lost engineering cycles everywhere.

So back to our rule of thumb: observability amounting to 20-30% of infra spend is where most shops should fall. This refers to cloud-native infrastructure, using third-party services to instrument and monitor code, with the basics covered — resource utilization graphs, end to end checks, paging, etc.
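If you want to sanity check your own numbers, the rule of thumb is just a ratio of observability spend to infra spend. A minimal sketch, with hypothetical function names and the 20-30% band taken straight from this post (it is a rule of thumb, not science):

```python
def o11y_share(o11y_spend, infra_spend):
    """Return observability spend as a fraction of infra spend."""
    if infra_spend <= 0:
        raise ValueError("infra_spend must be positive")
    return o11y_spend / infra_spend


def classify(share, low=0.20, high=0.30):
    """Place a share against the 20-30% rule of thumb from this post."""
    if share < low:
        return "below the rule of thumb"
    if share > high:
        return "above the rule of thumb"
    return "within the rule of thumb"
```

With made-up numbers, `classify(o11y_share(25_000, 100_000))` returns "within the rule of thumb", while a 3% share lands below it. Remember to count the invisible costs (hardware, headcount) in `o11y_spend`, or the answer will flatter you.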

So, what do I need in my “observability stack”?

What are the basics? Well, obviously “it depends”. It depends on your requirements, your components, your commitments, your budget, sunk costs and skill sets, your teams, and most expensive of all — customer expectations and the cost of violating them. You should think carefully about these things and try to draw a straight line from the business case to the money you spend (or don’t spend). And don’t forget to factor in those invisible human costs.

Outsource Your O11y: Now Roll It Out And Keep Them Happy (part 3/3)

February 13, 2019 · March 7, 2023 · mipsytipsy · aws, monitoring, operations, outsourcing, security, vendors

This is part three of a three-part series of guest posts:

How To Be A Champion, on how to choose a third-party vendor and champion them successfully to your security team. (George Chamales)
Get Aligned With Security, how to work with your security team to find the best possible outcome for all sides (Lilly Ryan)
Now Roll It Out And Keep Them Happy, on how to operationalize your service by rolling out the integration and maintaining it — and the relationship with your security team — over the long run (Andy Isaacson)

All this pain will someday be worth it. 🙏❤️ charity + friends

“Now Roll It Out And Keep Them Happy”

This is the third in a series of blog posts; previously we analyzed the security challenges of using a third party service, and we worked together with the security team to build empathy to deliver the project. You might want to read those first, since we are going to build on a lot of the ideas there to ship and maintain this integration.

Ready for launch

You’ve convinced the security team and other stakeholders, you’ve gotten the integration running, you’re getting promising results from dev-test or staging environments… now it’s time to move from proof-of-concept to full implementation. Depending on your situation this might be a transition from staging to production, or it might mean increasing a feature flipper flag from 5% to 100%, or it might mean increasing coverage of an integration from one API endpoint to cover your entire developer footprint.
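For the feature flag case, one common approach (an assumption here, since this post doesn't say how your flags are implemented) is to hash each user or host into a stable bucket, so that the 5% cohort stays the same between requests and remains included as the percentage grows. A minimal sketch, with the hypothetical names `in_rollout` and `vendor-integration`:

```python
import hashlib


def in_rollout(flag_name, unit_id, percent):
    """Return True if unit_id falls inside the first `percent` of buckets."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{flag_name}:{unit_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    for percent in (5, 25, 100):
        enabled = sum(in_rollout("vendor-integration", u, percent) for u in users)
        print(f"{percent}% flag -> {enabled} of {len(users)} users enabled")
```

Because the bucket depends only on the flag name and the ID, rolling back means lowering the percentage, and everyone who was enabled at 5% is still enabled at 100%.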

Taking into account Murphy’s Law, we expect that some things will go wrong during the rollout. Perhaps while expanding coverage, a developer realizes that the schema designed to handle the app’s event mechanism can’t represent a scenario, requiring a redesign or a hacky solution. Or perhaps the metrics dashboard shows elevated error rates from the API frontend, and while there’s no smoking gun, the ops oncall decides to roll back the integration Just In Case it’s causing the incident.

This gives us another chance to practice empathy — while it’s easy, wearing the champion hat, to dismiss any issues found by looking for someone to blame, ultimately this poisons trust within your organization and will hamper success. It’s more effective, in the long run (and often even in the short run), to find common ground with your peers in other disciplines and teams, and work through to solutions that satisfy everybody.

Keeping the lights on

In all likelihood as integration succeeds, the team will rapidly develop experts and expertise, as well as idiomatic ways to use the product. Let the experts surprise you; folks you might not expect can step up when given a chance. Expertise flourishes when given guidance and goals; as the team becomes comfortable with the integration, explicitly recognize a leader or point person for each vendor relationship. Having one person explicitly responsible for a relationship lets them pay attention to those vendor emails and updates, and avoids the tragedy of the “but I thought *you* were” commons. This Integration Lead is also a center of knowledge transfer for your organization. They won’t know everything or help every user come up to speed, but they can help empower the local power users in each team to ramp up their teams on the integration.

As comfort grows you will start to consider ways to change your usage, for example growing into new kinds of data. This is a good time to revisit that security checklist — does the change increase PII exposure to your vendor? Would the new data lead to additional requirements such as per-field encryption? Don’t let these security concerns block you from gaining valuable insight using the new tool, but do take the chance to talk it over with your security experts as appropriate.

Throughout this organic growth, the Integration Lead remains core to managing your changing profile of usage of the vendor they shepherd; as new categories of data are added to the integration, the Lead has responsibility to ensure that the vendor relationship and risk profile are well matched to the needs that the new usage (and presumably, business value) is placing on the relationship.

Documenting the Integration Lead role and responsibilities is critical. The team should know when to check in, and writing it down helps it happen. When new code has a security implication, or a new use case potentially amplifies the cost of an integration, bringing the domain expert in will avoid unhappy surprises. Knowing how to find out who to bring in, and when to bring them in, will keep your team getting the right eyes on their changes.

Security threats and other challenges change over time, too. Collaborating with your security team so that they know what systems are in use helps your team take note of new information that is relevant to your business. A simple example is noting when your vendors publish a breach announcement, but more complex examples happen too — your vendor transitions cloud providers from AWS to Azure and the security team gets an alert about unexpected data flows from your production cluster; with transparency and trust such events become part of a routine process rather than an emergency.

It’s all operational

Monitoring and alerting are a fact of operations life, and this has to include vendor integrations (even when the vendor integration is a monitoring product). All of your operations best practices are needed here: keep your alerts clean and actionable so that you don’t develop pager fatigue, and monitor performance of the integration so that you don’t get blindsided by a creeping latency monster in your APIs.
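As a rough illustration of watching for that latency creep, the sketch below times each call into the vendor integration and keeps a rolling 95th percentile. The class name, window size, and 0.5 second budget are all assumptions made up for this example; in practice you would more likely feed these timings into whatever monitoring stack you already run.

```python
import time
from collections import deque


class LatencyMonitor:
    """Keep the most recent call durations and flag a slow 95th percentile."""

    def __init__(self, window: int = 200, p95_budget_seconds: float = 0.5):
        # Window size and budget are illustrative defaults, not recommendations.
        self.samples = deque(maxlen=window)
        self.p95_budget_seconds = p95_budget_seconds

    def record(self, seconds: float) -> None:
        self.samples.append(seconds)

    def p95(self) -> float:
        """Nearest-rank 95th percentile of the current window (0.0 if empty)."""
        if not self.samples:
            return 0.0
        ordered = sorted(self.samples)
        rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
        return ordered[rank - 1]

    def over_budget(self) -> bool:
        return self.p95() > self.p95_budget_seconds

    def timed(self, func, *args, **kwargs):
        """Call func, recording how long it took even if it raises."""
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            self.record(time.perf_counter() - start)
```

Wrapping the vendor client call in `monitor.timed(...)` and alerting when `over_budget()` flips keeps the alert tied to something actionable: the integration has become slow for a meaningful share of recent calls.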

Authentication and authorization are changing as the threat landscape evolves and industry moves from SMS verification codes to U2F/WebAuthn. Does your vendor support your SSO integration? If they can’t support the same SSO that you use everywhere else and can’t add it — or worse, look confused when you mention SSO — that’s probably a sign you should consider a different vendor.

A beautiful sunset

Have a plan beforehand for what needs to be done should you stop using the service. Got any mobile apps that depend on APIs that will go away or start returning permission errors? Be sure to test these scenarios ahead of time.
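One way to rehearse the shutdown scenario is to put the vendor call behind a small wrapper and feed it the failures you expect to see after the contract ends. This is only a sketch: `send_telemetry`, the `post` callable, and the list of status codes are assumptions for illustration, not any vendor’s real client or documented behaviour.

```python
from typing import Callable, Dict

# Status codes a retired or revoked vendor account might plausibly return
# (an assumption for this sketch; check what your vendor actually does).
SHUTDOWN_STATUSES = {401, 403, 404, 410}


def send_telemetry(post: Callable[[Dict], int], event: Dict) -> str:
    """Send one event; never let a vendor failure break the caller."""
    try:
        status = post(event)
    except OSError:
        # Covers connection errors such as DNS failures or refused connections.
        return "dropped: vendor unreachable"
    if status in SHUTDOWN_STATUSES:
        return "dropped: vendor rejected request"
    if 200 <= status < 300:
        return "sent"
    return "dropped: unexpected status"


if __name__ == "__main__":
    def revoked_vendor(event: Dict) -> int:
        # Fake transport standing in for the vendor after access is revoked.
        return 403

    print(send_telemetry(revoked_vendor, {"endpoint": "/checkout"}))
```

Passing in a fake `post` function makes it cheap to simulate the API going away in a test, long before it happens for real.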

What happens at contract termination to data stored on the service? Do you need to explicitly delete data when ceasing use?

Do you need to remove integrations from your systems before ending the commercial relationship, or can the technical shutdown and business shutdown run in parallel?

In all likelihood these are contingency plans that will never be needed, and they don’t need to be fully fleshed out to start, but a little bit of forethought can avoid unpleasant surprises.

Year after year

Industry best practice and common sense dictate that you should revisit the security questionnaire annually (if not more frequently). Use this chance to take stock of the last year and check in — are you getting value from the service? What has changed in your business needs and the competitive landscape?

It’s entirely possible that a new year brings new challenges, which could make your current vendor even more valuable (time to negotiate a better contract rate!) or could mean you’d do better with a competing service. Has the vendor gone through any major changes? They might have new offerings that suit your needs well, or they may have pivoted away from the features you need.

Check in with your friends on the security team as well; standards evolve, and last year’s sufficient solution might not be good enough for new requirements.

Andy thinks out loud about security, society, and the problems with computers on Twitter.

❤️ Thanks so much for reading, folks. Please feel free to drop any complaints, comments, or additional tips to us in the comments, or direct them to me on Twitter.

Have fun! Stay (a little bit) Paranoid!!

— charity

Outsource Your O11y: Get Aligned With Security (part 2/3)

February 13, 2019 mipsytipsy aws, engineers, monitoring, operations, outsourcing, security, sre, vendors 1 Comment

This is part two of a three-part series of guest posts:

How To Be A Champion, on how to choose a third-party vendor and champion them successfully to your security team. (George Chamales)
Get Aligned With Security, how to work with your security team to find the best possible outcome for all sides (Lilly Ryan)
Now Roll It Out And Keep Them Happy, on how to operationalize your service by rolling out the integration and maintaining it — and the relationship with your security team — over the long run (Andy Isaacson)

All this pain will someday be worth it. 🙏❤️ charity + friends

“Get Aligned With Security”

by Lilly Ryan

If your team has decided on a third-party service to help you gather data and debug product issues, how do you convince an often overeager internal security team to help you adopt it?

When this service is something that provides a pathway for developers to access production data, as analytics tools often do, making the case for access to that data can screech to a halt at the mention of the word “production”. Progressing past that point will take time, empathy, and consideration.

I have been on both sides of the “adopting a new service” fence: as a developer hoping to introduce something new and useful to our stack, and now as a security professional who spends her days trying to bust holes in other people’s setups. I understand both sides of the sometimes-conflicting needs to both ship software and to keep systems safe.

This guide has advice to help you solve the immediate problem of choosing and deploying a third-party service with the approval of your security team. But it also has advice for how to strengthen the working relationship between your security and development teams over the longer term. No two companies are the same, so please adapt these ideas to fit your circumstances.

Understanding the security mindset

The biggest problems in technology are never really about technology, but about people. Seeing your security team as people and understanding where they are coming from will help you to establish empathy with them so that both of you want to help each other get what you want, not block each other.

First, understand where your security team is coming from. Development teams need to build features, improve the product, understand and ship good code. Security teams need to make sure you don’t end up on the cover of the NYT for data breaches, that your business isn’t halted by ransomware, and that you’re not building your product on a vulnerable stack.

This can be an unfamiliar frame of mind for developers. Software development tends to attract positive-minded people who love creating things and are excited about the possibilities of new technology. Software security tends to attract negative thinkers who are skilled at finding all the flaws in a system. These are very different mentalities, and the people who occupy them tend to have very different assumptions, vocabularies, and worldviews.

But if you and your security team can’t share the same worldview, it will be hard to trust each other and come to agreement. This is where practicing empathy can be helpful.

Before approaching your security team with your request to approve a new vendor, you may want to run some practice exercises for putting yourselves in their shoes and forcing yourselves to deliberately cultivate a negative thinking mindset to experience how they may react — not just in terms of the objective risk to the business, or the compliance headaches it might cause, but also what arguments might resonate with them and what emotional reactions they might have.

My favourite exercise for getting teams to think negatively is what I call the Land Astronaut approach.

The “Land Astronaut” Game

Imagine you are an astronaut on the International Space Station. Literally everything you do in space has death as a highly possible outcome. So astronauts spend a lot of time analysing, re-enacting, and optimizing their reactions to events, until it becomes muscle memory. By expecting and training for failure, astronauts use negative thinking to anticipate and mitigate flaws before they happen. It makes their chances of survival greater and their people ready for any crisis.

Your project may not be as high-stakes as a space mission, and your feet will most likely remain on the ground for the duration of your work, but you can bet your security team is regularly indulging in worst-case astronaut-type thinking. You and your team should try it, too.

The Game:

Pick a service for you and your team to game out. Schedule an hour, book a room with a whiteboard, put on your Land Astronaut helmets. Then tell your team to spend half an hour brainstorming about all the terrible things that can happen to that service, or to the rest of your stack when that service is introduced. Negative thoughts only!

Start brainstorming together. Start out by being as outlandish as possible (what happens if their data centre is suddenly overrun by a stampede of elephants?). Eventually you will find that you’ll tire of the extreme worst case scenarios and come to consider more realistic outcomes, some of which you may not have thought of outside of the structure of the activity.

After half an hour, or whenever you feel like you’re all done brainstorming, take off your Land Astronaut helmets, sift out the most plausible of the worst case scenarios, and try to come up with answers or strategies that will help you counteract them. Which risks are plausible enough that you should mitigate them? Which are you prepared to gamble on never happening? How will this risk calculus change as your company grows and takes on more exposure?

Doing this with your team will allow you all to practice the negative thinking mindset together and get a feel for how your colleagues in the security team might approach this request. (While this may seem similar to threat modelling exercises you might have done in the past, the focus here is on learning to adopt a security mindset and gaining empathy for this thought process, rather than running through a technical checklist of common areas of concern.)

While you still have your helmets within reach, use your negative thinking mindset to fill out the spreadsheet from the first piece in this series. This will help you anticipate most of the reasonable objections security might raise, and may help you include useful detail the security team might not have known to ask for.

Once you have prepared your list of answers to George’s worksheet and held a team Land Astronaut session together, you will have come most of the way to getting on board with the way your security team thinks.

Preparing for compromise

You’ve considered your options carefully, you’ve learned how to harness negative thinking to your advantage, and you’re ready to talk to your colleagues in security – but sometimes, even with all of these tools at your disposal, you may not walk away with all of the things you are hoping for.

Being willing to compromise and anticipating some of those compromises before you approach the security team will help you negotiate more successfully.

While your Land Astronaut helmets are still within reach, consider using your negative thinking mindset game to identify areas where you may be asked to compromise. If you’re asking for production access to this new service for observability and debugging purposes, think about what kinds of objections may be raised about this and how you might counter them or accommodate them. Consider continuing the activity with half of the team remaining in the Land Astronaut role while the other half advocates from a positive thinking standpoint. This dynamic will get you having conversations about compromise early on, so that when the security team inevitably raises eyebrows, you are ready with answers.

Be prepared to consider compromises you had not anticipated, and enter into discussions with the security team with as open a mind as possible. Remember the team is balancing priorities of not only your team, but other business and development teams as well. If you and your security colleagues are doing the hard work to meet each other halfway then you are more likely to arrive at a solution that satisfies both parties.

Working together for the long term

While the previous strategies we’ve covered focus on short-term outcomes, in this continuous-deployment, shift-left world we now live in, the best way to convince your security team of the benefits of a third-party service – or any other decision – is to have them along from day one, as part of the team.

Roles and teams are increasingly fluid and boundary-crossing, yet security remains one of the roles least likely to be considered for inclusion on a software development team. Even in 2019, the task of ensuring that your product and stack are secure and well-defended is often left until the end of the development cycle. This contributes a great deal to the combative atmosphere that is common.

Bringing security people into the development process much earlier builds rapport and prevents these adversarial, territorial dynamics. Consider working together to build Disaster Recovery plans and coordinating for shared production ownership.

If your organisation isn’t ready for that kind of structural shift, there are other ways to work together more closely with your security colleagues.

Try having members of your team spend a week or two embedded with the security team. You may even consider a rolling exchange – a developer for a security team member – so that developers build the security mindset, and the security team is able to understand the problems your team is facing (and why you are looking at introducing this new service).

At the very least, you should make regular time to meet with the security team, get to know them as people, and avoid springing things on them late in the project when change is hardest.

Riding off together into the sunset…?

If you’ve taken the time to get to know your security team and how they think, you’ll hopefully be able to get what you want from them – or perhaps you’ll understand why their objections were valid, and come up with a better solution that works well for both of you.

Investing in a strong relationship between your development and security teams will rarely lead to the apocalypse. Instead, you’ll end up with a better product, probably some new work friends, and maybe an exciting idea for a boundary-crossing new career in tech.

But this story isn’t over! Once you get the green light from security, you’ll need to think about how to roll your new service out safely, maintain it, and consider its full lifespan within your company. Which leads us to part three of this series, on rolling it out and maintaining it … both your integration and your relationship with the security team.

Lilly Ryan is a pen tester, Python wrangler, and recovering historian from Melbourne. She writes and speaks internationally about ethical software, social identities after death, teamwork, and the telegraph. More recently she has researched the domestic use of arsenic in Victorian England, attempted urban camouflage, reverse engineered APIs, wielded the Oxford comma, and baked a really good lemon shortbread.

Outsource Your O11y: How To Be A Champion (part 1/3)

February 13, 2019 mipsytipsy best practices, operations, outsourcing, security, software, sre, vendors 1 Comment

I hear variations on this question constantly: “I’d really like to use a service like Honeycomb for my observability, but I’m told I can’t ship any data off site. Do you have any advice on how to convince my security team to let me?”

I’ve given lots of answers, most of them unsatisfactory. “Strip the PII/PHI from your operational data.” “Validate server side.” “Use our secure tenancy proxy.” (I’m not bad at security from a technical perspective, but I am not fluent with the local lingo, and I’ve never actually worked with an in-house security team; I’ve always *been* the security team, de facto as it may be.)

So I’ve invited three experts to share their wisdom in a three-part series of guest posts:

How To Be A Champion, on how to choose a third-party vendor and champion them successfully to your security team. (George Chamales)
Get Aligned With Security, how to work with your security team to find the best possible outcome for all sides (Lilly Ryan)
Now Roll It Out And Keep Them Happy, on how to operationalize your service by rolling out the integration and maintaining it — and the relationship with your security team — over the long run (Andy Isaacson)

My ✨first-ever guest posts✨! Yippee. I hope these are useful to you, wherever you are in the process of outsourcing your tools. You are on the right path: outsourcing your observability to a vendor for whom it’s their One Job is almost always the right call, in terms of money and time and focus — and yes, even security.

All this pain will someday be worth it. 🙏❤️ charity + friends

“How to be a Champion”

by George Chamales

You’ve found a third party service you want to bring into your company, hooray!

To you, it’s an opportunity to deploy new features in a flash, juice your team’s productivity, and save boatloads of money.

To your security and compliance teams, it’s a chance to lose your customers’ data, cause your applications to fall over, and do inordinate damage to your company’s reputation and bottom line.

The good news is, you’re absolutely right. The bad news is, so are they.

Successfully championing a new service inside your organization will require you to convince people that the rewards of the new service are greater than the risks it will introduce (there’s a guide below to help you).

You’re convinced the rewards are real. Let’s talk about the risks.

The past year has seen cases of hackers using third party services to target everything from government agencies, to activists, to Target…again. Not to be outdone, attention-seeking security companies have been actively hunting for companies exposing customer data then issuing splashy press releases as a means to flog their products and services.

A key feature of these name-and-shame campaigns is to make sure that the headlines are rounded up to the most popular customer – the clickbait lead “MBM Inc. Loses Customer Data” is nowhere near as catchy as “Walmart Jewelry Partner Exposes Personal Data Of 1.3M Customers.”

While there are scary stories out there, in many, many cases the risks will be outweighed by the rewards. Telling the difference between those innumerable good calls and the one career-limiting move requires thoughtful consideration and some up-front risk mitigation.

When choosing a third party service, keep the following in mind:

- The security risks of a service are highly dependent on how you use it.
  You can adjust your usage to decrease your risk. There’s a big difference between sending a third party your server metrics vs. your customers’ personal information. Operational metrics are categorically less sensitive than, say, PII or PHI (if you have scrubbed them properly).
- There’s no way to know how good a service’s security really is.
  History is full of compromised companies who had very pretty security pages and certifications (here’s Equifax circa September 2017). Security features are a stronger indicator, but there are a lot more moving parts that go into maintaining a service’s security.
- Always weigh the risks vs. the rewards.
  There’s risk no matter what you do – bringing in the service is risky, doing nothing is risky. You can only mitigate risks up to a point. Beyond that point, it’s the rewards that make risks worthwhile.
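To make the “scrubbed them properly” caveat a little more concrete, here is a minimal sketch of stripping obviously sensitive fields from an event before it leaves your network. The field names in `SENSITIVE_FIELDS` and the email regex are assumptions for illustration. A deny-list like this is easy to get wrong, and an allow-list of known-safe fields is usually the more conservative choice.

```python
import re
from typing import Any, Dict

# Hypothetical field names; your own events will differ.
SENSITIVE_FIELDS = {"email", "password", "ssn", "phone", "full_name"}

# Deliberately simple pattern: catches common email shapes, not every valid one.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub_event(event: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the event with sensitive fields dropped and emails masked."""
    clean: Dict[str, Any] = {}
    for key, value in event.items():
        if key.lower() in SENSITIVE_FIELDS:
            continue  # drop the whole field
        if isinstance(value, str):
            value = EMAIL_PATTERN.sub("[redacted-email]", value)
        clean[key] = value
    return clean


if __name__ == "__main__":
    event = {
        "endpoint": "/checkout",
        "duration_ms": 120,
        "email": "alice@example.com",
        "error": "lookup failed for bob@example.com",
    }
    print(scrub_event(event))
```

Running the scrubber in one place, right before the data is shipped, also gives your security team a single piece of code to review.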

Context is critical in understanding the risks and rewards of a new service.

You can use the following guide to put things in context as you champion a new service through the gauntlet of management, security, and compliance teams. That context becomes even more powerful when you can think about the approval process from the perspective of the folks you’ll need to win over to get the okay to move forward.

In the next part of this series Lilly Ryan shares a variety of techniques to take on the perspective of your management, security and compliance teams, enabling you to constructively work through responses that can include everything from “We have concerns…” to “No” to “Oh Helllllllll No.”

Championing a new service is hard – it can be equally worthwhile. Good luck!

George Chamales is a useful person to have around. Please send critiques of this post to george@criticalsec.com

“A Security Guide for Third Party Services” Worksheet

Note to thoughtful service providers: You may want to fill parts of this out ahead of time and give it to your prospective customers. It will provide your champion with good fortune in the compliance wars to come. (Also available as a nicely formatted spreadsheet.)

Our Reasons
Why this service?	This is the justification for the service – the compelling rewards that will outweigh the inevitable risks. What will be true once the service is online? Good reasons are ones that a fifth grader would understand.
Our Data
Data it will / won’t collect?	Describe the classes or types of data the service will access / store and why that’s necessary for the service to operate. If there are specific types of sensitive data the service won’t collect (e.g. passwords, Personally Identifiable Information, Patient Health Information) explicitly call them out.
How will data be accessed?	Describe the process for getting data to the service. Do you have to run their code on your servers, on your customers’ computers?
Our Costs
Costs of NOT doing it?	These are the financial risks / liabilities of not going with this service. What’s the worst and average cost? Have you had costly problems in the past that could have been avoided if you were using this service?
Costs of doing it?	Include the cost for the service and, if possible, the amount of person-time it’s going to take to operate the service. Ideally less than the cost of not doing it.
Our Risk – how mad will important people be…
If it’s compromised.	What would happen if hackers or attention-seeking security companies publicly released the data you sent the service? Is it catastrophic or an annoyance?
When it goes down?	When this service goes down (and it will go down), will it be a minor inconvenience or will it take out your primary application and infuriate your most valuable customers?
Their Security – in order of importance
SSO & 2FA Support?	This is a security smoke test: If a service doesn’t support SSO or 2FA, it’s safe to assume that they don’t prioritize security. Also a good idea to investigate SSO support up front since some vendors charge extra for it (which is a shame).
Fine-grained permissions?	This is another key indicator of the service’s maturity level since it takes time and effort to build in. It’s also something else they might make you pay extra for.
Security certifications?	These aren’t guarantees of quality, but they do indicate that the company has put some effort and money into its processes. Check their website for general security compliance merit badges such as SOC2, ISO27001 or industry-specific things like PCI or HIPAA.
Security & privacy pages?	If there are, it means that they’re willing to publicly state that they do something about security. The more specific and detailed, the better.
Vendor’s security history?	Have there been any spectacular breaches that demonstrated a callous disregard for security, gross incompetence, or both?
BONUS Questions	Want to really poke and prod the internal security of your vendor? Ask if they can answer the following questions: How many known vulnerabilities (CVEs) exist on your production infrastructure right now? At what time (exactly) was the last successful backup of all your customer data completed? What were the last three secrets accessed in the production environment?
Our Decision
Is it worth it?	Look back through the previous sections and ask whether it makes sense to: (a) use the 3rd party service, (b) build it yourself, or (c) not do it at all. Would a thoughtful person agree with you?

Ten Platform Commandments

October 24, 2018 July 14, 2023 mipsytipsy engineering, platform engineering, vendors 3 Comments

On Monday I gave a talk at DOES18 called “All the World’s a Platform”, where I talked about a bunch of the lessons learned by using and abusing and running and building platforms at scale.

I promised to do a blog post with the takeaways, so here they are.

Platform Commandment #1: Any time you have to think about one particular user, you have failed in some way. It doesn’t scale. Just a few one-offs a day will drag you down and drown your forward momentum.

Corollary: you will always have to do this every day. Solution: turn one-offs into a support problem, not an engineering problem.

Platform Commandment #2: keep your critical path as small and independent as possible. Have explicit tiers of importance. You cannot care about everything equally, sacrifices must be made.

Example: at Parse the core API was tier 1, push was tier 2, website was somewhere down around tier 10. We always knew what to bring up and care about first.

Platform Commandment #3: It is the job of the platform to protect itself at all costs, including at the expense of your app.

Platform Commandment #4: Remember that your platform is a magical black box to your users. You can’t expect them to behave reasonably without feedback loops and a rich mental model. Help them out — esp your super-users. It will save you time if you can help them help themselves.

Platform Commandment #5: Always expose a visible request id, shard id, uuid, trace id, any other relevant diagnostic information in user-visible errors. Up to the point where it reveals too much exploitable information about your service, which is probably much farther than you think. Poorly obfuscated infrastructure decisions are usually less of a threat to your business than befuddled users are.
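A minimal sketch of what that can look like in practice: every error body carries the identifiers a user can paste into a support ticket. The JSON shape and the function names here are assumptions for illustration, not how any particular platform does it.

```python
import json
import uuid
from typing import Optional


def new_request_id() -> str:
    """Generate an opaque id to attach to a request and to every log line for it."""
    return uuid.uuid4().hex


def error_response(status: int, message: str, request_id: str,
                   shard_id: Optional[str] = None) -> str:
    """Build a user-visible error body that includes diagnostic ids."""
    error = {"status": status, "message": message, "request_id": request_id}
    if shard_id is not None:
        error["shard_id"] = shard_id
    return json.dumps({"error": error}, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical shard name, for illustration only.
    print(error_response(503, "backend unavailable", new_request_id(), shard_id="shard-17"))
```

When the user reports the problem, the request id takes you straight to the matching events instead of a guessing game about timestamps.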

Platform Commandment #6: Your observability must center your users’ perspective, not your own. The health of the system doesn’t matter. The health of every request, and every high-cardinality grouping of requests — those are what matter.

You must be able to care about and inspect the perf and quality from the perspective of every single application and/or user and their users, as richly as though theirs was the *only* application. In real-time.

Dashboards are practically useless unless you can drill down into them. Top-10 lists are useless — your biggest customers may not be your most important customers.

Solution: Invest in tooling (like Honeycomb) that lets you slice and dice on dimensions of arbitrary cardinality, so you can do things like a) break down by one uuid out of millions, b) break down by endpoint, latency percentile, raw query, data store, etc — to see what the experience actually looks like for that user, not for a high level aggregate like a dashboard.
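To illustrate why the aggregate lies, here is a toy sketch that groups raw events by any combination of fields and computes a latency percentile per group. The event fields (`app_id`, `endpoint`, `duration_ms`) are made up for this example, and a real tool does this over far more data than fits in a Python list.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Sequence, Tuple


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list, pct given as an integer 1..100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[max(rank, 1) - 1]


def breakdown(events: Iterable[Dict[str, Any]], keys: Sequence[str],
              pct: int = 99) -> Dict[Tuple, float]:
    """Group events by the given fields and return the latency percentile per group."""
    groups: Dict[Tuple, List[float]] = defaultdict(list)
    for event in events:
        groups[tuple(event[k] for k in keys)].append(event["duration_ms"])
    return {group: percentile(durations, pct) for group, durations in groups.items()}


if __name__ == "__main__":
    events = [
        {"app_id": "a", "endpoint": "/login", "duration_ms": 10},
        {"app_id": "a", "endpoint": "/login", "duration_ms": 12},
        {"app_id": "a", "endpoint": "/login", "duration_ms": 11},
        {"app_id": "a", "endpoint": "/query", "duration_ms": 50},
        {"app_id": "b", "endpoint": "/login", "duration_ms": 900},
    ]
    print(breakdown(events, [], pct=50))          # overall median looks healthy
    print(breakdown(events, ["app_id"], pct=50))  # app "b" is having a terrible day
```

The overall median is dominated by the busy, healthy app. Only the per-app breakdown shows that one customer’s every request is slow.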

Platform Commandment #7: Use end-to-end checks to traverse all the key code paths and architecture paths.

You will be tempted to disable them because they seem flappy and flaky and need to be fixed. But this is actually what your users are suffering through every day they use your platform. Don’t disable them, fix them.

Platform Commandment #8: Invest early in every kind of throttle, blacklist, velvet rope, in-flight rewrite, custom url/error responder, content inspection, etc … both partial and total, for every slice of events or users. You will need all these fine-grained controls to keep your platform alive for 99.9% of users while you debug the .1% who are outliers and bad actors.
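As one small example of such a control, here is a sketch of a per-key throttle built on a token bucket, with a block list for total shutoff. The class name, the parameters, and the idea of keying on an app id are assumptions for illustration. A production version would need shared state across hosts and some locking.

```python
import time
from typing import Callable, Dict, Set, Tuple


class Throttle:
    """Per-key token bucket with an explicit block list (single-process sketch)."""

    def __init__(self, rate_per_second: float, burst: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_second
        self.burst = burst
        self.clock = clock  # injectable so tests can control time
        self.blocked: Set[str] = set()
        self._buckets: Dict[str, Tuple[float, float]] = {}  # key -> (tokens, last_seen)

    def allow(self, key: str) -> bool:
        """Return True if this key may proceed, consuming one token if so."""
        if key in self.blocked:
            return False  # total block for known bad actors
        now = self.clock()
        tokens, last = self._buckets.get(key, (float(self.burst), now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._buckets[key] = (tokens - 1.0, now)
            return True
        self._buckets[key] = (tokens, now)
        return False
```

Because each key has its own bucket, one noisy app burns through its own allowance without touching anyone else’s, and adding a key to `blocked` is the blunt instrument for when partial throttling is not enough.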

Platform Commandment #9: And use a multi-threaded language ffs.

Platform Commandment #10: USE YOUR OWN PLATFORM. For work, if possible. Feel the pain that you inflict on others.

Bonus Commandment: all cotenancy isolation guarantees are bullshit**

**from a perf standpoint, not security