We hear this phrase constantly: “I worked at breaking down silos.” “We need to break down silos.” “What did I do in my last role? I broke down silos.”
It sets my fucking teeth on edge.
What is a ‘silo’, anyway? What specifically wasn’t working well, and how did you solve it? Or, if it was solved by someone else, what was your contribution to the solution? Did you just follow orders, did you personally diagnose the problem, did some of your suggestions pan out?
Solutions to complex problems rarely work on the first go, so … what else did you try? How did you know it wasn’t working, and how did you know when to abandon earlier ideas? It’s fiendishly hard to know whether you’ve given a solution enough time to bake, for people to adjust, so that you can even evaluate whether things work better or worse than before.
Communication is not magic pixie dust
Breaking down silos is supposed to be about increasing communication, removing barriers and roadblocks to collaboration.
But you can’t just blindly throw “more communication” at your teams. Too much communication can be just as much of a problem and a burden as too little. It can distract, and confuse, and create little eddies of information that is incorrect or harmful.
The quantity of the communication isn’t the issue, so much as the quality. Who is talking to whom, and when, and why? How does information flow throughout your company? Who gets left out? Whose input is sought, and when, and why? How can any given individual figure out who to talk to about any given responsibility?
When someone says they are “breaking down silos”, whether in an interview, a panel, or casual conversation, it tells me jack shit about what they actually did.
cliches are a substitute for critical thinking
It’s just like when people say “it’s a culture problem”, or “fix your culture”, or “everything is about people”. These phrases tell me nothing except that the speaker has gone to a lot of conferences and wants to sound cool.
If someone says “breaking down silos”, it immediately generates a zillion questions in my mind. I’m curious, because these problems are genuinely hard and people who solve them are incredibly rare.
Unfortunately, the people who use these phrases are almost never the ones who are out there in the muck and grind, struggling to solve real problems.
When asked, people who have done the hard labor of building better organizations with healthy communication flows, less inefficiency, and alignment around a single mission — people who have gotten all the people rowing in the same direction — tend to talk about the work.
People who haven’t, say they were “breaking down silos.”
There exist some wonderful teams out there who have valid, well-thought-through, legitimate reasons for enforcing “NO FRIDAY DEPLOYS” week in and week out, for not hooking CI/CD up to autodeploy, and for not shipping one person’s changes at a time. Far more often, though, the reasons I hear sound like these:
“Maybe that works for some teams, like baby startups, but not real software systems that people rely on” (false-uniqueness bias)
“We’ve already invested a ton of engineering hours into building a deployment framework that doesn’t ship especially fast and doesn’t ship one change at a time, but it works, and we don’t want to have to redo everything from scratch.” (surrogation, ostrich effect, irrational escalation, IKEA effect, law of the instrument)
“I get why it’s important, but the most important thing right now is for us to ship all of these features. At some point we’ll have the spare time to fix our deploys.” (hyperbolic discounting)
“Deploys are just inherently scary; there’s nothing that can be done about that. You should do them sparingly and with someone monitoring them closely.” (availability bias, dread aversion, functional fixedness)
“There is nothing anyone could say or do to convince me that the best thing I can do as a manager to protect my team’s weekends is not blocking Friday deploys. I just feel it in my gut. I just know.” (… I got nuthin)
We’re humans. 💜 We leap to conclusions with the wetware we have doing the best it can based on heuristics that feel objectively true, but are ultimately just emotional reactions based on past lived experience. And then we retroactively enshrine those goofy gut feelings with the language of noble motive and moral values.
“I tell people not to deploy to production … because I care so deeply about my team and their ability to have a quiet weekend.”
Barf. 🙄 That’s just like saying you tell your kid not to brush his teeth at night, because you care SO DEEPLY about him and his ability to go to bed calm and happy.
Once the retcon engine in your brain gets running, it comes up with all sorts of reasons. Plausible-sounding reasons! But every single one of the arguments in the list above is materially false.
Deploy myths are never going away for good; they appeal to too many of our cognitive biases. But what if there was one simple thing you could do that would invert many of these cognitive biases and cause people to grapple with the question in a new way? What if you could kickstart a recalculation?
My next post will pick up right here. I’ll tell you all about the One Simple Trick you can do to fix your deploys and set you on the virtuous path of high-performing teams.
Til then, here’s what I’ve previously written on the topic.
Availability bias: The tendency to overestimate the likelihood of events with greater “availability” in memory, which can be influenced by how recent the memories are or how unusual or emotionally charged they may be.
Continued influence effect: The tendency to believe previously learned misinformation even after it has been corrected. Misinformation can still influence inferences one generates after a correction has occurred.
Conservatism bias: The tendency to revise one’s belief insufficiently when presented with new evidence.
Default effect: When given a choice between several options, the tendency to favor the default one.
Dread aversion: Just as losses yield double the emotional impact of gains, dread yields double the emotional impact of savouring.
False-uniqueness bias: The tendency of people to see their projects and themselves as more singular than they actually are.
Functional fixedness: Limits a person to using an object only in the way it is traditionally used.
Hyperbolic discounting: Discounting is the tendency for people to have a stronger preference for more immediate payoffs relative to later payoffs. Hyperbolic discounting leads to choices that are inconsistent over time – people make choices today that their future selves would prefer not to have made, despite using the same reasoning.
IKEA effect: The tendency for people to place a disproportionately high value on objects that they partially assembled themselves, such as furniture from IKEA, regardless of the quality of the end product.
Illusory truth effect: A tendency to believe that a statement is true if it is easier to process, or if it has been stated multiple times, regardless of its actual veracity.
Irrational escalation: The phenomenon where people justify increased investment in a decision, based on the cumulative prior investment, despite new evidence suggesting that the decision was probably wrong. Also known as the sunk cost fallacy.
Law of the instrument: An over-reliance on a familiar tool or method, ignoring or under-valuing alternative approaches. “If all you have is a hammer, everything looks like a nail.”
Mere exposure effect: The tendency to express undue liking for things merely because of familiarity with them.
Negativity bias: Psychological phenomenon by which humans have a greater recall of unpleasant memories compared with positive memories.
Non-adaptive choice switching: After experiencing a bad outcome with a decision problem, the tendency to avoid the choice previously made when faced with the same decision problem again, even though the choice was optimal.
Omission bias: The tendency to judge harmful actions (commissions) as worse, or less moral, than equally harmful inactions (omissions).
Ostrich effect: Ignoring an obvious (negative) situation.
Plan continuation bias: Failure to recognize that the original plan of action is no longer appropriate for a changing situation or for a situation that is different than anticipated.
Prevention bias: When investing money to protect against risks, decision makers perceive that a dollar spent on prevention buys more security than a dollar spent on timely detection and response, even when investing in either option is equally effective.
Pseudocertainty effect: The tendency to make risk-averse choices if the expected outcome is positive, but make risk-seeking choices to avoid negative outcomes.
Salience bias: The tendency to focus on items that are more prominent or emotionally striking and ignore those that are unremarkable, even though this difference is often irrelevant by objective standards.
Selective perception bias: The tendency for expectations to affect perception.
Status-quo bias: If no special action is taken, the default action that will happen is that the code will go live. You will need an especially compelling reason to override this bias and manually stop the code from going live, as it would by default.
Slow-motion bias: We feel certain that we are more careful and less risky when we slow down. This is precisely the opposite of the real world risk factors for shipping software. Slow is dangerous for software; speed is safety. The more frequently you ship code, the smaller the diffs you ship, the less dangerous each one actually becomes. This is the most powerful and difficult to overcome of all of our biases, because there is no readily available counter-metaphor for us to use. (Riding a bike is the best I’ve come up with. 😔)
Surrogation: Losing sight of the strategic construct that a measure is intended to represent, and subsequently acting as though the measure is the construct of interest.
Time-saving bias: Underestimations of the time that could be saved (or lost) when increasing (or decreasing) from a relatively low speed and overestimations of the time that could be saved (or lost) when increasing (or decreasing) from a relatively high speed.
Zero-risk bias: Preference for reducing a small risk to zero over a greater reduction in a larger risk.
I’ve fallen way behind on my blog posts — my goal was to write one per month, and I haven’t published anything since MAY. Egads. So here I am dipping into the drafts archives! This one was written in April of 2016, when I was noodling over my CraftConf 2016 talk on “DevOps for Developers” (see slides).
So I got to the part in my talk where I’m talking about how to interview and hire software engineers who aren’t going to burn the fucking house down, and realized I could spend a solid hour on that question alone. That’s why I decided to turn it into a blog post instead.
Stop telling ops people to code better, start telling SWEs to ops better
Our industry has gotten very good at pressing operations engineers to get better at writing code, writing tests, and software engineering in general these past few years. Which is great! But we have not been nearly so good at pushing software engineers to level up their systems skills. Which is unfortunate, because it is just as important.
Most systems suffer from the syndrome of running too much software. Tossing more software into the heap is as likely to cause problems as it is to solve them.
We see this play out at companies stacked with good software engineers who have built horrifying spaghetti messes of their infrastructure, and then commence paging themselves to death.
The only way to unwind this is to reset expectations, and make it clear that
you are still responsible for your code after it’s been deployed to production, and
operational excellence is everyone’s job.
Operations is the constellation of tools, practices, policies, habits, and docs around shipping value to users, and every single one of us needs to participate in order to do this swiftly and safely.
Every software engineering interviewing loop should have an ops component.
Nobody interviews candidates for SRE or ops nowadays without asking some coding questions. You don’t have to be the greatest programmer in the world, but you can’t be functionally illiterate. The reverse is less common: asking software engineers basic, stupid questions about the lifecycle of their code, instrumentation best practices, etc.
It’s common practice at lots of companies now to have a software engineer in the loop for hiring SREs to evaluate their coding abilities. It should be just as common to have an ops engineer in the loop for a SWE hire, especially for any SWE who is being considered for a key senior position. Those are the people you most rely on to be mentors and role models for junior hires. All engineers should embrace the ethos of owning their code in production, and nobody should be promoted or hired into a senior role if they don’t.
And yes, that means all engineers! Even your iOS/Android engineers and website developers should be interested in what happens to their code after they hit deploy. They should care about things like instrumentation, and what kind of data they may need later to debug their problems, and how their features may impact other infrastructure components.
You need to balance out your software engineers with engineers who don’t react to every problem by writing more code. You need engineers who write code begrudgingly, as a last resort. You’ll find these priceless gems in ops and SRE.
ops questions for software engineers
The best questions are broad and start off easy, with plenty of reasonable answers and pathways to explore. Even beginners can give a reasonable answer, while experts can go on talking for hours.
For example: give them the specs for a new feature, and ask them to talk through the infrastructure choices and dependencies to support that feature. Do they ask about things like which languages, databases, and frameworks are already supported by the team? Do they understand what kind of monitoring and observability tools to use? Do they ask about local instrumentation best practices?
Or design a full deployment pipeline together. Probe what they know about generating artifacts, versioning, rollbacks, branching vs master, canarying, rolling restarts, blue/green deploys, etc. How might they design a deploy tool? Talk through the tradeoffs.
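To make “talk through the tradeoffs” concrete, here is the kind of sketch a candidate might whiteboard for the canary step. Everything in it is hypothetical (the host names, the deploy() and error_rate() helpers, the 1% threshold); the point is seeing whether they reason about bake time, rollback, and what “healthy” even means.

```python
import time

CANARY_HOSTS = ["app-canary-1"]                  # hypothetical hosts
FLEET_HOSTS = [f"app-{i}" for i in range(2, 20)]
ERROR_BUDGET = 0.01                              # hypothetical: tolerate up to 1% errors
BAKE_SECONDS = 300                               # let real traffic soak the canary first

def deploy(hosts, version):
    """Placeholder: push the artifact for `version` to each host and restart it."""
    for host in hosts:
        print(f"deploying {version} to {host}")

def error_rate(hosts):
    """Placeholder: ask your monitoring/observability tooling for the error rate."""
    return 0.0

def canary_deploy(version, previous_version):
    deploy(CANARY_HOSTS, version)
    time.sleep(BAKE_SECONDS)                     # give the canary time to see traffic
    if error_rate(CANARY_HOSTS) > ERROR_BUDGET:
        deploy(CANARY_HOSTS, previous_version)   # roll the canary back, stop the line
        return False
    deploy(FLEET_HOSTS, version)                 # promote to the rest of the fleet
    return True
```

A candidate who immediately starts poking at the assumptions (what counts as an error? is five minutes of bake time enough? what happens to in-flight requests during the restart?) is giving you exactly the signal you want.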
Some other good starting points:
“Tell me about the last time you caused a production outage. What happened, how did you find out, how was it resolved, and what did you learn?”
“What are some of your favorite tools for visibility, instrumentation, and debugging?”
“Latency seems to have doubled over the last 6 hours. Where do you start looking, how do you start debugging?”
And this chestnut: “What happens when you type ‘google.com’ into a web browser?” You would be fucking *astonished* how many senior software engineers don’t know a thing about DNS, HTTP, SSL/TLS, cookies, TCP/IP, routing, load balancers, web servers, proxies, and on and on.
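If you want a sense of how much is packed into that one question, here is a rough Python sketch of just the first few layers (DNS, TCP, TLS, then a bare HTTP request), before we even get to cookies, proxies, load balancers, caches, or rendering:

```python
import socket
import ssl

host = "google.com"

# 1. DNS: resolve the hostname to an IP address.
ip = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)[0][4][0]

# 2. TCP: open a connection to port 443.
sock = socket.create_connection((ip, 443), timeout=5)

# 3. TLS: negotiate an encrypted session and verify the server's certificate.
ctx = ssl.create_default_context()
tls = ctx.wrap_socket(sock, server_hostname=host)    # SNI + certificate checks

# 4. HTTP: send a request and read the response status line.
tls.sendall(b"GET / HTTP/1.1\r\nHost: google.com\r\nConnection: close\r\n\r\n")
print(tls.recv(4096).split(b"\r\n", 1)[0])           # e.g. b'HTTP/1.1 301 Moved Permanently'
tls.close()
```

Every single line there is a place where a senior engineer should be able to tell you what can go wrong and how they’d notice.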
Another question I really like is: “what’s your favorite API (or database, or language) and why?” followed up by “… and what are the worst things about it?” (True love doesn’t mean blind worship.)
Remember, you’re exploring someone’s experience and depth here, not giving them a pass-fail quiz. It’s okay if they don’t know it all. You’re also evaluating them on communication skills, which is severely underrated by most people but is actually a key technical skill.
Signals to look for
You’re not looking for perfection. You are teasing out signals for things like, how will this person perform on a team where software engineers are expected to own their code? How much do they know about the world outside the code they write themselves? Are they curious, eager, and willing to learn, or fearful, incurious and begrudging?
Do they expect networks to be reliable? Do they expect databases to respond, retries to succeed? Are they offended by the idea of being on call? Are they overly clever or do they look to simplify? (God, I hate clever software engineers 🙃.)
It’s valuable to get a feel for an engineer’s operational chops, but let’s be clear, you’re doing this for one big reason: to set expectations. By making ops questions part of the interview, you’re establishing from the start that you run an org where operations is valued, where ownership is non-optional. This is not an ivory tower where software engineers can merrily git push and go home for the day and let other people handle the fallout.
It can be toxic when you have an engineer who thinks all ops work is toil and operations engineering is lesser-than. It tends to result in operations work being done very poorly. This is your best chance to let those people self-select out.
You know what, I’m actually feeling uncharacteristically optimistic right now. I’m remembering how controversial some of this stuff was when I first wrote it, five years ago in 2016. Nowadays it just sounds obvious. Like table stakes.
What should one pay for observability? What should your observability stack cost? What should be in your observability stack?
How much observability is enough? How much is too much, or is there such a thing?
Is it better to pay for one product that claims (dubiously) to do everything, or twenty products that are each optimized to do a different part of the problem super well?
It’s almost enough to make a busy engineer say “Screw it, I’m spinning up Nagios”.
(Hey, I said almost.)
All of these service providers can give you sticker shock when you begin investigating them. The biggest reason is always that we aren’t used to considering the price of our own time. We act like it’s “free” to just take an hour and spin something up … we don’t count the cost of maintenance, context switching, and opportunity costs of not using the time to build something of business value. Which is both understandable and forgivable, as a starting point.
Considerably less forgivable is the vagueness–and sometimes outright misdirection and scare tactics–some vendors offer around pricing. It’s not ok for a business to optimize for revenue at the expense of user experience. As users, we have the right to demand transparency and accurate information. As vendors, we have the responsibility to provide it. Any pricing scheme that doesn’t align with best practices and users’ interests will be a drag on reputation and growth.
The core question, rarely addressed outright, is: how much should you pay? In this post I’ll talk about what your observability costs include, and in the next post, what you should consider including in your “observability stack”.
But I’ll give you the answer to your question right off the bat: you should probably spend 20-30% of infra costs on observability.
O11y spend should be 20-30% of infra spend
Rule of thumb: your observability spend should come to 20-30% of your infra spend. (I’ve seen 10% a few times from reasonable-seeming shops, but they have been edge cases and outliers. I have also seen 50% or more, but again, outliers.)
Full disclosure: this isn’t based on any particular science. It’s just based on my experience of 15+ years working in operations engineering, talking to other engineers and managers, and a couple of informal Twitter polls to satisfy my own curiosity.
Nevertheless, it’s a pretty solid rule. There are exceptions, but in general, if you’re spending less than 20%, you’re “saving money” at the expense of engineering time, or being silently dragged underwater by a million little time leaks and quality of service issues — which you could eliminate completely with a bit of investment.
Consider the person who told me proudly that his o11y spend was just 1-3%. (He meant the PagerDuty bill and Pingdom checks, actually.) He wasn’t counting the dedicated hardware for their ELK cluster (80k/month), or the 2-3 extra engineers they had to recruit, hire, and train (250-300k/year apiece) to run the many open source tools they got for “free”.
And ultimately, it didn’t meet their needs very well. Few people knew how to use it, so they leaned on the “observability team” to craft custom views, write scripts and ETL one-offs, and serve as the institutional hive mind and software usability tutors. They could have used better tools, ones under active development by large product teams. They could have used that headcount to create core business value instead.
Engineers cost money
Engineers are expensive. Recruiting them is hard. The good ones are increasingly unwilling to waste time on unnecessary labor. This manager was “saving” maybe a million dollars a year (he mentioned a vendor quote of less than 100k/month)–but spending a couple million more than that in less-visible ways.
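If you do the back-of-the-envelope math on just the line items he mentioned, the “savings” evaporate before you even get to the invisible costs. (The midpoints below are mine; the underlying figures are the ones he quoted.)

```python
# Rough math on the "1-3%" shop above, per year.
elk_hardware    = 80_000 * 12     # dedicated ELK cluster at $80k/month  -> $960,000
extra_engineers = 2.5 * 275_000   # 2-3 engineers at $250-300k apiece    -> ~$687,500

hidden_spend = elk_hardware + extra_engineers
vendor_quote = 100_000 * 12       # the "< $100k/month" quote he was proudly avoiding

print(f"hidden cost of 'free' tooling: ~${hidden_spend:,.0f}/year")   # ~$1,647,500
print(f"vendor quote he turned down:    ${vendor_quote:,.0f}/year")   # $1,200,000
```

And that’s before counting the opportunity cost of all that engineering time, which is where the “couple million more” comes from.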
Worse, he was driving his engineering org into the ground by wasting so much of their time and energy on non-mission-critical work, inferior tooling, one-offs, frustrating maintenance work, etc, all of which had nothing to do with their core business value.
If you want to know if an org hires and retains good engineers, you could do worse than to ask the question: “What tools do you use, and why?”
Good orgs use good tools. They know engineering cycles are their scarcest and most valuable resource, and they want to train maximum firepower on their core business problems.
Mediocre orgs use mediocre tools, have no discipline or consistency around adoption and deprecation, and leak lost engineering cycles everywhere.
So back to our rule of thumb: observability amounting to 20-30% of infra spend is where most shops should fall. This refers to cloud-native infrastructure, using third-party services to instrument and monitor code, with the basics covered — resource utilization graphs, end to end checks, paging, etc.
So, what do I need in my “observability stack”?
What are the basics? Well, obviously “it depends”. It depends on your requirements, your components, your commitments, your budget, sunk costs and skill sets, your teams, and most expensive of all — customer expectations and the cost of violating them. You should think carefully about these things and try to draw a straight line from the business case to the money you spend (or don’t spend). And don’t forget to factor in those invisible human costs.
every dashboard is a sunk cost
every dashboard is an answer to some long-forgotten question
every dashboard is an invitation to pattern-match the past instead of interrogate the present
every dashboard gives the illusion of correlation
every dashboard dampens your thinking

https://t.co/OIEowa1COa
… which stirred up some Feelings for many people. 🙃 So I would like to explain my opinions in more detail.
Static vs dynamic dashboards
First, let’s define the term. When I say “dashboard”, I mean STATIC dashboards, i.e. collections of metrics-based graphs that you cannot click on to dive deeper or break down or pivot. If your dashboard supports this sort of responsive querying and exploration, where you can click on any graph to drill down and slice and dice the data arbitrarily, then breathe easy — that’s not what I’m talking about. Those are great. (I don’t really consider them dashboards, but I have heard a few people refer to them as “dynamic dashboards”.)
Actually, I’m not even “against” static dashboards. Every company has them, including Honeycomb. They’re great for getting a high level sense of system functioning, and tracking important stats over long intervals. They are a good starting point for investigations. Every company should have a small, tractable number of these which are easily accessible and shared by everyone.
Debugging with dashboards: it’s a trap
What dashboards are NOT good at is debugging, or understanding or describing novel system states.
I can hear some of you now: “But I’ve debugged countless super-hard unknown problems using only static dashboards!” Yes, I’m sure you have. If all you have is a hammer, you CAN use it to drive screws into the wall, but that doesn’t mean it’s the best tool. And it takes an extraordinary amount of knowledge and experience to be able to piece together a narrative that translates low-level system statistics into bugs in your software and back. Most software engineers don’t have that kind of systems experience or intuition…and they shouldn’t have to.
Why are dashboards bad for debugging? Think of it this way: every dashboard is an answer to a question someone asked at some point. Your monitoring system is probably littered with dashboards, thousands and thousands of them, most of whose questions have been long forgotten and many of whose source data streams have long since gone silent.
So you come along trying to investigate something, and what do you do? You start skimming through dashboards, eyes scanning furiously, looking for visual patterns — e.g. any spikes that happened around the same time as your incident. That’s not debugging, that’s pattern-matching. That’s … eyeball racing.
if we did math like we do dashboards
Imagine you’re in a math competition, and you get handed a problem to solve. But instead of pulling out your pencil and solving the equation, step by step, you start hollering out guesses.
“27!”
“19992.41!”
“1/4325!”
That’s what flipping through dashboards feels like to me. You’re riffling through a bunch of graphs that were relevant to some long-ago situation, without context or history, without showing their work. Sometimes you’ll spot the exact scenario, and — huzzah! — the number you shout is correct! But when it comes to unknown scenarios, the odds are not in your favor.
Debugging looks and feels very different from flipping through answers. You ask a question, examine the answer, and ask another question based on the result. (“Which endpoints were erroring? Are all of the requests erroring, or only some? What did they have in common?”, etc.)
You methodically put one foot in front of the other, following the trail of bread crumbs, until the data itself leads you to the answer.
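Here’s roughly what that looks like in code against a pile of events you can actually query. The events list and its field names are made up; the point is that each answer shapes the next question.

```python
from collections import Counter

# Hypothetical wide events, one per request, with whatever fields you captured.
events = [
    {"status": 500, "endpoint": "/payment", "error": "db timeout", "build": "v412"},
    {"status": 200, "endpoint": "/home",    "error": None,         "build": "v412"},
    # ... thousands more
]

# Q1: which endpoints are erroring?
errors = [e for e in events if e["status"] >= 500]
print(Counter(e["endpoint"] for e in errors).most_common(5))

# Q2: mostly /payment, it turns out. Are ALL /payment requests failing, or only some?
payment = [e for e in events if e["endpoint"] == "/payment"]
failing = [e for e in payment if e["status"] >= 500]
print(f"{len(failing)} of {len(payment)} /payment requests failing")

# Q3: what do the failing ones have in common?
print(Counter(e["error"] for e in failing).most_common(5))
print(Counter(e["build"] for e in failing).most_common(5))
```

Each query only exists because of the answer to the one before it. That’s the trail of bread crumbs.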
The limitations of metrics and dashboards
Unfortunately, you cannot do that with metrics-based dashboards, because you stripped away the connective tissue of the event back when you wrote the metrics out to disk.
If you happened to notice while skimming through dashboards that your 404 errors spiked at 14:03, and your /payment and /import endpoints started erroring at 14:03, and your database started returning a bunch of mysql errors shortly after 14:00, you’ll probably assume that they’re all related and leap to find more evidence that confirms it.
But you cannot actually confirm that those events are the same ones, not with your metrics dashboards. You cannot drill down from errors to endpoints to error strings; for that, you’d need a wide structured data blob per request. Those might in fact be two or three separate outages or anomalies happening at the same time, or just the tip of the iceberg of a much larger event, and your hasty assumptions might extend the outage for much longer than was necessary.
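The difference in the underlying data is stark. Here’s a toy comparison (the field names and values are illustrative, not any particular vendor’s schema):

```python
# What a metrics store keeps: a name, a timestamp, and a number.
# The request that produced it is gone forever.
metric = ("http.errors.404", "2021-07-14T14:03:00Z", 312)

# What a wide structured event keeps: one record per request, carrying every
# dimension you might later want to correlate.
event = {
    "timestamp":   "2021-07-14T14:03:07.112Z",
    "endpoint":    "/payment",
    "status_code": 404,
    "error":       "mysql: connection refused",
    "db_host":     "mysql-primary-2",
    "build_id":    "v4.1.2",
    "trace_id":    "a9f3c1d0",
    "duration_ms": 1843,
}

# With events you can ask whether the 404 spike, the /payment errors, and the
# mysql errors share a db_host or a build_id. With the metric tuple, you can't.
```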
With metrics, you tend to find what you’re looking for. You have no way to correlate attributes between requests or ask “what are all of the dimensions these requests have in common?”, or to flip back and forth and look at the request as a trace. Dashboards can be fairly effective at surfacing the causes of problems you’ve seen before (raise your hand if you’ve ever been in an incident review where one of the follow up tasks was, “create a dashboard that will help us find this next time”), but they’re all but useless for novel problems, your unknown-unknowns.
Other complaints about dashboards:
They tend to have percentiles like 95th, 99th, 99.9th, 99.99th, etc., which can cover over a multitude of sins. You really want a tool that allows you to see MAX and MIN, and heatmap distributions. (There’s a tiny illustration of this just after the list.)
A lot of dashboards end up getting created that are overly specific to the incident you just had — naming specific hosts, etc — which just creates clutter and toil. This is how your dashboards become that graveyard of past outages.
The most useful approach to dashboards is to maintain a small set of them; cull regularly, and think of them as a list of starter queries for your investigations.
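On that first complaint, here is a tiny, made-up illustration of how percentile graphs can hide a genuinely bad experience that a MAX line or a heatmap would show you instantly:

```python
# 1,000 requests: 995 take ~100ms, five take 30 seconds.
latencies_ms = sorted([100] * 995 + [30_000] * 5)

p95 = latencies_ms[int(0.95 * len(latencies_ms)) - 1]
p99 = latencies_ms[int(0.99 * len(latencies_ms)) - 1]

print(f"p95 = {p95}ms, p99 = {p99}ms")   # both report 100ms -- everything looks fine
print(f"max = {max(latencies_ms)}ms")    # 30000ms -- five users are having a terrible time
```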
“I like to compare the dashboards to the big display in a hospital room: heartbeat, pressure, oxygenation, etc. Those can tell you when a thing is wrong, but the context around the patient chart (and the patient themselves) is what allows interpretation to be effective. If all we have is the display but none of the rest, we’re not getting anywhere close to an accurate picture. The risk with the dashboard is having the metrics but not seeing or knowing about the rest changing.”
In conclusion
Dashboards aren’t universally awful. The overuse of them just encourages sloppy thinking, and static ones make it impossible for you to follow the plot of an outage, or validate your hypotheses. 🤒 There’s too many of them, and not enough shared consensus. (It would help if, like, new dashboards expired within a month if nobody looked at them again.)
If what you have is “nothing”, even shitty dashboards are far better than no dashboards. But shitty dashboards have been the only game in town for far too long. We need more vendors to think about building for queryability, explorability, and the ability to follow a trail of breadcrumbs. Modern systems are going to demand more and more of this approach.
Nothing < Dashboards < a Queryable, Exploratory Interface
If everyone out there who slaps “observability” on their web page also felt the responsibility to add an observability-enabling interface to their tool, one that would let users explore and identify unknown-unknowns, we would all be in a far better place. 🙂