Is it ethical to discriminate in whom you will sell to as a business? What would you do if you found out that the work you do every day was being used to target and kill migrants at the border?
Is it ethical or defensible to pay two people doing the same job different salaries if they live in different locations and have a different cost of living? What if paying everyone the same rate means you are outcompeted by those who peg salaries to local rates, because they can vastly out-hire you?
You’re at the crowded hotel bar after a company-sponsored event, and one of your most valued customers begins loudly venting opinions about minorities in tech that you find alarming and abhorrent. What responsibility do you have, if any? How should you react?
If we were close to running out of money in the hypothetical future, should we do layoffs or offer pay cuts?
It’s not getting any simpler to live in this world, is it? 💔
Ethical problems are hard. Even the ones that seem straightforward on the face of them get stickier the closer you look at them. There are more stakeholders, more caveats, more cautionary tales, more unintended consequences than you can generally see at face value. It’s like fractal hardness, and anyone who thinks it’s easy is fooling themselves.
We’ve been running an experiment at Honeycomb for the past 6 months, where we talk through hypothetical ethical questions like these once a month. Sometimes they are ripped from the headlines, sometimes they are whatever I can invent the night before. I try to send them around in advance. The entire company is invited.**
Honeycomb is not a democracy, nor do I think that would be an effective way to run a company, any more than I think we should design our SDKs by committee or give everyone an equal vote on design mocks.
But I do think that we have a responsibility to act in the best interests of our stakeholders, to the best of our abilities, and to represent our employees. And that means we need to know where the team stands.
That’s one reason. Another is that people make the worst possible decisions when they’re taken off guard, when they are in an unfamiliar situation (and often panicking). Talking through a bunch of nightmare scenarios is a way for us to exercise these decision-making muscles while the stakes are low. We all get to experience what it’s like to hear a problem, have a kneejerk reaction … then peel back the onion to reveal layer after layer of dismaying complexities that muddy our snap certainties.
Honeycomb is a pretty transparent company; we believe that companies are created every day by the people who show up to labor together, so those people have a right to know most things. But it’s not always possible or ethically desirable to share all the gritty details that factor into a decision. My hope is that these practice runs help amplify employees’ voices, help them understand the way we approach big decisions, and help everyone make better decisions — and trust each other’s decisions — when things move fast and times get hard.
(Plus, these ethical puzzles are astonishingly fun to work through together. I highly recommend you borrow this idea and try it out at your own company.)
cheers, and please let me know if you do try it ☺️
** We used to limit attendance to the first 6 people to show up, to try and keep the discussion more authentic and less performative. We recently relaxed this rule since it doesn’t seem to matter; peacocking hasn’t really been an issue.
Last night I was talking with Mark Ferlatte about the advice we have given our respective companies in this pandemic era. He shared with me this link, on how to salvage a disastrous day. It’s a good link: you should read it.
My favorite part: “Your feelings will follow your actions. Just do it.”
The hardest part for me is, “Book-end your day. Don’t push it into the midnight hours.” Ugh. I really, really struggle with this because my brain takes a long long time to settle in and get started on a task to the point where I feel like I’m on a roll with it, and once I’m on a roll I do not want to stop until I’m done. Because god knows how long it will be — days? weeks?? — until I can catch this wave again, feel inspired again. But it’s true, if I stay up all night working I’m just setting myself up for a fuzzy, blundery tomorrow.
The advice we gave Honeycombers was differently shaped, though similar in spirit. I’ve had a few people ask me to share it, so here it is.
We formally request …
First, we would like to point out that what you are all being asked to do right now is impossible. Parenting, homeschooling, working, caregiving, correcting misinformed neighbors, being an engaged citizen … it is fifteen people’s worth of work. It is literally impossible.
But hey, it has always been impossible. We have never been able to do everything we want to do — there isn’t enough time. There was never enough time! We succeed as a company not by doing everything on our list, but by saying no to the right things; by NOT-doing most things so we can focus on the few things we have identified that matter most. That was true before COVID, it’s just truer now.
So: let’s all focus hard on our top priority. Shed as much of the other stuff as you have to. Shed more. Ask your manager for help figuring out what to shed, until you are down to an amount you can probably manage.
And speaking of focus:
You aren’t operating at full capacity. We all get that right now: none of us are. And nobody expects you to. So please spend zero energy on performing like you’re doing work, or acting extra-responsive, or keeping up a front like things are normal and you’re doing fine. That performance costs you precious energy, while doing nothing to get us closer to our goals.
What we need from you is not performance or busy-busy-ness but your engaged creative self — your active, curious mind engaging with our top problem. I would rather have 30 minutes of your creative energy applied to our biggest problem today than five hours of your distracted split-brain: juggling, trying to keep up with chat, and seeming as available as usual.
So when you’re figuring out your schedule, please optimize for that — focused time on our biggest problem — and then communicate your availability to your team. If you’re a parent and you can only really work three days a week, calendar that. (If you’re not a parent, remember that you too are allowed to feel overwhelmed and underwater. Just because some have it even harder, doesn’t invalidate what you’re going through.)
Take care of yourself
Take care of your loved ones
Say no to as much as you possibly can
Focus on impact
No performative normalcy
Remember: this is temporary 🖤
We are incredibly fortunate — to be here, to have these resources, to have each other. It’s okay to have bad days; this is why we have teams, to carry each other through the hardest spots. Do your best. Everything is going to be okay, more or less.
Last Wednesday I walked into my living room and saw three gay rednecks in hot pink shirts being married as a “throuple” on a TV screen at close range, followed by one of the grooms singing a country song about a woman feeding her husband’s remains to her tigers.
In Blood Rites, Ehrenreich asks why we sacralize war. Not why we fight wars, or why we are violent necessarily, but why we are drawn to the idea of war, why we compulsively imbue it with an aura of honor and noble sacrifice. If you kill one person, you’re a murderer and we shut you out from society; kill ten and you are a monster; but if you kill thousands, or kill on behalf of the state, we give you medals and write books about you.
And it’s not only about scale or being backed by state power. The calling of war brings out the highest and finest experiences our species can know: it sings of heroism and altruism, of discipline, self-sacrifice, common ground, a life lived well in service; of belonging to something larger than one’s self. Even if, as generations of weary returning soldiers have told us, it remains the same old butchery on the ground, the near-religious allure of war is never dented for long in the popular imagination.
What the fuck is going on?
Ehrenreich is impatient with the traditional scholarship, which locates the origin of war in some innate human aggression or turf wars over resources. She is at her dryly funniest when dispatching feminist theories about violence being intrinsically male or “testosterone poisoning”, showing that the bloodthirstiest of the gods have usually been feminine. (Although there are fascinating symmetries between girls becoming women through menstruation, and boys becoming men through … some form of culturally sanctioned ritual, usually involving bloodshed.)
Rather, she shows that our sacred feelings towards blood shed in war are the direct descendants of our veneration of blood shed in sacrifice — originally human sacrifice, and other animal sacrifice — in a reenactment of our own ever-so-recent role inversion from prey to predator. Prehistoric sacrifice was likely a way of exerting control over our environment and reenacting the death that gave us life through food.
In her theory, humans do not go to war because we are natural predators. Just the blink of an eye ago, on an evolutionary scale, humans were not predators by any means: we were prey. Weak, blind, deaf, slow, clawless and naked; we scrawny, clever little apes were easy pickings for the many large carnivores who roamed the planet. We scavenged in the wake of predators and worshiped them as gods. We are the nouveaux riches of predators, constantly re-asserting our dominance to soothe our insecurities.
We go to war not because we are predators, in other words, but because we are prey — and this makes us very uncomfortable! War exists as a vestigial relic of when we venerated the shedding of blood and found it holy — as anyone who has ever opened the Old Testament can attest. It was not until the Axial Age that religions of the world underwent a wholesale makeover into a less bloody, more universalistic set of aspirations.
When I first read this book, years ago, I remember picking it up with a roll of the eyes. “Sounds like some overly-metaphorical liberal academic nonsense” or something like that. But I was hooked within ten pages, my mind racing ahead with even more evidence than she marshals in this lively book. It shifted the way I saw many things in the world.
Like horror movies, for example. Or why cannibalism is so taboo. How Jesus became the Son of God, the Brothers Grimm, the sacrament of Communion. The primal fear of being food still resonates through our culture in so many sublimated ways.
And whether what you’re watching is “Tiger King” or the Tiger-King-watchers, it will make A LOT more sense after reading this book too.
Stay safe and don’t kill each other,
Ehrenreich is best known for “Nickel and Dimed”, her stunning book on the precariousness of low-wage work in America, where she tried to subsist for a year only on whatever work she could get with a high school education. Ehrenreich is a journalist, and Blood Rites is a piece of science journalism, not scientific research; yet it is well-researched and scrupulously cited, and it’s worth noting that she has a PhD in biology and was once a practicing scientist.
First of all, confusion over terminology is understandable, because there are some big players out there actively trying to confuse you! Big Monitoring is indeed actively trying to define observability down to “metrics, logs and traces”. I guess they have been paying attention to the interest heating up around observability, and well… they have metrics, logs, and tracing tools to sell? So they have hopped on the bandwagon with some undeniable zeal.
But metrics, logs and traces are just data types — which, by themselves, have nothing to do with observability. Let me explain the difference, and why I think you should care about this.
“Observability? I do not think it means what you think it means.”
Observability is a borrowed term from mechanical engineering/control theory. It means, paraphrasing: “can you understand what is happening inside the system — can you understand ANY internal state the system may get itself into, simply by asking questions from the outside?” We can apply this concept to software in interesting ways, and we may end up using some data types, but that’s putting the cart before the horse.
It’s a bit like saying that “database replication means structs, longints and elegantly diagrammed English sentences.” Er, no.. yes.. missing the point much?
This is such a reliable bait and switch that any time you hear someone talking about “metrics, logs and traces”, you can be pretty damn sure there’s no actual observability going on. If there were, they’d be talking about that instead — it’s far more interesting! If there isn’t, they fall back to talking about whatever legacy products they do have, and that typically means, you guessed it: metrics, logs and traces.
Metrics in particular are actually quite hostile to observability. They are usually pre-aggregated, which means you are stuck with whatever questions you defined in advance, and even when they aren’t pre-aggregated they permanently discard the connective tissue of the request at write time, which destroys your ability to correlate issues across requests or track down any individual requests or drill down into a set of results — FOREVER.
Which doesn’t mean metrics aren’t useful! They are useful for many things! But they are useful for things like static dashboards, trend analysis over time, or monitoring that a dimension stays within defined thresholds. Not observability. (Liz would interrupt here and say that Google’s observability story involves metrics, and that is true — metrics with exemplars. But this type of solution is not available outside Google as far as we know..)
Ditto logs. When I say “logs”, you think: unstructured strings, written out to disk haphazardly during execution; “many” log lines per request; probably 1–5 dimensions of useful data per log line; probably a schema and some defined indexes for searching. Logs are at their best when you know exactly what to look for; then you can go and find it.
Again, these connotations and assumptions are the opposite of observability’s requirements. Observability deals with highly structured data only: data usually generated by instrumentation deep within the app, generally not buffered to local disk, issued as a single event per request per service, schemaless and indexless (or with inferred schemas and autoindexing), and typically containing hundreds of dimensions per event.
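To make the contrast concrete, here’s a toy sketch of the two shapes of data. All the field names and values below are invented for illustration — they aren’t any particular vendor’s schema:

```python
# A typical unstructured log line: one string among "many" per request,
# carrying only whatever someone thought to interpolate into it.
log_line = "WARN 2020-04-01T12:03:44 slow query for user 1234 took 4012ms"

# A wide structured event: ONE dict per request per service, with every
# dimension you know about the request attached as a queryable field.
event = {
    "timestamp": "2020-04-01T12:03:44Z",
    "service": "image-export",      # hypothetical example values
    "endpoint": "/export",
    "user_id": 1234,
    "build_id": "789782",
    "shard": "us-east-3",
    "duration_ms": 4012,
    "status_code": 200,
    # ...typically hundreds more dimensions
}

# The structured event can be sliced by ANY field after the fact; the
# log line would have to be parsed first, and only ever contains the
# handful of values baked into the format string.
assert event["user_id"] == 1234
assert "user 1234" in log_line
```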
Traces? Now we’re getting closer. Tracing IS a big part of observability, but tracing just means visualizing events in order by time. It certainly isn’t and shouldn’t be a standalone product, that just creates unnecessary friction and distance. Hrmm … so what IS observability again, as applied to the software domain??
As a reminder, observability applied to software systems means having the ability to ask any question of your systems — understand any user’s behavior or subjective experience — without having to predict that question, behavior or experience in advance.
At its core, observability is about unknown-unknowns.
Plenty of tools are terrific at helping you ask the questions you could predict wanting to ask in advance. That’s the easy part. “What’s the error rate?” “What is the 99th percentile latency for each service?” “How many READ queries are taking longer than 30 seconds?”
Monitoring tools like DataDog do this — you predefine some checks, then set thresholds that mean ERROR/WARN/OK.
Logging tools like Splunk will slurp in any stream of log data, then let you index on questions you want to ask efficiently.
APM tools auto-instrument your code and generate lots of useful graphs and lists like “10 slowest endpoints”.
But if you *can’t* predict all the questions you’ll need to ask in advance, or if you *don’t* know what you’re looking for, then you’re in o11y territory.
This can happen for infrastructure reasons — microservices, containerization, polyglot storage strategies can result in a combinatorial explosion of components all talking to each other, such that you can’t usefully pre-generate graphs for every combination that can possibly degrade.
And it can happen — has already happened — to most of us for product reasons, as you’ll know if you’ve ever tried to figure out why a spike of errors was being caused by users on ios11 using a particular language pack but only in three countries, and only when the request hit the image export microservice running build_id 789782 if the user’s last name starts with “MC” and they then try to click on a particular button which then issues a db request using the wrong cache key for that shard.
Gathering the right data, then exploring the data.
Observability starts with gathering the data at the right level of abstraction, organized around the request path, such that you can slice and dice and group and look for patterns and cross-correlations in the requests.
To do this, we need to stop firing off metrics and log lines willy-nilly and be more disciplined. We need to issue one single arbitrarily-wide event per service per request, and it must contain the *full context* of that request: EVERYTHING you know about it, anything you did in it, all the parameters passed into it, etc. — anything that might someday help you find and identify that request.
Then, when the request is poised to exit or error the service, you ship that blob off to your o11y store in one very wide structured event per request per service.
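As a sketch of that pattern: accumulate one wide event over the life of the request, then emit it once on exit. The `Event` class and its method names here are made up for illustration — real instrumentation libraries differ in the details, but the shape is the same:

```python
import time


class Event:
    """Accumulates the full context of one request in one service."""

    def __init__(self, service):
        self.fields = {"service": service}
        self._start = time.monotonic()

    def add(self, **kwargs):
        # Tack on anything you learn mid-request: params, user ids,
        # cache hits, shard names, feature flags, intermediate timings...
        self.fields.update(kwargs)

    def send(self):
        # When the request is poised to exit or error, ship the single
        # wide event. (A real implementation would POST it to the store;
        # here we just return the blob.)
        self.fields["duration_ms"] = (time.monotonic() - self._start) * 1000
        return self.fields


def handle_request(user_id, params):
    ev = Event("image-export")
    ev.add(user_id=user_id, **params)
    ev.add(cache_hit=False)   # learned during execution
    return ev.send()          # one event per request per service


fields = handle_request(1234, {"endpoint": "/export", "build_id": "789782"})
assert fields["user_id"] == 1234 and "duration_ms" in fields
```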
In order to deliver observability, your tool also needs to support high cardinality and high dimensionality. Briefly, cardinality refers to the number of unique items in a set, and dimensionality means how many adjectives can describe your event. If you want to read more, here is an overview of the space, and here are more technical requirements for observability.
You REQUIRE the ability to chain and filter as many dimensions as you want, with infinitely high cardinality for each one, if you’re going to be able to ask arbitrary questions about your unknown-unknowns. This functionality is table stakes. It is non-negotiable. And you cannot get it from any metrics or logs tool on the market today.
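“Chaining and filtering dimensions” just means arbitrary filter-plus-group-by over those wide events, composed on the fly. A toy version over a handful of in-memory events (the field values are invented; a real store does this at scale, over billions of events):

```python
from collections import Counter

events = [
    {"service": "image-export", "build_id": "789782", "os": "ios11",
     "country": "FR", "status": 500},
    {"service": "image-export", "build_id": "789781", "os": "ios11",
     "country": "FR", "status": 200},
    {"service": "image-export", "build_id": "789782", "os": "ios11",
     "country": "DE", "status": 500},
]

# An ad-hoc question nobody predicted in advance: where are the errors
# coming from, broken down by a high-cardinality dimension (build_id)?
errors = [e for e in events if e["status"] >= 500]
by_build = Counter(e["build_id"] for e in errors)

# Both errors came from one build — and you could just as easily have
# grouped by country, os, user, or any combination of them.
assert by_build == Counter({"789782": 2})
```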
Why this matters.
Alright, this is getting pretty long. Let me tell you why I care so much, and why I want people like you specifically (referring to frontend engineers and folks earlier in their careers) to grok what’s at stake in the observability term wars.
We are way behind where we ought to be as an industry. We are shipping code we don’t understand, to systems we have never understood. Some poor sap is on call for this mess, and it’s killing them, which makes the software engineers averse to owning their own code in prod. What a nightmare.
Meanwhile developers readily admit they waste >40% of their day doing bullshit that doesn’t move the business forward. In large part this is because they are flying blind, just stabbing around in the dark.
We all just accept this. We shrug and say well that’s just what it’s like, working on software is just a shit salad with a side of frustration, it’s just the way it is.
But it is fucking not. It is un fucking necessary. If you instrument your code, watch it deploy, then ask “is it doing what I expect, does anything else look weird” as a habit? You can build a system that is both understandable and well-understood. If you can see what you’re doing, and catch errors swiftly, it never has to become a shitty hairball in the first place. That is a choice.
🌟 But observability in the original technical sense is a necessary prerequisite to this better world. 🌟
If you can’t break down by high cardinality dimensions like build ids, unique ids, requests, and function names and variables, if you cannot explore and swiftly skim through new questions on the fly, then you cannot inspect the intersection of (your code + production + users) with the specificity required to associate specific changes with specific behaviors. You can’t look where you are going.
Observability as I define it is like taking off the blindfold and turning on the light before you take a swing at the piñata. It is necessary, although not sufficient alone, to dramatically improve the way you build software. Observability as they define it gets you to … exactly where you already are. Which of these is a good use of a new technical term?
And honestly, it’s the next generation who are best poised to learn the new ways and take advantage of them. Observability is far, far easier than the old ways and workarounds … but only if you don’t have decades of scar tissue and old habits to unlearn.
The less time you’ve spent using monitoring tools and ops workarounds, the easier it will be to embrace a new and better way of building and shipping well-crafted code.
Observability matters. You should care about it. And vendors need to stop trying to confuse people into buying the same old bullshit tools by smooshing them together and slapping on a new label. Exactly how long do they expect to fool people for, anyway?
Welcome to the second installment of my advice column! Last time we talked about the emotional impact of going back to engineering after a stint in management. If you have a question you’d like to ask, please email me or DM it to me on twitter.
Hi Charity! I hope it’s ok to just ask you this…
I’m trying to get our company more aware of observability and I’m finding it difficult to convince people to look more into it. We currently don’t have the kind of systems that would require it much – but we will in future and I want us to be ahead of the game.
If you have any tips about how to explain this to developers (who are aware that quality is important but don’t always advocate for it / do it as much as I’d prefer), or have concrete examples of “here’s a situation that we needed observability to solve – and here’s how we solved it”, I’d be super grateful.
If this is too much to ask, let me know too 🙂
I’ve been talking to Abby Bangser a lot recently – and I’m “classifying” observability as “exploring in production” in my mental map – if you have philosophical thoughts on that, I’d also love to hear them 🙂
Yay, what a GREAT note! I feel like I get asked some subset or variation of these questions several times a week, and I am delighted for the opportunity to both write up a response for you and post it for others to read. I bet there are orders of magnitude more people out there with the same questions who *don’t* ask, so I really appreciate those who do. <3
I want to talk about the nuts and bolts of pitching to engineering teams and shepherding technical decisions like this, and I promise I will offer you some links to examples and other materials. But first I want to examine some of the assumptions in your note, because they elegantly illuminate a couple of common myths and misconceptions.
Myth #1: you don’t need observability til you have problems of scale
First of all, there’s this misconception that observability is something you only need when you have really super duper hard problems, or that it’s only justified when you have microservices and large distributed systems or crazy scaling problems. No, no no nononono.
There may come a point where you are ABSOLUTELY FUCKED if you don’t have observability, but it is ALWAYS better to develop with it. It is never not better to be able to see what the fuck you are doing! The image in my head is of a hiker with one of those little headlamps on that lets them see where they’re putting their feet down. Most teams are out there shipping opaque, poorly understood code blindly — shipping it out to systems which are themselves crap snowballs of opaque, poorly understood code. This is costly, dangerous, and extremely wasteful of engineering time.
Ever seen an engineering team of 200, and struggled to understand how the product could possibly need more than one or two teams of engineers? They’re all fighting with the crap snowball.
Developing software with observability is better at ANY scale. It’s better for monoliths, it’s better for tiny one-person teams, it’s better for pre-production services, it’s better for literally everyone always. The sooner and earlier you adopt it, the more compounding value you will reap over time, and the more of your engineers’ time will be devoted to forward progress and creating value.
Myth #2: observability is harder and more technically advanced than monitoring
Actually, it’s the opposite — it’s much easier. If you sat a new grad down and asked them to instrument their code and debug a small problem, it would be fairly straightforward with observability. Observability speaks the native language of variables, functions and API endpoints, the mental model maps cleanly to the request path, and you can straightforwardly ask any question you can come up with. (A key tenet of observability is that it gives an engineer the ability to ask any question, without having had to anticipate it in advance.)
With metrics and logging libraries, on the other hand, it’s far more complicated. You have to make a bunch of awkward decisions about where to emit various types of statistics, and it is terrifyingly easy to make poor choices (with terminal performance implications for your code and/or the remote data source). When asking questions, you are locked in to asking only the questions that you chose to ask a long time ago. You spend a lot of time translating the relationships between code and low-level systems resources, and since you can’t break down by users/apps you are blocked from asking the most straightforward and useful questions entirely!
Doing it the old way Is. Fucking. Hard. Doing it the newer way is actually much easier, save for the fact that it is, well, newer — and thus harder to google examples for copy-pasta. But if you’re saturated in decades of old school ops tooling, you may have some unlearning to do before observability seems obvious to you.
Myth #3: observability is a purely technical solution
To be clear, you can just add an observability tool to your stack and go on about your business — same old things, same old way, but now with high cardinality!
You can, but you shouldn’t.
These are sociotechnical systems and they are best improved with sociotechnical solutions. Tools are an absolutely necessary and inextricable part of it. But so are on call rotations and the fundamental virtuous feedback loop of you build it, you run it. So are code reviews, monitoring checks, alerts, escalations, and a blameless culture. So are managers who allocate enough time away from the product roadmap to truly fix deep technical rifts and explosions, even when it’s inconvenient, so the engineers aren’t in constant monkeypatch mode.
I believe that observability is a prerequisite for any major effort to have saner systems, simply because it’s so powerful being able to see the impact of what you’ve done. In the hands of a creative, dedicated team, simply wearing a headlamp can be transformational.
Observability is your five senses for production.
You’re right on the money when you ask if it’s about exploring production, but you could also use words that are even more basic, like “understanding” or “inspecting”. Observability is to software systems as a debugger is to software code. It shines a light on the black box. It allows you to move much faster, with more confidence, and catch bugs much sooner in the lifecycle — before users have even noticed. It rewards you for writing code that is easy to illuminate and understand in production.
So why isn’t everyone already doing it? Well, making the leap isn’t frictionless. There’s a minimal amount of instrumentation to learn (easier than people expect, but it’s nonzero) and then you need to learn to see your code through the lens of your own instrumentation. You might need to refactor your use of older tools, such as metrics libraries, monitoring checks and log lines. You’ll need to learn another query interface and how it behaves on your systems. You might find yourself amending your code review and deploy processes a bit.
Nothing too terrible, but it’s all new. We hate changing our tool kits until absolutely fucking necessary. Back at Parse/Facebook, I actually clung to my sed/awk/shell wizardry until I was professionally shamed into learning new ways when others began debugging shit faster than I could. (I was used to being the debugger of last resort, so this really pissed me off.) So I super get it. Now, let’s talk about how to get your team aligned and hungry for change.
Okay okay okay already, how do I get my team on board?
If we were on the phone right now, I would be peppering you with a bunch of questions about your organization. Who owns production? Who is on call? Who runs the software that devs write? What is your deploy process, and how often does it get updated, and by who? Does it have an owner? What are the personalities of your senior folks, who made the decisions to invest in the current tools (and what are they), what motivates them, who are your most persuasive internal voices? Etc. Every team is different. <3
There’s a virtuous feedback loop you need to hook up and kickstart and tweak here, where the people with the original intent in their heads (software engineers) are also informed and motivated, i.e. empowered to make the changes and personally impacted when things are broken. I recommend starting by putting your software engineers on call for production (if you haven’t). This has a way of convincing even the toughest cases that they have a strong personal interest in quality and understandability.
Pay attention to your feedback loop and the alignment of incentives, and make sure your teams are given enough time to actually fix the broken things, and motivation usually isn’t a problem. (If it is, then perhaps another feedback loop is lacking: your engineers feeling sufficiently aligned with your users and their pain. But that’s another post.)
Technical ownership over technical outcomes
I appreciate that you want your team to own the technical decisions. I believe very strongly that this is the right way to go. But it doesn’t mean you can’t have influence or impact, and particularly in times like this.
It is literally your job to have your head up, scanning the horizon for opportunities and relevant threats. It’s their job to be heads down, focusing on creating and delivering excellent work. So it is absolutely appropriate for you to flag something like observability as both an opportunity and a potential threat, if ignored.
If I were in your situation and wanted my team to check out some technical concept, I might send around a great talk or two and ask folks to watch it, and then maybe schedule a lunchtime discussion. Or I might invite a tech luminary in to talk with the team, give a presentation and answer their questions. Or schedule a hack week to apply the concept to a current top problem, or something else of that nature.
But if I really wanted them to take it fucking seriously, I would put my thumb on the scale. I would find myself a champion, load them up with context, and give them ample time and space to skill up, prototype, and eventually present to the team a set of recommendations. (And I would stay in close contact with them throughout that period, to make sure they didn’t veer too far off course or lose sight of my goals.)
Get a champion.
Ideally you want to turn the person who is most invested in the old way of doing things — the person who owns the ELK cluster, say, or who was responsible for selecting the previous monitoring toolkit, or the go-to person for ops questions — from your greatest obstacle into your proxy warrior. This only works if you know that person is open-minded and secure enough to give it a fair shot & publicly change course, has sufficiently good technical judgment to evaluate and project into the future, and has the necessary clout with their peers. If they don’t, or if they’re too afraid to buck consensus: pick someone else.
Give them context.
Take them for a long walk. Pour your heart and soul out to them. Tell them what you’ve learned, what you’ve heard, what you hope it can do for you, what you fear will happen if you don’t. It’s okay to get personal and to admit your uncertainties. The more context they have, the better the chance they will arrive at an outcome you are happy with. Get them worried about the same things that worry you, get them excited about the same possibilities that excite you. Give them a sense of the stakes.
And don’t forget to tell them why you are picking them: because their peers listen to them, because they are already an expert in the problem area, because you trust their technical judgment and their ability to evaluate new things. All the reasons for picking them translate into the best kind of flattery: the true kind.
Give them a deadline.
A week or two should be plenty. Most likely, the decision is not going to be unilaterally theirs (this also gives you a bit of wiggle room should they come back going “ah no, ELK is great forever and ever”), but their recommendations should carry serious weight with the team and technical leadership. Make it clear what sort of outcome you would be very pleased with (e.g. a trial period for a new service) and what reasons you would find compelling for declining to pursue the project (e.g. the tech is unsupported, cost-prohibitive, etc.). Ideally they should use this time to get real production data into the services they are testing out, so they can actually experience and weigh the benefits, not just read the marketing copy.
As a rule of thumb, I always assume that managers can’t convince engineers to do things: only other engineers can. But what you can do instead is set up an engineer to be your champion. And then just sit quietly in the corner, nodding, with an interested look on your face.
The nuclear option
You have one final option. If there is no appropriate champion to be found, or insufficient time, or if you have sufficient trust with the team that you judge it the right thing to do: you can simply order them to do something your way. This can feel squicky. It’s not a good habit to get into. It usually results in things being done a bit slower, more reluctantly, more half-assedly. And you sacrifice some of your power every time you lean on your authority to get your team to do something.
But it’s just as bad for a leader to take it off the table entirely.
Sometimes you will see things they can’t. If you cannot wield your power when circumstances call for it, then you don’t fucking have real power — you have unilaterally disarmed yourself, to the detriment of your org. You can get away with this maybe twice a year, tops.
But here’s the thing: if you order something to be done, and it turns out in the end that you were right? You earn back all the power you expended on it plus interest. If you were right, unquestionably right in the eyes of the team, they will respect you more for having laid down the law and made sure they did the right thing.
One of my stretch goals for 2019 was to start writing an advice column. I get a lot of questions about everything under the sun: observability, databases, career advice, management problems, what the best stack is for a startup, how to hire and interview, etc. And while I enjoy this, having a high opinion of my own opinions and all, it doesn’t scale as well as writing essays. I do have a (rather all-consuming) day job.
So I’d like to share some of the (edited and lightly anonymized) questions I get asked and some of the answers I have given. With permission, of course. And so, with great appreciation to my anonymous correspondent for letting me publish this, here is one.
I’ve been in tech for 25 years. I don’t have a degree, but I worked my way up from menial jobs to engineering, and since then I have worked on some of the biggest sites in the world. I have been offered a management role many times, but every time I refused. Until about two years ago, when I said “fuck it, I’m almost 40; why not try.”
I took the job with boundless enthusiasm and motivation, because the team was honestly a mess. We were building everything on-prem, and ops was constantly bullying developers over their supposed incompetence. I had gone to conferences, listened to podcasts, and read enough blog posts that my head was full of “DevOps/CloudNative/ServiceOriented/You-build-it-you-run-it/ServantLeaders” idealism. I knew I couldn’t make it any worse, and thought maybe, just maybe, I could even make it better.
Soon after I took the job, though, there were company-wide layoffs. It was not done well, and morale was low and sour. People started leaving for happier pastures. But I stayed. It was an interesting challenge, and I threw my heart and soul into it.
For two years I stayed and ground it out: recruiting (oh, that is so hard), hiring, then starting a migration to a cloud provider, and, with the help of more and more people on the new team, slowly shifting the mindset of the whole engineering group to embrace devops best practices. Now service teams own their code in production and are on call for it, and they migrate themselves to the cloud with my team supporting them and building tools for them. It is almost unrecognizable compared to where we were when I began managing.
A beautiful story isn’t it? I hope you’re still reading. 🙂
Now I have to say that with my schedule full of 1:1s, budgeting, hiring, firing, writing up mission statements and OKRs, shaping the teams, wielding influence, I realized that I enjoyed none of the above. I read your 17 reasons not to be a manager, and I check so many boxes. It is a pain in the ass to constantly listen to people’s egos, talk to them, and keep everybody aligned (which obviously never happens). And of course I am being crushed between top-down on-the-spot business decisions and bottom-up frustration at poorly executed engineering work under deadlines. I am also worn down by the mistrust and power games I am witnessing (or am involved in, sometimes), while I long for collaboration and trust. And of course when things go well my team gets all the praise, and when things go wrong I take all the blame. I honestly don’t know how one can survive without the energy provided by praise and a sense of achievement.
All of the above makes me miss being an IC (Individual Contributor), where I could work for 8 hours straight without talking to anyone, build stuff, say what I wanted when I wanted, switch jobs if I wasn’t happy, and basically be a little shit like the ones you mention in your article.
But when I think about doing it, I get stuck. I don’t know if I would be able to do it again, or if I could still enjoy it. I’ve seen too many things, I’ve tasted what it’s like to be (sometimes) in control, and I did have a big impact on the company’s direction over time. I like that. If I went back to being an IC, I would feel small and meaningless, like just another cog in the machine. And of course, being 40-ish, I will compete with all those 20-something smartasses who were born with kubernetes.
Thank you for reading. Could you give me your thoughts on this? In any case, it was good to get it off my chest.
Holy shitballs! What an amazing story! That is an incredible achievement in just two years, let alone as a rookie manager. You deserve huge props for having the vision, the courage, and the tenacity to drive such a massive change through.
Of COURSE you’re feeling bored and restless. You didn’t set out on a glorious quest for a life of updating mission statements and OKRs, balancing budgets, tending to people’s egos and fluffing their feelings, tweaking job descriptions, endless 1:1s and meetings, meetings, meetings, and the rest of the corporate middle manager’s portfolio. You wanted something much bigger. You wanted to change the world. And you did!
But now you’ve done it. What’s next?
First of all, YOUR COMPANY SUCKS. You don’t once mention your leadership — where are they in all this? If you had a good manager, they would be encouraging you and eagerly lining up a new and bigger role to keep you challenged and engaged at work. They are not, so they don’t deserve you. Fuck em. Please leave.
Another thing I am hearing from you is, you harbor no secret desire to climb the managerial ranks at this time. You don’t love the daily rhythms of management (believe it or not, some do); you crave novelty and mastery and advancement. It sounds like you are willing to endure being a manager, so long as that is useful or required in order to tackle bigger and harder problems. Nothing wrong with that! But when the music stops, it’s time to move on. Nobody should be saddled with a manager whose heart isn’t in the work.
You’re at the two year mark. This is a pivotal moment, because it’s the beginning of the end of the time when you can easily slip back into technical work. It will get harder and harder over the next 2-3 years, and at some point you will no longer have the option.
Picking up another technical role is the most strategic option, the one that maximizes your future opportunities as a technical leader. But you do not seem excited by this option; instead you feel many complex and uncomfortable things. It feels like going backwards. It feels like losing ground. It feels like ceding status and power.
“Management isn’t a promotion, it’s a career change.”
But if management is not a promotion, then going back to an engineering role should not feel like a demotion! What the fuck?!
It’s one thing to say that. Whether it’s true or not is another question entirely, a question of policy and org dynamics. The fact is that in most places, most of the power does go to the managers, and management IS a promotion. Power flows naturally away from engineers and towards managers unless the org actively and vigorously pushes back on this tendency by explicitly allocating certain powers and responsibilities to other roles.
I’m betting your org doesn’t do this. So yeah, going back to being an IC WILL be a step down in terms of your power and influence and ability to set the agenda. That’s going to feel crappy, no question. We humans hate that.
You cannot go back to doing exactly what you did before, for the very simple reason that you are not the same person. You are going to be attuned to power dynamics and ways of influencing that you never were before — and remember, leadership is primarily exercised through influence, not explicit authority. Senior ICs who have been managers are supremely powerful beings, who tend to wield outsize influence. Smart managers will lean on them extensively for everything from shadow management and mentorship to advice, strategy, etc. (Dumb managers don’t. So find a smart manager who isn’t threatened by your experience.)
You’re a short-timer here, remember? Your company sucks. You’re just renewing your technical skills and pulling a paycheck while finding a company that will treat you better, that is more aligned with your values.
Lastly (and most importantly), I have a question. Why did you need to become a manager in order to drive sweeping technical change over the past two years? WHY couldn’t you have done it as a senior IC? Shouldn’t technical people be responsible for technical decisions, and people managers responsible for people decisions? Could this be your next challenge, or part of it? Could you go back to being an engineer, equipped with your shiny new powers of influence and mystical aura of recent management experience, and use it to organize the other senior ICs to assert their rightful ownership over technical decisions? Could you use your newfound clout with leadership and upper management to convince them that this will help them recruit and retain better talent, and is a better way to run a technical org — for everyone?
I believe this is a better way, but I have only ever seen these changes happen when agitated for and demanded by the senior ICs. If the senior ICs don’t assert their leadership, managers are unlikely to give it to them. If managers try, but senior ICs don’t inhabit their power, eventually the managers just shrug and go back to making all the decisions. That is why ultimately this is a change that must be driven and owned — at a minimum co-owned — by the senior individual contributors.
I hope you can push back against that fear of being small and meaningless as an individual contributor. The fact that it very often is this way, especially in strongly hierarchical organizations, does not mean that it has to be this way; and in healthy organizations it is not this way. Command-and-control systems are not conducive to creative flourishing. We have to fight the baggage of the authoritarian structures we inherited in order to make better ones.
Organizations are created afresh each and every day — not created for us, but by us. Help create the organization you want to work at, where senior people are respected equally and have domains of ownership whether they manage people or technology. If your current gig won’t value that labor, find one that will.
They exist. And they want to hire you.
Lots of companies are DYING to hire this kind of senior IC: someone who is still hands-on yet feels responsibility for the team as a whole, who knows the business side, who knows how to mentor and craft a culture, and who can herd cats when necessary.
There are companies that know how to use ICs at the strategic level, even executive level. There are bosses who will see you not as a threat, but as a *huge asset* they can entrust with monumental work.
As a senior contributor who moves fluidly between roles, you are especially well-equipped to help shape a sociotechnical organization. Could you make it your mission to model the kind of relationship you want to see between management and ICs, whichever side you happen to be on? We need more people figuring out how to build organizations where management is not a promotion, just a change of career, and where going back and forth carries no baggage about promotions and demotions. Help us.
And when you figure it out, please don’t keep it to yourself. Expand your influence and share your findings by writing up your experiences in blog posts, in articles, in talks. Tell stories. Show people how much better it is this way. Be so magnificently effective and mysteriously influential as a senior IC that all the baby engineers you work with want to grow up to be just like you.
Hope this helps.
P.S. — Oh and stop fretting about “competing” with the 20-something kubernetes-heads, you dork. You have been learning shit your whole career and you’ll learn this shit too. The tech is the easy part. The tech will always be the easy part. 🙂
Over a year and a half ago, I wrote up a post about the rights and responsibilities due any engineer at Honeycomb. At the time we were in the middle of a growth spurt, had just hired several new engineers, and I was in the process of turning day-to-day engineering management over to Emily. Writing things down helped me codify what I actually cared about, and helped keep us true to our principles as we grew.
Tacked on to the end of the post was a list of manager responsibilities, almost as an afterthought. Many people protested, “don’t managers get any rights??” (and naturally I snapped “NO! hahahahahha”)
I always intended to circle back and write a followup post with the rights and responsibilities for managers. But it wasn’t til recently, as we are gearing up for another hiring spurt and have expanded our managerial ranks, that it really felt like its time had come.
The time has come, the time is now, as Marvin K. Mooney once said. I’ve added the bill of rights, and updated and expanded the list of responsibilities. Thanks to Emily Nakashima for co-writing it with me.
Manager’s Bill of Rights
You shall receive honest, courageous, timely feedback about yourself and your team, from your reports, your peers, and your leaders. (No one is exempt from feeding the hungry hungry feedback hippo! NOO ONNEEEE!) 🦛🦛🦛🦛🦛🦛🦛
Management will be treated with the same respect and importance as individual work.
You have the final say over hiring, firing, and leveling decisions for your team. It is expected that you solicit feedback from your team and peers and drive consensus where possible. But in the end, the say is yours.
Management can be draining, difficult work, even at places that do it well. You will get tactical, strategic, and emotional support from other managers.
You cannot take care of others unless you first practice self-care. You damn well better take vacations. (Real ones.)
You have the right to personal development, career progression, and professional support. We will retain a leadership coach for you.
You do not have to be a manager if you do not want to. No one will ever pressure you.
Manager’s Responsibilities
Recruit and hire and train your team. Foster a sense of solidarity and “teaminess” as well as real emotional safety.
Cultivate an inclusive culture and redistribute opportunity. Fuck a pedigree. Resist monoculture.
Care for the people on your team. Support them in their career trajectory, personal goals, work/life balance, and inter- and intra-team dynamics.
Keep an eye out for people on other teams who aren’t getting the support they need, and work with your leadership and manager peers to fix the situation.
Give feedback early and often. Receive feedback gracefully. Always say the hard things, but say them with love.
Move us relentlessly forward, staying alert for rabbit-holing and work that doesn’t contribute to our goals. Ensure redundancy/coverage of critical areas.
Own the planning process for your team, be accountable for the goals you set. Allocate resources by communicating priorities and requesting support. Add focus or urgency where needed.
Own your time and attention. Be accessible. Actively manage your calendar. Try not to make your emotions everyone else’s problems (but do lean on your own manager and your peers for support).
Make your own personal growth and self-care a priority. Model the values and traits we want employees to pattern themselves after.
I just read this piece, which is basically a very long subtweet about my Friday deploy threads. Go on and read it: I’ll wait.
Here’s the thing. After getting over some of the personal gibes (smug optimism? literally no one has ever accused me of being an optimist, kind sir), you may be expecting me to issue a vigorous rebuttal. But I shan’t. Because we are actually in violent agreement, almost entirely.
I have repeatedly stressed the following points:
I want to make engineers’ lives better, by giving them more uninterrupted weekends and nights of sleep. This is the goal that underpins everything I do.
Anyone who ships code should develop and exercise good engineering judgment about when to deploy, every day of the week
Every team has to make their own determination about which policies and norms are right given their circumstances and risk tolerance
A policy of “no Friday deploys” may be reasonable for now but should be seen as a smell, a sign that your deploys are risky. It is also likely to make things WORSE for you, not better, by causing you to adopt other risky practices (e.g. elongating the interval between merge and deploy, batching changes up in a single deploy)
This has been the most frustrating thing about this conversation: that a) I am not in fact the absolutist y’all are arguing against, and b) MY number one priority is engineers and their work/life balance. Which makes this particularly aggravating:
Lastly there is some strange argument that choosing not to deploy on Friday “Shouldn’t be a source of glee and pride”. That one I haven’t figured out yet, because I have always had a lot of glee and pride in being extremely (overly?) protective of the work/life balance of the engineers who either work for me, or with me. I don’t expect that to change.
Hold up. Did you catch that clever little logic switcheroo? You defined “not deploying on Friday” as being a priori synonymous with “protecting the work/life balance of engineers”. This is how I know you haven’t actually grasped my point, and are arguing against a straw man. My entire point is that the behaviors and practices associated with blocking Friday deploys are in fact hurting your engineers.
I, too, take a lot of glee and pride in being extremely, massively, yes even OVERLY protective of the work/life balance of the engineers who either work for me, or with me.
AND THAT IS WHY WE DEPLOY ON FRIDAYS.
Because it is BETTER for them. Because it is part of a deploy ecosystem which results in them being woken up less and having fewer weekends interrupted overall than if I had blocked deploys on Fridays.
It’s not about Fridays. It’s about having a healthy ecosystem and feedback loop where you trust your deploys, where deploys aren’t a big deal, and they never cause engineers to have to work outside working hours. And part of how you get there is by not artificially blocking off a big bunch of the week and not deploying during that time, because that breaks up your virtuous feedback loop and causes your deploys to be much more likely to fail in terrible ways.
The other thing that annoys me is when people say, primly, “you can’t guarantee any deploy is safe, but you can guarantee people have plans for the weekend.”
Know what else you can guarantee? That people would like to sleep through the fucking night, even on weeknights.
When I hear people say this all I hear is that they don’t care enough to invest the time to actually fix their shit so it won’t wake people up or interrupt their off time, seven days a week. Enough with the virtue signaling already.
You cannot have it both ways, where you block off a bunch of undeployable time AND you have robust, resilient, swift deploys. Somehow I keep not getting this core point across to a substantial number of very intelligent people. So let me try a different way.
Let’s try telling a story.
A tale of two startups
Here are two case studies.
Company X is a three-year-old startup. It is a large, fast-growing multi-tenant platform on a large distributed system with spiky traffic, lots of user-submitted data, and a very green database. Company X deploys the API about once per day, and does a global deploy of all services every Tuesday. Deploys often involve some firefighting and a rollback or two, and Tuesdays often involve deploying and reverting all day (sigh).
Pager volume at Company X isn’t the worst, but usually involves getting woken up a couple times a week, and there are deploy-related alerts after maybe a third of deploys, which then need to be triaged to figure out whose diff was the cause.
Company Z is a three-year-old startup. It is a large, fast-growing multi-tenant platform on a large distributed system with spiky traffic, lots of user-submitted data, and a very green house-built distributed storage engine. Company Z automatically triggers a deploy within 30 minutes of a merge to master, for all services impacted by that merge. Developers at company Z practice observability-driven deployment, where they instrument all changes, ask “how will I know if this change doesn’t work?” during code review, and have a muscle memory habit of checking to see if their changes are working as intended or not after they merge to master.
Deploys rarely result in the pager going off at Company Z; most problems are caught visually by the engineer and reverted or fixed before any paging alert can fire. Pager volume consists of roughly one alert per week outside of working hours, and no one is woken up more than a couple times per year.
Same damn problem, better damn solutions.
If it wasn’t extremely obvious, these companies are my last two jobs, Parse (company X, from 2012-2016) and Honeycomb (company Z, from 2016-present).
They have a LOT in common. Both are services for developers, both are platforms, both are running highly elastic microservices written in golang, both get lots of spiky traffic and store lots of user-defined data in a young, homebrewed columnar storage engine. They were even built by some of the same people (I built infra for both, and they share four more of the same developers).
At Parse, deploys were run by ops engineers because of how common it was for there to be some firefighting involved. We discouraged people from deploying on Fridays, we locked deploys around holidays and big launches. At Honeycomb, none of these things are true. In fact, we literally can’t remember a time when it was hard to debug a deploy-related change.
What’s the difference between Company X and Company Z?
So: what’s the difference? Why are the two companies so dramatically different in the riskiness of their deploys, and the amount of human toil it takes to keep them up?
I’ve thought about this a lot. It comes down to three main things.
1. Observability.
I think that I’ve been reluctant to hammer this home as much as I ought to, because I’m exquisitely sensitive about sounding like an obnoxious vendor trying to sell you things. 😛 (Which has absolutely been detrimental to my argument.)
When I say observability, I mean in the precise technical definition as I laid out in this piece: with high cardinality, arbitrarily wide structured events, etc. Metrics and other generic telemetry will not give you the ability to do the necessary things, e.g. break down by build id in combination with all your other dimensions to see the world through the lens of your instrumentation. Here, for example, are all the deploys for a particular service last Friday:
Each shaded area is the duration of an individual deploy; you can see the counters for each build id as the new versions replace the old ones.
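To make the "arbitrarily wide structured events" idea concrete, here is a toy sketch in Go of what emitting one such event per request might look like. The field names and the `newRequestEvent` helper are hypothetical illustrations, not Honeycomb's actual API; the point is simply that the build id rides along with every other high-cardinality dimension, so it can later be combined with any of them.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Event is a hypothetical wide, structured event: one record per
// request, with arbitrarily many high-cardinality fields attached.
type Event map[string]interface{}

// newRequestEvent attaches the build id alongside every other
// dimension, so errors can later be broken down by deploy.
func newRequestEvent(buildID, userID, endpoint string, durMs float64, err error) Event {
	e := Event{
		"build_id":    buildID, // which deploy served this request
		"user_id":     userID,  // a high-cardinality dimension
		"endpoint":    endpoint,
		"duration_ms": durMs,
	}
	if err != nil {
		e["error"] = err.Error()
	}
	return e
}

func main() {
	e := newRequestEvent("build-4821", "user-77", "/api/query", 12.3, nil)
	out, _ := json.Marshal(e)
	fmt.Println(string(out))
}
```

Because every event carries `build_id` as just another field, "break down by build id in combination with all your other dimensions" is an ordinary query, not a special case.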
2. Observability-driven development.
This is cultural as well as technical. By this I mean instrumenting a couple steps ahead of yourself as you are developing and shipping code. I mean making a cultural practice of asking each other “how will you know if this is broken?” during code review. I mean always going and looking at your service through the lens of your instrumentation after every diff you ship. Like muscle memory.
3. Single merge per deploy.
The number one thing you can do to make your deploys intelligible, other than observability and instrumentation, is this: deploy one changeset at a time, as swiftly as possible after it is merged to master. NEVER glom multiple changesets into a single deploy — that’s how you get into a state where you aren’t sure which change is at fault, or who to escalate to, or if it’s an intersection of multiple changes, or if you should just start bisecting blindly to try and isolate the source of the problem. THIS is what turns deploys into long, painful marathons.
And NEVER wait hours or days to deploy after the change is merged. As a developer, you know full well how this goes. After you merge to master one of two things will happen. Either:
you promptly pull up a window to watch your changes roll out, checking on your instrumentation to see if it’s doing what you intended it to or if anything looks weird, OR
you close the project and open a new one.
When you switch to a new project, your brain starts rapidly evicting all the rich context about what you had intended to do and overwriting it with all the new details about the new project.
Whereas if you shipped that changeset right after merging, then you can WATCH it roll out. And 80-90% of all problems can be, should be caught right here, before your users ever notice — before alerts can fire off and page you. If you have the ability to break down by build id, you can zoom in on any errors that happen to arise, see exactly which dimensions all the errors have in common and how they differ from the healthy requests, and see exactly what the context is for any erroring requests.
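The kind of breakdown being described can be illustrated with a toy sketch: given per-request events that each carry a `build_id` field, tallying errors per build makes a bad deploy jump out immediately. This is an in-memory stand-in for illustration only; a real observability tool queries a column store, not a slice.

```go
package main

import "fmt"

// Event is a toy wide event: one map of fields per request.
type Event map[string]string

// errorsByBuild counts error events per build id, so a bad deploy
// stands out immediately: the new errors all share one build_id.
func errorsByBuild(events []Event) map[string]int {
	counts := make(map[string]int)
	for _, e := range events {
		if e["error"] != "" {
			counts[e["build_id"]]++
		}
	}
	return counts
}

func main() {
	events := []Event{
		{"build_id": "b41", "error": ""},
		{"build_id": "b42", "error": "timeout"},
		{"build_id": "b42", "error": "timeout"},
	}
	fmt.Println(errorsByBuild(events)) // errors cluster on build b42
}
```

With one changeset per deploy, a spike under a single `build_id` points at exactly one diff and one author; with batched deploys, the same spike leaves you guessing among all of them.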
Healthy feedback loops == healthy systems.
That tight, short feedback loop of build/ship/observe is the beating heart of a healthy, observable distributed system that can be run and maintained by human beings, without it sucking your life force or ruining your sleep schedule or will to live.
Most engineers have never worked on a system like this. Most engineers have no idea what a yawning chasm exists between a healthy, tractable system and where they are now. Most engineers have no idea what a difference observability can make. Most engineers are far more familiar with spending 40-50% of their week fumbling around in the dark, trying to figure out where in the system the problem they are trying to fix lives, and what kind of context they need to reproduce it.
Most engineers are dealing with systems where they blindly shipped bugs with no observability, and reports about those bugs started to trickle in over the next hours, days, weeks, months, or years. Most engineers are dealing with systems that are obfuscated and obscure, systems which are tangled heaps of bugs and poorly understood behavior for years compounding upon years on end.
That’s why it doesn’t seem like such a big deal to you to break up that tight, short feedback loop. That’s why it doesn’t fill you with horror to think of merging on Friday morning and deploying on Monday. That’s why it doesn’t appall you to clump together all the changes that happen to get merged between Friday and Monday and push them out in a single deploy.
It just doesn’t seem that much worse than what you normally deal with. You think this raging trash fire is, unfortunately … normal.
How realistic is this, though, really?
Maybe you’re rolling your eyes at me now. “Sure, Charity, that’s nice for you, on your brand new shiny system. Ours has years of technical debt. It’s unrealistic to hold us to the same standard.”
Yeah, I know. It is much harder to dig yourself out of a hole than it is to not create a hole in the first place. No doubt about that.
Harder, yes. But not impossible.
I have done it.
Parse in 2013 was a trash fire. It woke us up every night, we spent a lot of time stabbing around in the dark after every deploy. But after we got acquired by Facebook, after we started shipping some data sets into Scuba, after (in retrospect, I can say) we had event-level observability for our systems, we were able to start paying down that debt and fixing our deploy systems.
We started hooking up that virtuous feedback loop, step by step.
We reworked our CI/CD system so that it built a new artifact after every single merge.
We put developers at the steering wheel so they could push their own changes out.
We got better at instrumentation, and we made a habit of going to look at it during or after each deploy.
We hooked up the pager so it would alert the person who merged the last diff, if an alert was generated within an hour after that service was deployed.
We started finding bugs quicker, faster, and paying down the tech debt we had amassed from shipping code without observability/visibility for many years.
Developers got in the habit of shipping their own changes, and watching them as they rolled out, and finding/fixing their bugs immediately.
It took some time. But after a year of this, our formerly flaky, obscure, mysterious, massively multi-tenant service that was going down every day and wreaking havoc on our sleep schedules was tamed. Deploys were swift and drama-free. We stopped blocking deploys on Fridays, holidays, or any other days, because we realized our systems were more stable when we always shipped consistently and quickly.
Allow me to repeat. Our systems were more stable when we always shipped right after the changes were merged. Our systems were less stable when we carved out times to pause deployments. This was not common wisdom at the time, so it surprised me; yet I found it to be true over and over and over again.
This is literally why I started Honeycomb.
When I was leaving Facebook, I suddenly realized that this meant going back to the Dark Ages in terms of tooling. I had become so accustomed to having the Parse+Scuba tooling and being able to iteratively explore and ask any question without having to predict it in advance. I couldn’t fathom giving it up.
The idea of going back to a world without observability, a world where one deployed and then stared anxiously at dashboards — it was unthinkable. It was like I was being asked to give up my five senses for production — like I was going to be blind, deaf, dumb, without taste or touch.
Look, I agree with nearly everything in the author’s piece. I could have written that piece myself five years ago.
But since then, I’ve learned that systems can be better. They MUST be better. Our systems are getting so rapidly more complex, they are outstripping our ability to understand and manage them using the past generation of tools. If we don’t change our ways, it will chew up another generation of engineering lives, sleep schedules, relationships.
Observability isn’t the whole story. But it’s certainly where it starts. If you can’t see where you’re going, you can’t go very far.
Get you some observability.
And then raise your standards for how systems should feel, and how much of your human life they should consume. Do better.
Because I couldn’t agree with that other post more: it really is all about people and their real lives.
Listen, if you can swing a four-day work week, more power to you (most of us can’t). Any day you aren’t merging code to master, you have no need to deploy either. It’s not about Fridays; it’s about the swift, virtuous feedback loop.
And nobody should be shamed for what they need to do to survive, given the state of their systems today.
But things aren’t gonna get better unless you see clearly how you are contributing to your present pain. And congratulating ourselves for blocking Friday deploys is like congratulating ourselves for swatting ourselves in the face with the flyswatter. It’s a gross hack.
Maybe you had a good reason. Sure. But I’m telling you, if you truly do care about people and their work/life balance: we can do a lot better.
(With 🙏 to Joe Beda, whose brilliant idea for a blog post this was. Thanks for letting me borrow it!)
Interviewing is hard and it sucks.
In theory, it really shouldn’t be. You’re a highly paid professional and your skills are in high demand. This ought to be a meeting between equals to mutually explore what a longer-term relationship might look like. Why take the outcome personally? There are at least as many reasons for you to decide not to join a company as for the company to decide not to hire you, right?
In reality, of course, all the situational cues and incentives line up to make you feel like the whole thing is a referendum on whether or not you personally are Good Enough (smart enough, senior enough, skilled enough, cool enough) to join their fancy club.
People stay at shitty jobs far, far longer than they ought to, just because interviews can be so genuinely crushing to your spirit and sense of self. Even when they aren’t the worst, it can leave a lasting sting when they decline to hire you.
But there is an important asymmetry here. By not hiring someone, I very rarely mean it as a rejection of that person. (Not unless they were, like, mean to the office manager, or directed all their technical questions to the male interviewers.) On the contrary, I generally hold the people we decline to hire — or have had to let go! — in extremely high regard.
So if someone interviews at Honeycomb, I do not want them to walk away feeling stung, hurt, or bad about themselves. I would like them to walk away feeling good about themselves and our interactions, even if one or both of us are disappointed by the outcome. I want them to feel the same way about themselves as I feel about them, especially since there’s a high likelihood that I may want to work with them in the future.
So here are the real, honest-to-god most common reasons why I don’t hire someone.
1. Scarcity.
If you’ve worked at a Google or Facebook before, you may have a certain mental model of how hiring works. You ask the candidate a bunch of questions, and if they do well enough, you hire them. This could not be more different from early stage startup hiring, which is defined in every way by scarcity.
I only have a few precious slots to fill this year, and every single one of them is tied to one or more key company initiatives or goals, without which we may fail as a company. Emily and I spend hours obsessively discussing the profile we are looking for, the smallest possible set of key strengths and skills this hire must have, and the inter-team and intra-team dynamics: what elements are missing or need to be bolstered on the team as it stands. And at the end of the day, there are not nearly as many slots to fill as there are awesome people we’d like to hire. Not even close. Having to choose between several differently wonderful people can be *excruciating*.
2. Diversity.
No, not that kind. (Yes, we care about cultivating a diverse team and support that goal through our recruiting and hiring processes, but it’s not a factor in our hiring decisions.) I mean your level, stage in your career, educational background, professional background, trajectory, areas of focus and strengths. We are trying to build radical new tools for sociotechnical systems; tools that are friendly, intuitive, and accessible to every engineer (and engineering-adjacent profession) in the world.
How well do you think we’re going to do at our goal if the people building it are all ex-Facebook, ex-MIT senior engineers? If everyone has the exact same reference points and professional training, we will all have the same blind spots. Even if our team looks like a fucking Benetton ad.
3. We are assembling a team, not hiring individuals.
We spend at least as much time hashing out what the subtle needs of the team are right now as talking about the individual candidate. Maybe what we need is a senior candidate who loves mentoring with her whole heart, or a language polyglot who can help unify the look and feel of our integrations across ten different languages and platforms. Or maybe we have plenty of accomplished mentors, but the team is really lacking someone with expertise in query profiling and db tuning, and we expect this to be a big source of pain in the coming year. Maybe we realize we have nobody on the team who is interested in management, and we are definitely going to need someone to grow into or be hired on as a manager a year or two from now.
There is no value judgment or hierarchy attached to any of these skills or particulars. We simply need what we need, and you are who you are.
4. I am not confident that we can make you successful in this role at this time.
We rarely turn people down for purely technical reasons, because technical skills can be learned. But there can be some combination of your skills, past experience, geographical location, time zone, experience with working remotely, etc — that just gives us pause. If we cast forward a year, do we think you are going to be joyfully humming along and enjoying yourself, working more-or-less independently and collaboratively? If we can’t convince ourselves this is true, for whatever reasons, we are unlikely to hire you. (But we would love to talk with you again someday.)
5. The team needs someone operating at a different level.
Don’t assume this always means “you aren’t senior enough”. We have had to turn down people at least as often for being too senior as not senior enough. An organization can only absorb so many principal and senior engineers; there just isn’t enough high-level strategic work to go around. I believe happy, healthy teams are composed of a range of levels — you need more junior folks asking naive questions that give senior folks the opportunity to explain themselves and catch their dumb mistakes. You need there to be at least one sweet child who is just so completely stoked to build their very first login page.
A team staffed with nothing but extremely senior developers will be a dysfunctional, bored and contentious team where no one is really growing up or being challenged as they should.
6. We don’t have the kind of work you need or want.
The first time we tried hiring junior developers, we ran into this problem hardcore. We simply didn’t have enough entry-level work for them to do. Everything was frustratingly complex and hard for them, so they weren’t able to operate independently, and we couldn’t spare an engineer to pair with them full time.
This also manifests in other ways. Like, lots of SREs and data engineers would LOVE to work at Honeycomb. But we don’t have enough ops engineering work or data problems to keep them busy full time. (Well — that’s not precisely true. They could probably keep busy. But it wouldn’t be aligned with our core needs as a business, which makes them premature optimizations we cannot afford.)
7. Communication skills.
We select highly for communication skills. The core of our technical interview involves improving and extending a piece of code, then bringing it in the next day to discuss it with your peers. We believe that if you can explain what you did and why, you can definitely do the work, and the reverse is not necessarily true. We also believe that communication skills are at the foundation of a team’s ability to learn from its mistakes and improve as a unit. We value high-performing teams, therefore we select for those skills.
There are many excellent engineers who are not good communicators, or who do not value communication the way we do, and while we may respect you very much, it’s not a great fit for our team.
8. You don’t actually want to work at a startup.
“I really want to work at a startup. Also the things that are really important to me are: work/life balance, predictability, high salary, gold benefits, stability, working from 10 to 5 on the dot, knowing what I’ll be working on for the next month, not having things change unexpectedly, never being on call, never needing to think or care about work out of hours …”
To be clear, it is not a red flag if you care about work/life balance. We care about that too — who the hell doesn’t? But startups are inherently more chaotic and unpredictable, and roles are more fluid and dynamic, and I want to make sure your expectations are aligned with reality.
9. You just want to work for women.
I hate it when I’m interviewing someone and I ask why they’re interested in Honeycomb, and they enthusiastically say “Because it was founded by women!”, and I wait for the rest of it, but that’s all there is. That’s it? Nothing interests you about the problem, the competitive space, the people, the customers … nothing?? It’s fine if the leadership team is what first caught your eye. But it’s kind of insulting to just stop there. Just imagine if somebody asked you out on a date “because you’re a woman”. Low. Fucking. Bar.
10. I truly want you to be happy.
I have no interest in making a hard sell to people who are dubious about Honeycomb. I don’t want to hire people who can capably do the job, but whose hearts are really elsewhere doing other things, or who barely tolerate going to work every day. I want to join with people who see their labor as an extension of themselves, who see work as an important part of their life’s project. I only want you to work here if it’s what’s best for you.
11. I’m not perfect.
We have made the wrong decision before, and will do so again. >_<
As a candidate, it is tempting to feel like you will get the job if you are awesome enough, therefore if you do not get the job it must be because you were insufficiently awesome. But that is not how hiring works — not for highly constrained startups, anyway.
If we brought you in for an interview, we already think you’re awesome. Period. Now we’re just trying to figure out whether your skills intersect the narrow set of gaps we need to fill to succeed this year.
If you could be a fly on the wall, listening to us talk about you, the phrase you would hear over and over is not “how good are they?”, but “what will they need to be successful? can we provide the support they need?” We know this is as much of a referendum on us as it is on you. And we are not perfect.
I made a vow this year to post one blog post a month, then I didn’t post anything at all from May to September. I have some catching up to do. 😑 I’ve also been meaning to transcribe some of the twitter rants that I end up linking back to into blog posts, so if there’s anything you especially want me to write about, tell me now while I’m in repentance mode.
This is one request I happened to make a note of because I can’t believe I haven’t already written it up! I’ve been saying the same thing over and over in talks and on twitter for years, but apparently never a blog post.
The question is: what is the proper role of alerting in the modern era of distributed systems? Has it changed? What are the updated best practices for alerting?
@mipsytipsy I've seen your thoughts on dashboards vs searching but haven't seen many thoughts from you on Alerting. Let me know if I've missed a blog somewhere on that! 🙂
It’s a great question. I want to wax philosophical about some stuff, but first let me briefly outline the way to modernize your alerting best practices:
implement SLOs and/or end-to-end checks that traverse key code paths and correlate to user-impacting events
create a secondary channel (tasks, ticketing system, whatever) for “things that on call should look at soon, but are not impacting users yet” which does not page anyone, but which on call is expected to look at (at least) first thing in the morning, last thing in the evening, and midday
move as many paging alerts as possible to the secondary channel, by engineering your services to auto-remediate or run in degraded mode until they can be patched up
wake people up only for SLOs and health checks that correlate to user-impacting events
Or, in an even shorter formulation: delete all your paging alerts, then page only on end-to-end alerts that mean users are in pain. Use your debugging tools for debugging, and reserve pages for user pain.
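The routing half of those steps can be sketched in a few lines. This is a hypothetical sketch; the signal names are invented for illustration:

```python
# Signals that correlate with user-visible pain; only these may wake a human.
PAGING_SIGNALS = {"slo_burn", "e2e_check_failed"}

def route_alert(alert: dict) -> str:
    """Return 'page' for user-impacting alerts, 'secondary' for everything
    else (tasks/tickets that on call reviews morning, midday, and evening)."""
    fired = {name for name, is_firing in alert.items() if is_firing}
    return "page" if fired & PAGING_SIGNALS else "secondary"
```

A failed end-to-end check pages someone; a lagging replica lands in the secondary channel for on call to look at during waking hours. The engineering work is in making the second category safe: auto-remediation and degraded modes are what let an alert wait until morning.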
To understand why I advocate deleting all your paging alerts, and when it’s safe to delete them, first we need to understand why we have accumulated so many crappy paging alerts over the years.
Monoliths, LAMP stacks, and death by pagebomb
Let’s crib a couple of slides from one of my talks on observability. Here are the characteristics of older monolithic LAMP-stack style systems, and the best practices for running them:
The sad truth is that when all you have is time series aggregates and traditional monitoring dashboards, you aren’t really debugging with science so much as you are relying on your gut and a handful of dashboards, using intuition and scraps of data to try and reconstruct an impossibly complex system state.
This works ok, as long as you have a relatively limited set of failure scenarios that happen over and over again. You can just pattern match from past failures to current data, and most of the time your intuition can bridge the gap correctly. Every time there’s an outage, you post mortem the incident, figure out what happened, build a dashboard “to help us find the problem immediately next time”, create a detailed runbook for how to respond to it, and (often) configure a paging alert to detect that scenario.
Over time you build up a rich library of these responses. So most of the time when you get paged you get a cluster of pages that actually serves to help you debug what’s happening. For example, at Parse, if the error graph had a particular shape I immediately knew it was a redis outage. Or, if I got paged about a high % of app servers all timing out in a short period of time, I could be almost certain the problem was due to mysql connections. And so forth.
Things fall apart; the pagebomb cannot stand
However, this model falls apart fast with distributed systems. There are just too many failures. Failure is constant, continuous, eternal. Failure stops being interesting. It has to stop being interesting, or you will die.
Instead of a limited set of recurring error conditions, you have an infinitely long list of things that almost never happen … except that one time they do. If you invest your time into runbooks and monitoring checks for each one, that time is wasted if the edge case never happens again.
Frankly, any time you get paged about a distributed system, it should be a genuinely new failure that requires your full creative attention. You shouldn’t just be checking your phone, going “oh THAT again”, and flipping through a runbook.
Oh damn this talk looks baller. 😍 "Failure is important, but it is no longer interesting" — @this_hits_home… Netflix once again shining the light on where the rest of us need to get to over the next 3-5 years. 🙌🏅🎬 https://t.co/OY40Y0BTSa
And thus you should actually have drastically fewer paging alerts than you used to.
A better way: observability and SLOs.
Instead of paging alerts for every specific failure scenario, the technically correct answer is to define your SLOs (service level objectives) and page only on those, i.e. when you are going to run out of budget ahead of schedule. But most people aren’t yet operating at this level of sophistication. (SLOs sound easy, but are unbelievably challenging to do well; many great teams have tried and failed. This is why we have built an SLO feature into Honeycomb that does the heavy lifting for you. Currently alpha testing with users.)
If you haven’t yet caught the SLO religion, the alternate answer is that “you should only page on high-level end-to-end alerts, the ones which traverse the code paths that make you money and correspond to user pain”. Alert on the golden signals of request rate, latency, and errors, and make sure to traverse every shard and/or storage type in your critical path.
That’s it. Don’t alert on the state of individual storage instances, or replication, or anything that isn’t user-visible.
(To be clear: by “alert” I mean “paging humans at any time of day or night”. You might reasonably choose to page people during normal work hours, but during sleepy hours most errors should be routed to a non-paging address. Only wake people up for actual user-visible problems.)
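The “running out of budget ahead of schedule” test reduces to a burn-rate calculation. Here is a minimal sketch; the 2x threshold is an invented example, and production SLO alerting typically evaluates multiple time windows rather than one:

```python
def burn_rate(slo_target: float, requests: int, failures: int) -> float:
    """Ratio of the observed failure rate to the failure rate the SLO
    allows. A burn rate above 1.0 means the error budget will run out
    before the window ends if the current rate continues."""
    allowed_failure_rate = 1.0 - slo_target          # e.g. 0.001 for 99.9%
    observed_failure_rate = failures / requests if requests else 0.0
    return observed_failure_rate / allowed_failure_rate

def should_page(slo_target: float, requests: int, failures: int,
                threshold: float = 2.0) -> bool:
    # Page only on a fast burn: here, burning budget at twice the
    # sustainable rate (the threshold is illustrative).
    return burn_rate(slo_target, requests, failures) >= threshold
```

With a 99.9% SLO, 300 failures out of 100,000 requests is a burn rate of 3.0, which pages; 50 failures is a burn rate of 0.5, which doesn’t. Notice what this buys you: one paging alert per SLO, instead of one per failure mode.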
Here’s the thing. The reason we had all those paging alerts was because we depended on them to understand our systems.
Once you make the shift to observability, once you have rich instrumentation and the ability to swiftly zoom in from high level “there might be a problem” to identifying specifically what the errors have in common, or the source of the problem — you no longer need to lean on that scattershot bunch of pagebombs to understand your systems. You should be able to confidently ask any question of your systems, understand any system state — even if you have never encountered it before.
With observability, you debug by systematically following the trail of crumbs back to their source, whatever that is. Those paging alerts were a crutch, and now you don’t need them anymore.
Everyone is on call && on call doesn’t suck.
I often talk about how modern systems require software ownership. The person who is writing the software, who has the original intent in their head, needs to shepherd that code out into production and watch real users use it. You can’t chop that up into multiple roles, dev and ops. You just can’t. Software engineers working on highly available systems need to be on call for their code.
But the flip side of this responsibility belongs to management. If you’re asking everyone to be on call, it is your sworn duty to make sure that on call does not suck. People shouldn’t have to plan their lives around being on call. People shouldn’t have to expect to be woken up on a regular basis. Every paging alert out of hours should be as serious as a heart attack, and this means allocating real engineering resources to keeping tech debt down and noise levels low.
And the way you get there is first invest in observability, then delete all your paging alerts and start over from scratch.