Dan Golant asked a great question today: “Any advice/reading on how to establish a team’s critical path?”
I repeated back: “establish a critical path?” and he clarified:
Yea, like, you talk about buttoning up your “critical path”, making sure it’s well-monitored etc. I think that the right first step to really improving Observability is establishing what business processes *must* happen, what our “critical paths” are. I’m trying to figure out whether there are particularly good questions to ask that can help us document what these paths are for my team/group in Eng.
“Critical path” is one of those phrases that I think I probably use a lot. Possibly because the very first real job I ever had was when I took a break from college and worked at criticalpath.net (“we handle the world’s email”) — and by “work” I mean, “lived in SF for a year when I was 18 and went to a lot of raves and did a lot of drugs with people way cooler than me”. Then I went back to college, the dotcom boom crashed, and the CP CFO and CEO actually went to jail for cooking the books, becoming the only tech execs I am aware of who actually went to jail.
Where was I.
Right, critical path. What I said to Dan is this: “What makes you money?”
Like, if you could only deploy three end-to-end checks that would perform entire operations on your site and ensure they work at all times, what would they be? what would they do? “Submit a payment” is a super common one; another is new user signups.
The idea here is to draw up a list of the things that are absolutely worth waking someone up to fix immediately, night or day, rain or shine. That list should be as compact and well-defined as possible. This allows you to be explicit about the fact that anything else can wait til morning, or some other less-demanding service level agreement.
And typically the right place to start on this list is by asking yourselves: “what makes us money?” as a proxy for the real questions, which are: “what actions allow us to survive as a business? What do our customers care the absolute most about? What makes us us?” That’s your critical path.
Someone will usually seize this opportunity to argue that absolutely any deterioration in service is worth paging someone immediately to fix it, day or night. They are wrong, but it’s good to flush these assumptions out and have this argument kindly out in the open.
(Also, this is really a question about service level objectives. So if you’re asking yourself about the critical path, you should probably consider buying Alex Hidalgo’s book on SLOs, and you may want to look into the Honeycomb SLO product, the only one in the industry that actually implements SLOs as the Google SRE book defines them (thanks Liz!) and lets you jump straight from “what are our customers experiencing?” to “WHY are they experiencing it”, without bouncing awkwardly from aggregate metrics to logs and back and just … hoping … the spikes line up according to your visual approximations.)
Last year I was diagnosed with ADHD, which was a great surprise to me (if no one else). Since then I have been trying to pay attention to things I do that might be, let’s say, outside the norm. One of those things is, apparently, food.
I tend to fixate on one food at a time. When I wake up in the morning, it’s the first and only thing I crave. When I’m hungry, I’m dying for it, and I don’t really experience cravings or desire for other foods, although I will eat them to be polite. The phase tends to last for…six months to two years? and then it shifts to something else.
The target of my appetite has been, at various times in the past: honeycrisp apples with peanut butter (I was DEVASTATED when honeycrisp season ended; other apples weren’t the same), dry cheerios with freeze-dried strawberries, chopped broccoli with sharp cheddar, a cashew chicken dish at a now-defunct Thai restaurant, etc.
One year it was manhattans (makers mark, sweet vermouth and bitters) and I seriously worried I was becoming an alcoholic. 🙈
But since September 19th, 2019, the only thing I have been interested in eating is … boba. Those little brown tapioca balls. I can rattle off to you the top boba places in every city I’ve visited since then (LA has some seriously adventurous ones). And when the world strapped in for quarantine, I was on the verge of panic. What to do??
I finally figured out how to make my own boba. This was NONTRIVIAL. It took the sacrifice of countless pans and far too many nights doubled over with nausea and stomach cramps (read my buying tips, I cannot this stress enough), and months of trial and error. But here is how to get the plump, chewy, slightly sweet boba of your dreams.
Do not buy any boba from China. Do not buy any boba labeled “quick cook”, or boba with instructions that are on the order of 5 minutes. Do not buy any flavored boba. I got violently ill from about half a dozen different brands I ordered randomly off Amazon, all made in China. Some had an odd aftertaste.
Supposedly, the Boba Guys are planning to let us buy the stuff they make domestically in California “soon”. Until then, stick to the stuff that is made of tapioca flour only, and manufactured in Taiwan or The U.S.
Also, the little balls are very fragile and turn to powder in the mail unless they are packed very tightly. This boba, from The Tea Zone is what I buy and recommend buying. Pick up some large diameter straws if you don’t have a stash at home.
You need a big-ass pot of boiling water. The biggest pot you’ve got. I use a big soup pot that holds like 16 or 20 quarts.
If you only have a few quarts of water, you will ruin pans. The tapioca dust turns to gummy that sticks to the sides and bottom and gets baked on like a motherfucker. You want a ratio of SHIT TONS of water to a handful or two of boba.
Fill it up with water to within an inch or two of the top — Bring it to a fast boil, then put your boba in — a cup or two or three, whatever you think you need. Let it boil for 20-25 minutes… only reduce the heat if you have to to keep it from boiling over.
Uncooked boba will have these little white spots in the middle. Once you see only a few of those in a sea of black pearls, turn off the heat. Let it sit in the hot water for another 20-25 minutes.
Then take the pot to the sink, pour off the excess water, fill it back up with cold water, swoosh it around to rinse; pour, fill, rinse a couple times til the balls are rinsed and lukewarm. You don’t have to drain them dry-dry; leave a small bit of water in the pan.
Flavoring and eating.
Add some sweetener — I like brown sugar, but honey is good too, or molasses and white sugar — and let the balls soak for another 30 minutes so they absorb the flavor. Now they are ready to eat. They will only keep for about a day, and don’t refrigerate them or they get gross.
**If you want the syrupy consistency of the gourmet boba shops, leave a little extra water in there, add the sugar, then simmer on low and STIR CONSTANTLY for 5-10 minutes or until it gets syrupy. I cannot stress this enough: rinse the boba first, and do not stop stirring, if you enjoy your pans and want to use them again
The easiest possible recipe (besides eating from the pot with a spoon) is, fill a glass 1/3 of the way with boba, add milk, add brown sugar simple syrup to taste. Add a couple ice cubes if you like your boba on the firm side. Also, try adding a little bit of rum and Frangelico for your bedtime boba.
I work as an engineering manager for a company whose non-technology leadership insists there has to be a way to measure the individual productivity of a software engineer. I have the opposite belief. I don’t believe you can measure the productivity of “professional” careers, or thought workers (ex: how do measure productivity of a doctor, lawyer, or chemist?).
For software engineering in particular, I feel that metrics can be gamed, don’t tell the whole story, or in some cases, are completely arbitrary. Do you measure individual developer productivity? If so, what do you measure, and why do you feel it’s valuable? If you don’t and share similar feelings as mine, how would you recommend I justify that position to non-technology leadership?
Thanks for your time.
Anonymous Engineering Manager
Once upon a time I had a job as a sysadmin, 100% remote, where all work was tracked using RT tasks. I soon realized that the owner didn’t have a lot of independent technical judgment, and his main barometer for the caliber of our contributions was the number of tasks we closed each day.
I became a ticket-closing machine. I’d snap up the quick and easy tasks within seconds. I’d pattern match and close in bulk when I found a solution for a group of tasks. I dove deep into the list of stale tickets looking for ones I could close as “did not respond” or “waiting for response”, especially once I realized there was no penalty for closing the same ticket over and over.
My boss worshiped me. I was bored as fuck. Sigh.
I guess what I’m trying to say is, I am fully in your camp. I don’t think you can measure the “productivity” of a creative professional by assigning metrics to their behaviors or process markers, and I think that attempting to derive or inflict such metrics can inflict a lot of damage.
In fact, I would say that to the extent you can reduce a job to a set of metrics, that job can be automated away. Metrics are for easy problems — discrete, self-contained, well-understood problems. The more challenging and novel a problem, the less reliable these metrics will be.
Your execs should fucking well know this: how would THEY like to be evaluated based on, like, how many emails they send in a day? Do they believe that would be good for the business? Or would they object that they are tasked with the holistic success of the org, and that their roles are too complex to reduce to a set of metrics without context?
This actually makes my blood boil. It is condescending as fuck for leadership to treat engineers like task-crunching interchangeable cogs. It reveals a deep misunderstanding of how sociotechnical systems are developed and sustained (plus authoritarian tendencies, and usually a big dollop of personal insecurity).
But what is the alternative?
In my experience, the “right” answer, i.e. the best way to run consistently high-performing teams, involves some combination of the following:
Outcome-based management that practices focusing on impact, plus
Team level health metrics, combined with
Engineering ladder and regular lightweight reviews, and
Managers who are well calibrated across the org, and encouraged to interrogate their own biases openly & with curiosity.
The right way to look at performance is at the team level. Individual engineers don’t own or maintain code; teams do. The team is the irreducible unit of ownership. So you need to incentivize people to think about work and spending their time cooperatively, optimizing for what is best for the team.
Some of the hardest and most impactful engineering work will be all but invisible on any set of individual metrics. You want people to trust that their manager will have their backs and value their contributions appropriately at review time, if they simply act in the team’s best interest. You do not want them to waste time gaming the metrics or courting personal political favor.
This is one of the reasons that managers need to be technical — so they can cultivate their own independent judgment, instead of basing reviews on hearsay. Because some resources (i.e. your budget for individual bonuses) are unfortunately zero-sum, and you are always going to rely on the good judgment of your engineering leaders when it comes to evaluating the relative impact of individual contributions.
This also is why it’s important for leaders to model the act of openly exploring whether they might be biased in some way:
“I would say that Joe’s contribution this quarter had greater impact than Jane’s. But is that really true? Jane did a LOT of mentoring and other “glue” work, which tends to be under-acknowledged as leadership work, so I just want to make sure I am evaluating this fairly … Does anyone else have a perspective on this? What might I be missing?” — a manager keeping themselves honest in calibrations
I do think every team should be tracking the 4 DORA metrics — time elapsed between merge and deploy, frequency of deploy, time to recover from outages, duration of outages — as well as how often someone is paged outside of business hours. These track pretty closely to engineering productivity and efficiency.
But leadership should do its best to be outcome oriented. The harder the problem, the more senior the contributor, the less business anyone has dictating the details of how or why. Make your agreements, then focus on impact.
This is harder on managers, for sure — it’s easier to count the hours someone spends at their desk or how many lines of code they commit than to develop a nuanced understanding of the quality and timbre of an engineer’s contributions to the product, team and the company over time. It is easier to micromanage the details than to negotiate a mutual understanding of what actually matters, commit to doing your part … and then step away, trusting them to fill in the gaps.
But we should expect this; it’s worth it. It is in those gaps where we feel trusted to act that we find joy and autonomy in our labor, where we do our best work as skilled artisans.
Is it ethical to discriminate in whom you will sell to as a business? What would you do if you found out that the work you do every day was being used to target and kill migrants at the border?
Is it ethical or defensible to pay two people doing the same job different salaries if they live in different locations and have a different cost of living? What if paying everyone the same rate means you are outcompeted by those who peg salaries to local rates, because they can vastly out-hire you?
You’re at the crowded hotel bar after a company-sponsored event, and one of your most valued customers begins loudly venting opinions about minorities in tech that you find alarming and abhorrent. What responsibility do you have, if any? How should you react?
If we were close to running out of money in the hypothetical future, should we do layoffs or offer pay cuts?
It’s not getting any simpler to live in this world, is it? 💔
Ethical problems are hard. Even the ones that seem straightforward on the face of them get stickier the closer you look at them. There are more stakeholders, more caveats, more cautionary tales, more unintended consequences than you can generally see at face value. It’s like fractal hardness, and anyone who thinks it’s easy is fooling themselves.
We’ve been running an experiment at Honeycomb for the past 6 months, where we talk through hypothetical ethical questions like these once a month. Sometimes they are ripped from the headlines, sometimes they are whatever I can invent the night before. I try to send them around in advance. The entire company is invited.**
Honeycomb is not a democracy, nor do I think that would be an effective way to run a company, any more than I think we should design our SDKs by committee or give everyone an equal vote on design mocks.
But I do think that we have a responsibility to act in the best interests of our stakeholders, to the best of our abilities, and to represent our employees. And that means we need to know where the team stands.
That’s one reason. Another is that people make the worst possible decisions when they’re taken off guard, when they are in an unfamiliar situation (and often panicking). Talking through a bunch of nightmare scenarios is a way for us to exercise these decision-making muscles while the stakes are low. We all get to experience what it’s like to hear a problem, have a kneejerk reaction .. then peeling back the onion to reveal layer after layer of dismaying complexities that muddy our snap certainties.
Honeycomb is a pretty transparent company; we believe that companies are created every day by the people who show up to labor together, so those people have a right to know most things. But it’s not always possible or ethically desirable to share all the gritty details that factor into a decision. My hope is that these practice runs help amplify employees’ voices, help them understand the way we approach big decisions, and help everyone make better decisions — and trust each other’s decisions — when things move fast and times get hard.
(Plus, these ethical puzzles are astonishingly fun to work through together. I highly recommend you borrow this idea and try it out at your own company.)
cheers, and please let me know if you do try it ☺️
** We used to limit attendance to the first 6 people to show up, to try and keep the discussion more authentic and less performative. We recently relaxed this rule since it doesn’t seem to matter, peacocking hasn’t really been an issue.
Last night I was talking with Mark Ferlatte about the advice we have given our respective companies in this pandemic era. He shared with me this link, on how to salvage a disastrous day. It’s a good link: you should read it.
My favorite part: “Your feelings will follow your actions. Just do it.”
The hardest part for me is, “Book-end your day. Don’t push it into the midnight hours.” Ugh. I really, really struggle with this because my brain takes a long long time to settle in and get started on a task to the point where I feel like I’m on a roll with it, and once I’m on a roll I do not want to stop until I’m done. Because god knows how long it will be — days? weeks?? — until I can catch this wave again, feel inspired again. But it’s true, if I stay up all night working I’m just setting myself up for a fuzzy, blundery tomorrow.
The advice we gave Honeycombers was differently shaped, though similar in spirit. I’ve had a few people ask me to share it, so here it is.
We formally request …
First, we would like to point out that what you are all being asked to do right now is impossible. Parenting, homeschooling, working, caregiving, correcting misinformed neighbors, being an engaged citizen … it is fifteen people’s worth of work. It is literally impossible.
But hey, it has always been impossible. We have never been able to do everything we want to do — there isn’t enough time. There was never enough time! We succeed as a company not by doing everything on our list, but by saying no to the right things; by NOT-doing enough most things so we can focus on the few things we have identified that matter most. That was true before COVID, it’s just truer now.
So: let’s all focus hard on our top priority. Shed as much of the other stuff as you have to. Shed more. Ask your manager for help figuring out what to shed, until you are down to an amount you can probably manage.
And speaking of focus:
You aren’t operating at full capacity. We all get that right now: none of us are. And nobody expects you to. So please spend zero energy on performing like you’re doing work, or acting extra-responsive, or keeping up a front like things are normal and you’re doing fine. That performance costs you precious energy, while doing nothing to get us closer to our goals.
What we need from you is not performance or busy-busy-ness but your engaged creative self — your active, curious mind engaging with our top problem. I would rather have 30 minutes of your creative energy applied to our biggest problem today than five hours of your distracted split-brain, juggling, trying to keep up with chat and seem like you’re as available per usual today.
So when you’re figuring out your schedule, please optimize for that — focused time on our biggest problem — and then communicate your availability to your team. If you’re a parent and you can only really work three days a week, calendar that. (If you’re not a parent, remember that you too are allowed to feel overwhelmed and underwater. Just because some have it even harder, doesn’t invalidate what you’re going through.)
Take care of yourself
Take care of your loved ones
Say no to as much as you possibly can
Focus on impact
No performative normalcy
Remember: this is temporary 🖤
We are incredibly fortunate — to be here, to have these resources, to have each other. It’s okay to have bad days; this is why we have teams, to carry each other through the hardest spots. Do your best. Everything is going to be okay, more or less.
Last Wednesday I walked into my living room and saw three gay rednecks in hot pink shirts being married as a “throuple” on a TV screen at close range, followed by one of the grooms singing a country song about a woman feeding her husband’s remains to her tigers.
In Blood Rites, Ehrenreich asks why we sacralize war. Not why we fight wars, or why we are violent necessarily, but why we are drawn to the idea of war, why we compulsively imbue it with an aura of honor and noble sacrifice. If you kill one person, you’re a murderer and we shut you out from society; kill ten and you are a monster; but if you kill thousands, or kill on behalf of the state, we give you medals and write books about you.
And it’s not only about scale or being backed by state power. The calling of war brings out the highest and finest experiences our species can know: it sings of heroism and altruism, of discipline, self-sacrifice, common ground, a life lived well in service; of belonging to something larger than one’s self. Even if, as generations of weary returning soldiers have told us, it remains the same old butchery on the ground, the near-religious allure of war is never dented for long in the popular imagination.
What the fuck is going on?
Ehrenreich is impatient with the traditional scholarship, which locates the origin of war in some innate human aggression or turf wars over resources. She is at her dryly funniest when dispatching feminist theories about violence being intrinsically male or “testosterone poisoning”, showing that the bloodthirstiest of the gods have usually been feminine. (Although there are fascinating symmetries between girls becoming women through menstruation, and boys becoming men through … some form of culturally sanctioned ritual, usually involving bloodshed.)
Rather, she shows that our sacred feelings towards blood shed in war are the direct descendents of our veneration of blood shed in sacrifice — originally towards human sacrifice and other animal sacrifice, in a reenactment of our own ever-so-recent role inversion from prey to predator. Prehistoric sacrifice was likely a way of exerting control over our environment and reenacting the death that gave us life through food.
In her theory, humans do not go to war because we are natural predators. Just the blink of an eye ago, on an evolutionary scale, humans were not predators by any means: we were prey. Weak, blind, deaf, slow, clawless and naked; we scrawny, clever little apes we were easy pickings for the many large carnivores who roamed the planet. We scavenged in the wake of predators and worshiped them as gods. We are the nouveaux riche of predators, constantly re-asserting our dominance to soothe our insecurities.
We go to war not because we are predators, in other words, but because we are prey — and this makes us very uncomfortable! War exists as a vestigial relic of when we venerated the shedding of blood and found it holy — as anyone who has ever opened the Old Testament can attest. It was not until the Axial Age that religions of the world underwent a wholesale makeover into a less bloody, more universalistic set of aspirations.
When I first read this book, years ago, I remember picking it up with a roll of the eyes. “Sounds like some overly-metaphorical liberal academic nonsense” or something like that. But I was hooked within ten pages, my mind racing ahead with even more evidence than she marshals in this lively book. It shifted the way I saw many things in the world.
Like horror movies, for example. Or why cannibalism is so taboo. How Jesus became the Son of God, the Brothers’ Grimm, the sacrament of Communion. The primal fear of being food still resonates through our culture in so many sublimated ways.
And whether what you’re watching is “Tiger King” or the Tiger-King-watchers, it will make A LOT more sense after reading this book too.
Stay safe and don’t kill each other,
 Ehrenreich is best known for her stunning book on the precariousness of the middle class, “Nickel and Dimed”, where she tried to subsist for a year only on whatever work she could get with a high school education. Ehrenreich is a journalist, and this is a piece of science journalism, not scientific research; yet it is well-researched and scrupulously cited, and it’s worth noting that she has a PhD in biology and was once a practicing scientist.
First of all, confusion over terminology is understandable, because there are some big players out there actively trying to confuse you! Big Monitoring is indeed actively trying to define observability down to “metrics, logs and traces”. I guess they have been paying attention to the interest heating up around observability, and well… they have metrics, logs, and tracing tools to sell? So they have hopped on the bandwagon with some undeniable zeal.
But metrics, logs and traces are just data types. Which actually has nothing to do with observability. Let me explain the difference, and why I think you should care about this.
“Observability? I do not think it means what you think it means.”
Observability is a borrowed term from mechanical engineering/control theory. It means, paraphrasing: “can you understand what is happening inside the system — can you understand ANY internal state the system may get itself into, simply by asking questions from the outside?” We can apply this concept to software in interesting ways, and we may end up using some data types, but that’s putting the cart before the horse.
It’s a bit like saying that “database replication means structs, longints and elegantly diagrammed English sentences.” Er, no.. yes.. missing the point much?
This is such a reliable bait and switch that any time you hear someone talking about “metrics, logs and traces”, you can be pretty damn sure there’s no actual observability going on. If there were, they’d be talking about that instead — it’s far more interesting! If there isn’t, they fall back to talking about whatever legacy products they do have, and that typically means, you guessed it: metrics, logs and traces.
Metrics in particular are actually quite hostile to observability. They are usually pre-aggregated, which means you are stuck with whatever questions you defined in advance, and even when they aren’t pre-aggregated they permanently discard the connective tissue of the request at write time, which destroys your ability to correlate issues across requests or track down any individual requests or drill down into a set of results — FOREVER.
Which doesn’t mean metrics aren’t useful! They are useful for many things! But they are useful for things like static dashboards, trend analysis over time, or monitoring that a dimension stays within defined thresholds. Not observability. (Liz would interrupt here and say that Google’s observability story involves metrics, and that is true — metrics with exemplars. But this type of solution is not available outside Google as far as we know..)
Ditto logs. When I say “logs”, you think “unstructured strings, written out to disk haphazardly during execution, “many” log lines per request, probably contains 1-5 dimensions of useful data per log line, probably has a schema and some defined indexes for searching.” Logs are at their best when you know exactly what to look for, then you can go and find it.
Again, these connotations and assumptions are the opposite of observability’s requirements, which deals with highly structured data only. It is usually generated by instrumentation deep within the app, generally not buffered to local disk, issues a single event per request per service, is schemaless and indexless (or inferred schemas and autoindexed), and typically containing hundreds of dimensions per event.
Traces? Now we’re getting closer. Tracing IS a big part of observability, but tracing just means visualizing events in order by time. It certainly isn’t and shouldn’t be a standalone product, that just creates unnecessary friction and distance. Hrmm … so what IS observability again, as applied to the software domain??
As a reminder, observability applied to software systems means having the ability to ask any question of your systems — understand any user’s behavior or subjective experience — without having to predict that question, behavior or experience in advance.
Observability is about unknown-unknowns.
At its core, observability is about these unknown-unknowns.
Plenty of tools are terrific at helping you ask the questions you could predict wanting to ask in advance. That’s the easy part. “What’s the error rate?” “What is the 99th percentile latency for each service?” “How many READ queries are taking longer than 30 seconds?”
Monitoring tools like DataDog do this — you predefine some checks, then set thresholds that mean ERROR/WARN/OK.
Logging tools like Splunk will slurp in any stream of log data, then let you index on questions you want to ask efficiently.
APM tools auto-instrument your code and generate lots of useful graphs and lists like “10 slowest endpoints”.
But if you *can’t* predict all the questions you’ll need to ask in advance, or if you *don’t* know what you’re looking for, then you’re in o11y territory.
This can happen for infrastructure reasons — microservices, containerization, polyglot storage strategies can result in a combinatorial explosion of components all talking to each other, such that you can’t usefully pre-generate graphs for every combination that can possibly degrade.
And it can happen — has already happened — to most of us for product reasons, as you’ll know if you’ve ever tried to figure out why a spike of errors was being caused by users on ios11 using a particular language pack but only in three countries, and only when the request hit the image export microservice running build_id 789782 if the user’s last name starts with “MC” and they then try to click on a particular button which then issues a db request using the wrong cache key for that shard.
Gathering the right data, then exploring the data.
Observability starts with gathering the data at the right level of abstraction, organized around the request path, such that you can slice and dice and group and look for patterns and cross-correlations in the requests.
To do this, we need to stop firing off metrics and log lines willynilly and be more disciplined. We need to issue one single arbitrarily-wide event per service per request, and it must contain the *full context* of that request. EVERYTHING you know about it, anything you did in it, all the parameters passed into it, etc. Anything that might someday help you find and identify that request.
Then, when the request is poised to exit or error the service, you ship that blob off to your o11y store in one very wide structured event per request per service.
In order to deliver observability, your tool also needs to support high cardinality and high dimensionality. Briefly, cardinality refers to the number of unique items in a set, and dimensionality means how many adjectives can describe your event. If you want to read more, here is an overview of the space, and more technical requirements for observability
You REQUIRE the ability to chain and filter as many dimensions as you want with infinitely high cardinality for each one if you’re going to be able to ask arbitrary questions about your unknown unknowns. This functionality is table stakes. It is non negotiable. And you cannot get it from any metrics or logs tool on the market today.
Why this matters.
Alright, this is getting pretty long. Let me tell you why I care so much, and why I want people like you specifically (referring to frontend engineers and folks earlier in their careers) to grok what’s at stake in the observability term wars.
We are way behind where we ought to be as an industry. We are shipping code we don’t understand, to systems we have never understood. Some poor sap is on call for this mess, and it’s killing them, which makes the software engineers averse to owning their own code in prod. What a nightmare.
Meanwhile developers readily admit they waste >40% of their day doing bullshit that doesn’t move the business forward. In large part this is because they are flying blind, just stabbing around in the dark.
We all just accept this. We shrug and say well that’s just what it’s like, working on software is just a shit salad with a side of frustration, it’s just the way it is.
But it is fucking not. It is un fucking necessary. If you instrument your code, watch it deploy, then ask “is it doing what I expect, does anything else look weird” as a habit? You can build a system that is both understandable and well-understood. If you can see what you’re doing, and catch errors swiftly, it never has to become a shitty hairball in the first place. That is a choice.
🌟 Butobservability in the original technical sense is a necessary prerequisite to this better world. 🌟
If you can’t break down by high cardinality dimensions like build ids, unique ids, requests, and function names and variables, if you cannot explore and swiftly skim through new questions on the fly, then you cannot inspect the intersection of (your code + production + users) with the specificity required to associate specific changes with specific behaviors. You can’t look where you are going.
Observability as I define it is like taking off the blindfold and turning on the light before you take a swing at the pinata. It is necessary, although not sufficient alone, to dramatically improve the way you build software. Observability as they define it gets you to … exactly where you already are. Which of these is a good use of a new technical term?
And honestly, it’s the next generation who are best poised to learn the new ways and take advantage of them. Observability is far, far easier than the old ways and workarounds … but only if you don’t have decades of scar tissue and old habits to unlearn.
The less time you’ve spent using monitoring tools and ops workarounds, the easier it will be to embrace a new and better way of building and shipping well-crafted code.
Observability matters. You should care about it. And vendors need to stop trying to confuse people into buying the same old bullshit tools by smooshing them together and slapping on a new label. Exactly how long do they expect to fool people for, anyway?
Welcome to the second installment of my advice column! Last time we talked about the emotional impact of going back to engineering after a stint in management. If you have a question you’d like to ask, please email me or DM it to me on twitter.
Hi Charity! I hope it’s ok to just ask you this…
I’m trying to get our company more aware of observability and I’m finding it difficult to convince people to look more into it. We currently don’t have the kind of systems that would require it much – but we will in future and I want us to be ahead of the game.
If you have any tips about how to explain this to developers (who are aware that quality is important but don’t always advocate for it / do it as much as I’d prefer), or have concrete examples of “here’s a situation that we needed observability to solve – and here’s how we solved it”, I’d be super grateful.
If this is too much to ask, let me know too 🙂
I’ve been talking to Abby Bangser a lot recently – and I’m “classifying” observability as “exploring in production” in my mental map – if you have philosophical thoughts on that, I’d also love to hear them 🙂
Yay, what a GREAT note! I feel like I get asked some subset or variation of these questions several times a week, and I am delighted for the opportunity to both write up a response for you and post it for others to read. I bet there are orders of magnitude more people out there with the same questions who *don’t* ask, so I really appreciate those who do. <3
I want to talk about the nuts and bolts of pitching to engineering teams and shepherding technical decisions like this, and I promise I will offer you some links to examples and other materials. But first I want to examine some of the assumptions in your note, because they elegantly illuminate a couple of common myths and misconceptions.
Myth #1: you don’t need observability til you have problems of scale
First of all, there’s this misconception that observability is something you only need when you have really super duper hard problems, or that it’s only justified when you have microservices and large distributed systems or crazy scaling problems. No, no no nononono.
There may come a point where you are ABSOLUTELY FUCKED if you don’t have observability, but it is ALWAYS better to develop with it. It is never not better to be able to see what the fuck you are doing! The image in my head is of a hiker with one of those little headlamps on that lets them see where they’re putting their feet down. Most teams are out there shipping opaque, poorly understood code blindly — shipping it out to systems which are themselves crap snowballs of opaque, poorly understood code. This is costly, dangerous, and extremely wasteful of engineering time.
Ever seen an engineering team of 200, and struggled to understand how the product could possibly need more than one or two teams of engineers? They’re all fighting with the crap snowball.
Developing software with observability is better at ANY scale. It’s better for monoliths, it’s better for tiny one-person teams, it’s better for pre-production services, it’s better for literally everyone always. The sooner and earlier you adopt it, the more compounding value you will reap over time, and the more of your engineers’ time will be devoted to forward progress and creating value.
Myth #2: observability is harder and more technically advancedthan monitoring
Actually, it’s the opposite — it’s much easier. If you sat a new grad down and asked them to instrument their code and debug a small problem, it would be fairly straightforward with observability. Observability speaks the native language of variables, functions and API endpoints, the mental model maps cleanly to the request path, and you can straightforwardly ask any question you can come up with. (A key tenet of observability is that it gives an engineer the ability to ask any question, without having had to anticipate it in advance.)
With metrics and logging libraries, on the other hand, it’s far more complicated.you have to make a bunch of awkward decisions about where to emit various types of statistics, and it is terrifyingly easy to make poor choices (with terminal performance implications for your code and/or the remote data source). When asking questions, you are locked in to asking only the questions that you chose to ask a long time ago. You spend a lot of time translating the relationships between code and lowlevel systems resources, and since you can’t break down by users/apps you are blocked from asking the most straightforward and useful questions entirely!
Doing it the old way Is. Fucking. Hard. Doing it the newer way is actually much easier, save for the fact that it is, well, newer — and thus harder to google examples for copy-pasta. But if you’re saturated in decades of old school ops tooling, you may have some unlearning to do before observability seems obvious to you.
Myth #3: observability is a purely technical solution
To be clear, you can just add an observability tool to your stack and go on about your business — same old things, same old way, but now with high cardinality!
You can, but you shouldn’t.
These are sociotechnical systems and they are best improved with sociotechnical solutions. Tools are an absolutely necessary and inextricable part of it. But so are on call rotations and the fundamental virtuous feedback loop of you build it, you run it. So are code reviews, monitoring checks, alerts, escalations, and a blameless culture. So are managers who allocate enough time away from the product roadmap to truly fix deep technical rifts and explosions, even when it’s inconvenient, so the engineers aren’t in constant monkeypatch mode.
I believe that observability is a prerequisite for any major effort to have saner systems, simply because it’s so powerful being able to see the impact of what you’ve done. In the hands of a creative, dedicated team, simply wearing a headlamp can be transformational.
Observability is your five senses for production.
You’re right on the money when you ask if it’s about exploring production, but you could also use words that are even more basic, like “understanding” or “inspecting”. Observability is to software systems as a debugger is to software code. It shines a light on the black box. It allows you to move much faster, with more confidence, and catch bugs much sooner in the lifecycle — before users have even noticed. It rewards you for writing code that is easy to illuminate and understand in production.
So why isn’t everyone already doing it? Well, making the leap isn’t frictionless. There’s a minimal amount of instrumentation to learn (easier than people expect, but it’s nonzero) and then you need to learn to see your code through the lens of your own instrumentation. You might need to refactor your use of older tools, such as metrics libraries, monitoring checks and log lines. You’ll need to learn another query interface and how it behaves on your systems. You might find yourself amending your code review and deploy processes a bit.
Nothing too terrible, but it’s all new. We hate changing our tool kits until absolutely fucking necessary. Back at Parse/Facebook, I actually clung to my sed/awk/shell wizardry until I was professionally shamed into learning new ways when others began debugging shit faster than I could. (I was used to being the debugger of last resort, so this really pissed me off.) So I super get it! So let’s talk about how to get your team aligned and hungry for change.
Okay okay okay already, how do I get my team on board?
If we were on the phone right now, I would be peppering you with a bunch of questions about your organization. Who owns production? Who is on call? Who runs the software that devs write? What is your deploy process, and how often does it get updated, and by who? Does it have an owner? What are the personalities of your senior folks, who made the decisions to invest in the current tools (and what are they), what motivates them, who are your most persuasive internal voices? Etc. Every team is different. <3
There’s a virtuous feedback loop you need to hook up and kickstart and tweak here, where the people with the original intent in their heads (software engineers) are also informed and motivated, i.e. empowered to make the changes and personally impacted when things are broken. I recommend starting by putting your software engineers on call for production (if you haven’t). This has a way of convincing even the toughest cases that they have a strong personal interest in quality and understandability.
Pay attention to your feedback loop and the alignment of incentives, and make sure your teams are given enough time to actually fix the broken things, and motivation usually isn’t a problem. (If it is, then perhaps another feedback loop is lacking: your engineers feeling sufficiently aligned with your users and their pain. But that’s another post.)
Technical ownership over technical outcomes
I appreciate that you want your team to own the technical decisions. I believe very strongly that this is the right way to go. But it doesn’t mean you can’t have influence or impact, and particularly in times like this.
It is literally your job to have your head up, scanning the horizon for opportunities and relevant threats. It’s their job to be heads down, focusing on creating and delivering excellent work. So it is absolutely appropriate for you to flag something like observability as both an opportunity and a potential threat, if ignored.
If I were in your situation and wanted my team to check out some technical concept, I might send around a great talk or two and ask folks to watch it, and then maybe schedule a lunchtime discussion. Or I might invite a tech luminary in to talk with the team, give a presentation and answer their questions. Or schedule a hack week to apply the concept to a current top problem, or something else of that nature.
But if I really wanted them to take it fucking seriously, I would put my thumb on the scale. I would find myself a champion, load them up with context, and give them ample time and space to skill up, prototype, and eventually present to the team a set of recommendations. (And I would stay in close contact with them throughout that period, to make sure they didn’t veer too far off course or lose sight of my goals.)
Get a champion.
Ideally you want to turn the person who is most invested in the old way of doing things — the person who owns the ELK cluster, say, or who was responsible for selecting the previous monitoring toolkit, or the goto person for ops questions — from your greatest obstacle into your proxy warrior. This only works if you know that person is open-minded and secure enough to give it a fair shot & publicly change course, has sufficiently good technical judgment to evaluate and project into the future, and has the necessary clout with their peers. If they don’t, or if they’re too afraid to buck consensus: pick someone else.
Give them context.
Take them for a long walk. Pour your heart and soul out to them. Tell them what you’ve learned, what you’ve heard, what you hope it can do for you, what you fear will happen if you don’t. It’s okay to get personal and to admit your uncertainties. The more context they have, the better the chance they will come out with an outcome you are happy with. Get them worried about the same things that worry you, get them excited about the same possibilities that excite you. Give them a sense of the stakes.
And don’t forget to tell them why you are picking them — because they are listened to by their peers, because they are already expert in the problem area, because you trust their technical judgment and their ability to evaluate new things — all the reasons for picking them will translate well into the best kind of flattery — the true kind.
Give them a deadline.
A week or two should be plenty. Most likely, the decision is not going to be unilaterally theirs (this also gives you a bit of wiggle room should they come back going “ah no ELK is great forever and ever”), but their recommendations should carry serious weight with the team and technical leadership. Make it clear what sort of outcome you would be very pleased with (e.g. a trial period for a new service) and what reasons you would find compelling for declining to pursue the project (i.e. your tech is unsupported, cost prohibitive, etc). Ideally they should use this time to get real production data into the services they are testing out, so they can actually experience and weigh the benefits, not just read the marketing copy.
As a rule of thumb, I always assume that managers can’t convince engineers to do things: only other engineers can. But what you can do instead is set up an engineer to be your champion. And then just sit quietly in the corner, nodding, with an interested look on your face.
The nuclear option
You have one final option. If there is no appropriate champion to be found, or insufficient time, or if you have sufficient trust with the team that you judge it the right thing to do: you can simply order them to do something your way. This can feel squicky. It’s not a good habit to get into. It usually results in things being done a bit slower, more reluctantly, more half-assedly. And you sacrifice some of your power every time you lean on your authority to get your team to do something.
But it’s just as bad for a leader to take it off the table entirely.
Sometimes you will see things they can’t. If you cannot wield your power when circumstances call for it, then you don’t fucking have real power — you have unilaterally disarmed yourself, to the detriment of your org. You can get away with this maybe twice a year, tops.
But here’s the thing: if you order something to be done, and it turns out in the end that you were right? You earn back all the power you expended on it plus interest. If you were right, unquestionably right in the eyes of the team, they will respect you more for having laid down the law and made sure they did the right thing.
One of my stretch goals for 2019 was to start writing an advice column. I get a lot of questions about everything under the sun: observability, databases, career advice, management problems, what the best stack is for a startup, how to hire and interview, etc. And while I enjoy this, having a high opinion of my own opinions and all, it doesn’t scale as well as writing essays. I do have a (rather all-consuming) day job.
So I’d like to share some of the (edited and lightly anonymized) questions I get asked and some of the answers I have given. With permission, of course. And so, with great appreciation to my anonymous correspondent for letting me publish this, here is one.
I’ve been in tech for 25 years. I don’t have a degree, but I worked my way up from menial jobs to engineering, and since then I have worked on some of the biggest sites in the world. I have been offered a management role many times, but every time I refused. Until about two years ago, when I said “fuck it, I’m almost 40; why not try.”
I took the job with boundless enthusiasm and motivation, because the team was honestly a mess. We were building everything on-prem, and ops was constantly bullying developers over their supposed incompetence. I had gone to conferences, listened to podcasts, and read enough blog posts that my head was full of “DevOps/CloudNative/ServiceOriented//You-build-it-you-run-it/ServantLeaders” idealism. I knew I couldn’t make it any worse, and thought maybe, just maybe I could even make it better.
Soon after I took the job, though, there were company-wide layoffs. It was not done well, and morale was low and sour. People started leaving for happier pastures. But I stayed. It was an interesting challenge, and I threw my heart and soul into it.
For two years I have stayed and grinded it out: recruiting (oh that is so hard), hiring, and then starting a migration to a cloud provider, and with the help of more and more people on the new team, slowly shifted the mindset of the whole engineering group to embrace devops best practices. Now service teams own their code in production and are on-call for them, migrate themselves to the cloud with my team supporting them and building tools for them. It is almost unrecognizable compared to where we were when I began managing.
A beautiful story isn’t it? I hope you’re still reading. 🙂
Now I have to say that with my schedule full of 1:1s, budgeting, hiring, firing, publishing papers of mission statements and OKRs, shaping the teams, wielding influence, I realized that I enjoyed none of the above. I read your 17 reasons not to be a manager, and I check so many boxes. It is a pain in the ass to constantly listen to people’s egos, talk to them and keep everybody aligned (which obviously never happens). And of course I am being crushed between top-down on-the-spot business decisions and bottom-up frustration of poorly executed engineering work under deadlines. I am also destroyed by the mistrust and power games I am witnessing (or involved in, sometimes). while I long for collaboration and trust. And of course when things go well my team gets all the praise, and when things go wrong I take all the blame. I honestly don’t know how one can survive without the energy provided by praise and a sense of achievement.
All of the above makes me miss being an IC (Individual Contributor), where I could work for 8 hours straight without talking to anyone, build stuff, say what I wanted when I wanted, switch jobs if I wasn’t happy, and basically be a little shit like the ones you mention in your article.
But when I think about doing it, I get stuck. I don’t know if I would be able to do it again, or if I could still enjoy it. I’ve seen too many things, I’ve tasted what it’s like to be (sometimes) in control, and I did have a big impact on the company’s direction over time. I like that. If I went back to being an IC, I would feel small and meaningless, like just another cog in the machine. And of course, being 40-ish, I will compete with all those 20-something smartasses who were born with kubernetes.
Thank you for reading. Could you give me your thoughts on this? In any case, it was good to get it off my chest.
Holy shitballs! What an amazing story! That is an incredible achievement in just two years, let alone as a rookie manager. You deserve huge props for having the vision, the courage, and the tenacity to drive such a massive change through.
Of COURSE you’re feeling bored and restless. You didn’t set out on a glorious quest for a life of updating mission statements and OKRs, balancing budgets, tending to people’s egos and fluffing their feelings, tweaking job descriptions, endless 1x1s and meetings meetings meetings, and the rest of the corporate middle manager’s portfolio. You wanted something much bigger. You wanted to change the world. And you did!
But now you’ve done it. What’s next?
First of all, YOUR COMPANY SUCKS. You don’t once mention your leadership — where are they in all this? If you had a good manager, they would be encouraging you and eagerly lining up a new and bigger role to keep you challenged and engaged at work. They are not, so they don’t deserve you. Fuck em. Please leave.
Another thing I am hearing from you is, you harbor no secret desire to climb the managerial ranks at this time. You don’t love the daily rhythms of management (believe it or not, some do); you crave novelty and mastery and advancement. It sounds like you are willing to endure being a manager, so long as that is useful or required in order to tackle bigger and harder problems. Nothing wrong with that! But when the music stops, it’s time to move on. Nobody should be saddled with a manager whose heart isn’t in the work.
You’re at the two year mark. This is a pivotal moment, because it’s the beginning of the end of the time when you can easily slip back into technical work. It will get harder and harder over the next 2-3 years, and at some point you will no longer have the option.
Picking up another technical role is the most strategic option, the one that maximizes your future opportunities as a technical leader. But you do not seem excited by this option; instead you feel many complex and uncomfortable things. It feels like going backwards. It feels like losing ground. It feels like ceding status and power.
“Management isn’t a promotion, it’s a career change.”
But if management is not a promotion, then going back to an engineering role should not feel like a demotion! What the fuck?!
It’s one thing to say that. Whether it’s true or not is another question entirely, a question of policy and org dynamics. The fact is that in most places, most of the power does go to the managers, and management IS a promotion. Power flows naturally away from engineers and towards managers unless the org actively and vigorously pushes back on this tendency by explicitly allocating certain powers and responsibilities to other roles.
I’m betting your org doesn’t do this. So yeah, going back to being an IC WILL be a step down in terms of your power and influence and ability to set the agenda. That’s going to feel crappy, no question. We humans hate that.
You cannot go back to doing exactly what you did before, for the very simple reason that you are not the same person. You are going to be attuned to power dynamics and ways of influencing that you never were before — and remember, leadership is primarily exercised through influence, not explicit authority.Senior ICs who have been managers are supremely powerful beings, who tend to wield outsize influence. Smart managers will lean on them extensively for everything from shadow management and mentorship to advice, strategy, etc. (Dumb managers don’t. So find a smart manager who isn’t threatened by your experience.)
You’re a short-timer here, remember? Your company sucks. You’re just renewing your technical skills and pulling a paycheck while finding a company that will treat you better, that is more aligned with your values.
Lastly (and most importantly), I have a question. Why did you need to become a manager in order to drive sweeping technical change over the past two years? WHY couldn’t you have done it as a senior IC? Shouldn’t technical people be responsible for technical decisions, and people managers responsible for people decisions? Could this be your next challenge, or part of it? Could you go back to being an engineer, equipped with your shiny new powers of influence and mystical aura of recent management experience, and use it to organize the other senior ICs to assert their rightful ownership over technical decisions? Could you use your newfound clout with leadership and upper management to convince them that this will help them recruit and retain better talent, and is a better way to run a technical org — for everyone?
I believe this is a better way, but I have only ever seen these changes happen when agitated for and demanded by the senior ICs. If the senior ICs don’t assert their leadership, managers are unlikely to give it to them. If managers try, but senior ICs don’t inhabit their power, eventually the managers just shrug and go back to making all the decisions. That is why ultimately this is a change that must be driven and owned — at a minimum co-owned — by the senior individual contributors.
I hope you can push back against that fear of being small and meaningless as an individual contributor. The fact that it very often is this way, especially in strongly hierarchical organizations, does not mean that it has to be this way; and in healthy organizations it is not this way. Command-and-control systems are not conducive to creative flourishing. We have to fight the baggage of the authoritarian structures we inherited in order to make better ones.
Organizations are created afresh each and every day — not created for us, but by us. Help create the organization you want to work at, where senior people are respected equally and have domains of ownership whether they manage people or technology. If your current gig won’t value that labor, find one that will..
They exist. And they want to hire you.
Lots of companies are DYING to hire this kind of senior IC, someone who is still hands on yet feels responsibility for the team as a whole, who knows the business side, who knows how to mentor and craft a culture and can herd cats when nec
There are companies that know how to use ICs at the strategic level, even executive level. There are bosses who will see you not as a threat, but as a *huge asset* they can entrust with monumental work.
As a senior contributor who moves fluidly between roles, you are especially well-equipped to help shape a sociotechnical organization. Could you make it your mission to model the kind of relationship you want to see between management and ICs, whichever side you happen to be on? We need more people figuring out how to build organizations where management is not a promotion, just a change of career, and where going back and forth carries no baggage about promotions and demotions. Help us.
And when you figure it out, please don’t keep it to yourself. Expand your influence and share your findings by writing your experiences in blog posts, in articles, in talks. Tell stories. Show people people how much better it is this way. Be so magnificently effective and mysteriously influential as a senior IC that all the baby engineers you work with want to grow up to be just like you.
Hope this helps.
P.S. — Oh and stop fretting about “competing” with the 20-somethings kuberneteheads, you dork. You have been learning shit your whole career and you’ll learn this shit too. The tech is the easy part. The tech will always be the easy part. 🙂
Over a year and a half ago, I wrote up a post about the rights and responsibilities due any engineer at Honeycomb. At the time we were in the middle of a growth spurt, had just hired several new engineers, and I was in the process of turning over day-to-day engineering management over to Emily. Writing things down helped me codify what I actually cared about, and helped keep us true to our principles as we grew.
Tacked on to the end of the post was a list of manager responsibilities, almost as an afterthought. Many people protested, “don’t managers get any rights??” (and naturally I snapped “NO! hahahahahha”)
I always intended to circle back and write a followup post with the rights and responsibilities for managers. But it wasn’t til recently, as we are gearing up for another hiring spurt and have expanded our managerial ranks, that it really felt like its time had come.
The time has come, the time is now, as marvin k. mooney once said. Added the bill of rights, and updated and expanded the list of responsibilities. Thanks Emily Nakashima for co-writing it with me.
Manager’s Bill of Rights
You shall receive honest, courageous, timely feedback about yourself and your team, from your reports, your peers, and your leaders. (No one is exempt from feeding the hungry hungry feedback hippo! NOO ONNEEEE!) 🦛🦛🦛🦛🦛🦛🦛
Management will be treated with the same respect and importance as individual work.
You have the final say over hiring, firing, and leveling decisions for your team. It is expected that you solicit feedback from your team and peers and drive consensus where possible. But in the end, the say is yours.
Management can be draining, difficult work, even at places that do it well. You will get tactical, strategic, and emotional support from other managers.
You cannot take care of others unless you first practice self-care. You damn well better take vacations. (Real ones.)
You have the right to personal development, career progression, and professional support. We will retain a leadership coach for you.
You do not have to be a manager if you do not want to. No one will ever pressure you.
Recruit and hire and train your team. Foster a sense of solidarity and “teaminess” as well as real emotional safety.
Cultivate an inclusive culture and redistribute opportunity. Fuck a pedigree. Resist monoculture.
Care for the people on your team. Support them in their career trajectory, personal goals, work/life balance, and inter- and intra-team dynamics.
Keep an eye out for people on other teams who aren’t getting the support they need, and work with your leadership and manager peers to fix the situation.
Give feedback early and often. Receive feedback gracefully. Always say the hard things, but say them with love.
Move us relentlessly forward, staying alert for rabbit-holing and work that doesn’t contribute to our goals. Ensure redundancy/coverage of critical areas.
Own the planning process for your team, be accountable for the goals you set. Allocate resources by communicating priorities and requesting support. Add focus or urgency where needed.
Own your time and attention. Be accessible. Actively manage your calendar. Try not to make your emotions everyone else’s problems (but do lean on your own manager and your peers for support).
Make your own personal growth and self-care a priority. Model the values and traits we want employees to pattern themselves after.