We hear this phrase constantly: “I worked at breaking down silos.” “We need to break down silos.” “What did I do in my last role? I broke down silos.”
It sets my fucking teeth on edge.
What is a ‘silo’, anyway? What specifically wasn’t working well, and how did you solve it? Or, if it was solved by someone else, what was your contribution to the solution? Did you just follow orders, did you personally diagnose the problem, did some of your suggestions pan out?
Solutions to complex problems rarely work on the first go, so … what else did you try? How did you know it wasn’t working, and how did you know when to abandon earlier ideas? It’s fiendishly hard to know whether you’ve given a solution enough time to bake, for people to adjust, so that you can even evaluate whether things work better or worse than before.
Communication is not magic pixie dust
Breaking down silos is supposed to be about increasing communication, removing barriers and roadblocks to collaboration.
But you can’t just blindly throw “more communication” at your teams. Too much communication can be just as much of a problem and a burden as too little. It can distract, and confuse, and create little eddies of information that is incorrect or harmful.
The quantity of the communication isn’t the issue, so much as the quality. Who is talking to whom, and when, and why? How does information flow throughout your company? Who gets left out? Whose input is sought, and when, and why? How can any given individual figure out who to talk to about any given responsibility?
When someone says they are “breaking down silos”, whether in an interview, a panel, or casual conversation, it tells me jack shit about what they actually did.
cliches are a substitute for critical thinking
It’s just like when people say “it’s a culture problem”, or “fix your culture”, or “everything is about people”. These phrases tell me nothing except that the speaker has gone to a lot of conferences and wants to sound cool.
If someone says “breaking down silos”, it immediately generates a zillion questions in my mind. I’m curious, because these problems are genuinely hard and people who solve them are incredibly rare.
Unfortunately, the people who use these phrases are almost never the ones who are out there in the muck and grind, struggling to solve real problems.
When asked, people who have done the hard labor of building better organizations with healthy communication flows, less inefficiency, and alignment around a single mission — people who have gotten all the people rowing in the same direction — tend to talk about the work.
People who haven’t, say they were “breaking down silos.”
There exist some wonderful teams out there who have valid, well-thought-through, legitimate reasons for enforcing “NO FRIDAY DEPLOYS” week in and week out, for not hooking CI/CD up to autodeploy, and for not shipping one person’s changes at a time. Far more often, though, the reasons I hear sound like these:
“Maybe that works for some teams, like baby startups, but not real software systems that people rely on” (false-uniqueness bias)
“We’ve already invested a ton of engineering hours into building a deployment framework that doesn’t ship especially fast and doesn’t ship one change at a time, but it works, and we don’t want to have to redo everything from scratch.” (surrogation, ostrich effect, irrational escalation, IKEA effect, law of the instrument)
“I get why it’s important, but the most important thing right now is for us to ship all of these features. At some point we’ll have the spare time to fix our deploys.” (hyperbolic discounting)
“Deploys are just inherently scary; there’s nothing that can be done about that. You should do them sparingly and with someone monitoring them closely.” (availability bias, dread aversion, functional fixedness)
“There is nothing anyone could say or do to convince me that the best thing I can do as a manager to protect my team’s weekends is not blocking Friday deploys. I just feel it in my gut. I just know.” (… I got nuthin)
We’re humans. 💜 We leap to conclusions with the wetware we have doing the best it can based on heuristics that feel objectively true, but are ultimately just emotional reactions based on past lived experience. And then we retroactively enshrine those goofy gut feelings with the language of noble motive and moral values.
“I tell people not to deploy to production … because I care so deeply about my team and their ability to have a quiet weekend.”
Barf. 🙄 That’s just like saying you tell your kid not to brush his teeth at night, because you care SO DEEPLY about him and his ability to go to bed calm and happy.
Once the retcon engine in your brain gets running, it comes up with all sorts of reasons. Plausible-sounding reasons! But every single one of the arguments in the list above is materially false.
Deploy myths are never going away for good; they appeal to too many of our cognitive biases. But what if there was one simple thing you could do that would invert many of these cognitive biases and cause people to grapple with the question in a new way? What if you could kickstart a recalculation?
My next post will pick up right here. I’ll tell you all about the One Simple Trick you can do to fix your deploys and set you on the virtuous path of high-performing teams.
Til then, here’s what I’ve previously written on the topic.
Availability bias: The tendency to overestimate the likelihood of events with greater “availability” in memory, which can be influenced by how recent the memories are or how unusual or emotionally charged they may be.
Continued influence effect: The tendency to believe previously learned misinformation even after it has been corrected. Misinformation can still influence inferences one generates after a correction has occurred.
Conservatism bias: The tendency to revise one’s belief insufficiently when presented with new evidence.
Default effect: When given a choice between several options, the tendency to favor the default one.
Dread aversion: Just as losses yield double the emotional impact of gains, dread yields double the emotional impact of savouring.
False-uniqueness bias: The tendency of people to see their projects and themselves as more singular than they actually are.
Functional fixedness: Limits a person to using an object only in the way it is traditionally used.
Hyperbolic discounting: Discounting is the tendency for people to have a stronger preference for more immediate payoffs relative to later payoffs. Hyperbolic discounting leads to choices that are inconsistent over time – people make choices today that their future selves would prefer not to have made, despite using the same reasoning.
IKEA effect: The tendency for people to place a disproportionately high value on objects that they partially assembled themselves, such as furniture from IKEA, regardless of the quality of the end product.
Illusory truth effect: A tendency to believe that a statement is true if it is easier to process, or if it has been stated multiple times, regardless of its actual veracity.
Irrational escalation: The phenomenon where people justify increased investment in a decision, based on the cumulative prior investment, despite new evidence suggesting that the decision was probably wrong. Also known as the sunk cost fallacy.
Law of the instrument: An over-reliance on a familiar tool or method, ignoring or under-valuing alternative approaches. “If all you have is a hammer, everything looks like a nail.”
Mere exposure effect: The tendency to express undue liking for things merely because of familiarity with them.
Negativity bias: Psychological phenomenon by which humans have a greater recall of unpleasant memories compared with positive memories.
Non-adaptive choice switching: After experiencing a bad outcome with a decision problem, the tendency to avoid the choice previously made when faced with the same decision problem again, even though the choice was optimal.
Omission bias: The tendency to judge harmful actions (commissions) as worse, or less moral, than equally harmful inactions (omissions).
Ostrich effect: Ignoring an obvious (negative) situation.
Plan continuation bias: Failure to recognize that the original plan of action is no longer appropriate for a changing situation or for a situation that is different than anticipated.
Prevention bias: When investing money to protect against risks, decision makers perceive that a dollar spent on prevention buys more security than a dollar spent on timely detection and response, even when investing in either option is equally effective.
Pseudocertainty effect: The tendency to make risk-averse choices if the expected outcome is positive, but make risk-seeking choices to avoid negative outcomes.
Salience bias: The tendency to focus on items that are more prominent or emotionally striking and ignore those that are unremarkable, even though this difference is often irrelevant by objective standards.
Selective perception bias: The tendency for expectations to affect perception.
Status-quo bias: If no special action is taken, the default action that will happen is that the code will go live. You will need an especially compelling reason to override this bias and manually stop the code from going live, as it would by default.
Slow-motion bias: We feel certain that we are more careful and less risky when we slow down. This is precisely the opposite of the real world risk factors for shipping software. Slow is dangerous for software; speed is safety. The more frequently you ship code, the smaller the diffs you ship, the less dangerous each one actually becomes. This is the most powerful and difficult to overcome of all of our biases, because there is no readily available counter-metaphor for us to use. (Riding a bike is the best I’ve come up with. 😔)
Surrogation: Losing sight of the strategic construct that a measure is intended to represent, and subsequently acting as though the measure is the construct of interest.
Time-saving bias: Underestimations of the time that could be saved (or lost) when increasing (or decreasing) from a relatively low speed and overestimations of the time that could be saved (or lost) when increasing (or decreasing) from a relatively high speed.
Zero-risk bias: Preference for reducing a small risk to zero over a greater reduction in a larger risk.
I’ve fallen way behind on my blog posts — my goal was to write one per month, and I haven’t published anything since MAY. Egads. So here I am dipping into the drafts archives! This one was written in April of 2016, when I was noodling over my CraftConf 2016 talk on “DevOps for Developers” (see slides).
So I got to the part in my talk where I’m talking about how to interview and hire software engineers who aren’t going to burn the fucking house down, and realized I could spend a solid hour on that question alone. That’s why I decided to turn it into a blog post instead.
Stop telling ops people to code better, start telling SWEs to ops better
Our industry has gotten very good at pressing operations engineers to get better at writing code, writing tests, and software engineering in general these past few years. Which is great! But we have not been nearly so good at pushing software engineers to level up their systems skills. Which is unfortunate, because it is just as important.
Most systems suffer from the syndrome of running too much software. Tossing more software into the heap is as likely to cause problems as it is to solve them.
We see this play out at companies stacked with good software engineers who have built horrifying spaghetti messes of their infrastructure, and then commence paging themselves to death.
The only way to unwind this is to reset expectations, and make it clear that
you are still responsible for your code after it’s been deployed to production, and
operational excellence is everyone’s job.
Operations is the constellation of tools, practices, policies, habits, and docs around shipping value to users, and every single one of us needs to participate in order to do this swiftly and safely.
Every software engineering interviewing loop should have an ops component.
Nobody interviews candidates for SRE or ops nowadays without asking some coding questions. You don’t have to be the greatest programmer in the world, but you can’t be functionally illiterate. The reverse is less common: asking software engineers basic, stupid questions about the lifecycle of their code, instrumentation best practices, etc.
It’s common practice at lots of companies now to have a software engineer in the loop for hiring SREs to evaluate their coding abilities. It should be just as common to have an ops engineer in the loop for a SWE hire, especially for any SWE who is being considered for a key senior position. Those are the people you most rely on to be mentors and role models for junior hires. All engineers should embrace the ethos of owning their code in production, and nobody should be promoted or hired into a senior role if they don’t.
And yes, that means all engineers! Even your iOS/Android engineers and website developers should be interested in what happens to their code after they hit deploy. They should care about things like instrumentation, and what kind of data they may need later to debug their problems, and how their features may impact other infrastructure components.
You need to balance out your software engineers with engineers who don’t react to every problem by writing more code. You need engineers who write code begrudgingly, as a last resort. You’ll find these priceless gems in ops and SRE.
ops questions for software engineers
The best questions are broad and start off easy, with plenty of reasonable answers and pathways to explore. Even beginners can give a reasonable answer, while experts can go on talking for hours.
For example: give them the specs for a new feature, and ask them to talk through the infrastructure choices and dependencies to support that feature. Do they ask about things like which languages, databases, and frameworks are already supported by the team? Do they understand what kind of monitoring and observability tools to use? Do they ask about local instrumentation best practices?
Or design a full deployment pipeline together. Probe what they know about generating artifacts, versioning, rollbacks, branching vs master, canarying, rolling restarts, blue/green deploys, etc. How might they design a deploy tool? Talk through the tradeoffs.
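To make “talk through the tradeoffs” concrete, here is the kind of sketch a candidate might whiteboard for the canary step. Everything in it is hypothetical (the host names, the deploy() and error_rate() helpers, the 1% threshold); the point is seeing whether they reason about bake time, rollback, and what “healthy” even means.

```python
import time

CANARY_HOSTS = ["app-canary-1"]                  # hypothetical hosts
FLEET_HOSTS = [f"app-{i}" for i in range(2, 20)]
ERROR_BUDGET = 0.01                              # hypothetical: tolerate up to 1% errors
BAKE_SECONDS = 300                               # let real traffic soak the canary first

def deploy(hosts, version):
    """Placeholder: push the artifact for `version` to each host and restart it."""
    for host in hosts:
        print(f"deploying {version} to {host}")

def error_rate(hosts):
    """Placeholder: ask your monitoring/observability tooling for the error rate."""
    return 0.0

def canary_deploy(version, previous_version):
    deploy(CANARY_HOSTS, version)
    time.sleep(BAKE_SECONDS)                     # give the canary time to see traffic
    if error_rate(CANARY_HOSTS) > ERROR_BUDGET:
        deploy(CANARY_HOSTS, previous_version)   # roll the canary back, stop the line
        return False
    deploy(FLEET_HOSTS, version)                 # promote to the rest of the fleet
    return True
```

A candidate who immediately starts poking at the assumptions (what counts as an error? is five minutes of bake time enough? what happens to in-flight requests during the restart?) is giving you exactly the signal you want.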
Some other good starting points:
“Tell me about the last time you caused a production outage. What happened, how did you find out, how was it resolved, and what did you learn?”
“What are some of your favorite tools for visibility, instrumentation, and debugging?”
“Latency seems to have doubled over the last 6 hours. Where do you start looking, how do you start debugging?”
And this chestnut: “What happens when you type ‘google.com’ into a web browser?” You would be fucking *astonished* how many senior software engineers don’t know a thing about DNS, HTTP, SSL/TLS, cookies, TCP/IP, routing, load balancers, web servers, proxies, and on and on.
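If you want a sense of how much is packed into that one question, here is a rough Python sketch of just the first few layers (DNS, TCP, TLS, then a bare HTTP request), before we even get to cookies, proxies, load balancers, caches, or rendering:

```python
import socket
import ssl

host = "google.com"

# 1. DNS: resolve the hostname to an IP address.
ip = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)[0][4][0]

# 2. TCP: open a connection to port 443.
sock = socket.create_connection((ip, 443), timeout=5)

# 3. TLS: negotiate an encrypted session and verify the server's certificate.
ctx = ssl.create_default_context()
tls = ctx.wrap_socket(sock, server_hostname=host)    # SNI + certificate checks

# 4. HTTP: send a request and read the response status line.
tls.sendall(b"GET / HTTP/1.1\r\nHost: google.com\r\nConnection: close\r\n\r\n")
print(tls.recv(4096).split(b"\r\n", 1)[0])           # e.g. b'HTTP/1.1 301 Moved Permanently'
tls.close()
```

Every single line there is a place where a senior engineer should be able to tell you what can go wrong and how they’d notice.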
Another question I really like is: “what’s your favorite API (or database, or language) and why?” followed up by “… and what are the worst things about it?” (True love doesn’t mean blind worship.)
Remember, you’re exploring someone’s experience and depth here, not giving them a pass-fail quiz. It’s okay if they don’t know it all. You’re also evaluating them on communication skills, which is severely underrated by most people but is actually a key technical skill.
Signals to look for
You’re not looking for perfection. You are teasing out signals for things like, how will this person perform on a team where software engineers are expected to own their code? How much do they know about the world outside the code they write themselves? Are they curious, eager, and willing to learn, or fearful, incurious and begrudging?
Do they expect networks to be reliable? Do they expect databases to respond, retries to succeed? Are they offended by the idea of being on call? Are they overly clever or do they look to simplify? (God, I hate clever software engineers 🙃.)
It’s valuable to get a feel for an engineer’s operational chops, but let’s be clear, you’re doing this for one big reason: to set expectations. By making ops questions part of the interview, you’re establishing from the start that you run an org where operations is valued, where ownership is non-optional. This is not an ivory tower where software engineers can merrily git push and go home for the day and let other people handle the fallout.
It can be toxic when you have an engineer who thinks all ops work is toil and operations engineering is lesser-than. It tends to result in operations work being done very poorly. This is your best chance to let those people self-select out.
You know what, I’m actually feeling uncharacteristically optimistic right now. I’m remembering how controversial some of this stuff was when I first wrote it, five years ago in 2016. Nowadays it just sounds obvious. Like table stakes.
What should one pay for observability? What should your observability stack cost? What should be in your observability stack?
How much observability is enough? How much is too much, or is there such a thing?
Is it better to pay for one product that claims (dubiously) to do everything, or twenty products that are each optimized to do a different part of the problem super well?
It’s almost enough to make a busy engineer say “Screw it, I’m spinning up Nagios”.
(Hey, I said almost.)
All of these service providers can give you sticker shock when you begin investigating them. The biggest reason is always that we aren’t used to considering the price of our own time. We act like it’s “free” to just take an hour and spin something up … we don’t count the cost of maintenance, context switching, and opportunity costs of not using the time to build something of business value. Which is both understandable and forgivable, as a starting point.
Considerably less forgivable is the vagueness–and sometimes outright misdirection and scare tactics–some vendors offer around pricing. It’s not ok for a business to optimize for revenue at the expense of user experience. As users, we have the right to demand transparency and accurate information. As vendors, we have the responsibility to provide it. Any pricing scheme that doesn’t align with best practices and users’ interests will be a drag on reputation and growth.
The core question, rarely addressed outright, is: how much should you pay? In this post I’ll talk about what your observability costs include, and in the next post, what you should consider including in your “observability stack”.
But I’ll give you the answer to your question right off the bat: you should probably spend 20-30% of infra costs on observability.
O11y spend should be 20-30% of infra spend
Rule of thumb: your observability spend should come to 20-30% of your infra spend. (I’ve seen 10% a few times from reasonable-seeming shops, but they have been edge cases and outliers. I have also seen 50% or more, but again, outliers.)
Full disclosure: this isn’t based on any particular science. It’s just based on my experience of 15+ years working in operations engineering, talking to other engineers and managers, and a couple of informal Twitter polls to satisfy my own curiosity.
Nevertheless, it’s a pretty solid rule. There are exceptions, but in general, if you’re spending less than 20%, you’re “saving money” at the expense of engineering time, or being silently dragged underwater by a million little time leaks and quality of service issues — which you could eliminate completely with a bit of investment.
Consider the person who told me proudly that his o11y spend was just 1-3%. (He meant the PagerDuty bill and Pingdom checks, actually.) He wasn’t counting the dedicated hardware for their ELK cluster (80k/month), or the 2-3 extra engineers they had to recruit, hire, and train (250-300k/year apiece) to run the many open source tools they got for “free”.
And ultimately, it didn’t meet their needs very well. Few people knew how to use it, so they leaned on the “observability team” to craft custom views, write scripts and ETL one-offs, and serve as the institutional hive mind and software usability tutors. They could have used better tools, ones under active development by large product teams. They could have used that headcount to create core business value instead.
Engineers cost money
Engineers are expensive. Recruiting them is hard. The good ones are increasingly unwilling to waste time on unnecessary labor. This manager was “saving” maybe a million dollars a year (he mentioned a vendor quote of less than 100k/month)–but spending a couple million more than that in less-visible ways.
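If you do the back-of-the-envelope math on just the line items he mentioned, the “savings” evaporate before you even get to the invisible costs. (The midpoints below are mine; the underlying figures are the ones he quoted.)

```python
# Rough math on the "1-3%" shop above, per year.
elk_hardware    = 80_000 * 12     # dedicated ELK cluster at $80k/month  -> $960,000
extra_engineers = 2.5 * 275_000   # 2-3 engineers at $250-300k apiece    -> ~$687,500

hidden_spend = elk_hardware + extra_engineers
vendor_quote = 100_000 * 12       # the "< $100k/month" quote he was proudly avoiding

print(f"hidden cost of 'free' tooling: ~${hidden_spend:,.0f}/year")   # ~$1,647,500
print(f"vendor quote he turned down:    ${vendor_quote:,.0f}/year")   # $1,200,000
```

And that’s before counting the opportunity cost of all that engineering time, which is where the “couple million more” comes from.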
Worse, he was driving his engineering org into the ground by wasting so much of their time and energy on non-mission-critical work, inferior tooling, one-offs, frustrating maintenance work, etc, all of which had nothing to do with their core business value.
If you want to know if an org hires and retains good engineers, you could do worse than to ask the question: “What tools do you use, and why?”
Good orgs use good tools. They know engineering cycles are their scarcest and most valuable resource, and they want to train maximum firepower on their core business problems.
Mediocre orgs use mediocre tools, have no discipline or consistency around adoption and deprecation, and leak lost engineering cycles everywhere.
So back to our rule of thumb: observability amounting to 20-30% of infra spend is where most shops should fall. This refers to cloud-native infrastructure, using third-party services to instrument and monitor code, with the basics covered — resource utilization graphs, end to end checks, paging, etc.
So, what do I need in my “observability stack”?
What are the basics? Well, obviously “it depends”. It depends on your requirements, your components, your commitments, your budget, sunk costs and skill sets, your teams, and most expensive of all — customer expectations and the cost of violating them. You should think carefully about these things and try to draw a straight line from the business case to the money you spend (or don’t spend). And don’t forget to factor in those invisible human costs.
every dashboard is a sunk cost
every dashboard is an answer to some long-forgotten question
every dashboard is an invitation to pattern-match the past instead of interrogate the present
every dashboard gives the illusion of correlation
every dashboard dampens your thinking

https://t.co/OIEowa1COa
… which stirred up some Feelings for many people. 🙃 So I would like to explain my opinions in more detail.
Static vs dynamic dashboards
First, let’s define the term. When I say “dashboard”, I mean STATIC dashboards, i.e. collections of metrics-based graphs that you cannot click on to dive deeper or break down or pivot. If your dashboard supports this sort of responsive querying and exploration, where you can click on any graph to drill down and slice and dice the data arbitrarily, then breathe easy — that’s not what I’m talking about. Those are great. (I don’t really consider them dashboards, but I have heard a few people refer to them as “dynamic dashboards”.)
Actually, I’m not even “against” static dashboards. Every company has them, including Honeycomb. They’re great for getting a high level sense of system functioning, and tracking important stats over long intervals. They are a good starting point for investigations. Every company should have a small, tractable number of these which are easily accessible and shared by everyone.
Debugging with dashboards: it’s a trap
What dashboards are NOT good at is debugging, or understanding or describing novel system states.
I can hear some of you now: “But I’ve debugged countless super-hard unknown problems using only static dashboards!” Yes, I’m sure you have. If all you have is a hammer, you CAN use it to drive screws into the wall, but that doesn’t mean it’s the best tool. And it takes an extraordinary amount of knowledge and experience to be able to piece together a narrative that translates low-level system statistics into bugs in your software and back. Most software engineers don’t have that kind of systems experience or intuition…and they shouldn’t have to.
Why are dashboards bad for debugging? Think of it this way: every dashboard is an answer to a question someone asked at some point. Your monitoring system is probably littered with dashboards, thousands and thousands of them, most of whose questions have been long forgotten and many of whose source data streams have long since gone silent.
So you come along trying to investigate something, and what do you do? You start skimming through dashboards, eyes scanning furiously, looking for visual patterns — e.g. any spikes that happened around the same time as your incident. That’s not debugging, that’s pattern-matching. That’s … eyeball racing.
if we did math like we do dashboards
Imagine you’re in a math competition, and you get handed a problem to solve. But instead of pulling out your pencil and solving the equation, step by step, you start hollering out guesses.
“27!”
“19992.41!”
“1/4325!”
That’s what flipping through dashboards feels like to me. You’re riffling through a bunch of graphs that were relevant to some long-ago situation, without context or history, without showing their work. Sometimes you’ll spot the exact scenario, and — huzzah! — the number you shout is correct! But when it comes to unknown scenarios, the odds are not in your favor.
Debugging looks and feels very different from flipping through answers. You ask a question, examine the answer, and ask another question based on the result. (“Which endpoints were erroring? Are all of the requests erroring, or only some? What did they have in common?”, etc.)
You methodically put one foot in front of the other, following the trail of bread crumbs, until the data itself leads you to the answer.
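Here’s roughly what that looks like in code against a pile of events you can actually query. The events list and its field names are made up; the point is that each answer shapes the next question.

```python
from collections import Counter

# Hypothetical wide events, one per request, with whatever fields you captured.
events = [
    {"status": 500, "endpoint": "/payment", "error": "db timeout", "build": "v412"},
    {"status": 200, "endpoint": "/home",    "error": None,         "build": "v412"},
    # ... thousands more
]

# Q1: which endpoints are erroring?
errors = [e for e in events if e["status"] >= 500]
print(Counter(e["endpoint"] for e in errors).most_common(5))

# Q2: mostly /payment, it turns out. Are ALL /payment requests failing, or only some?
payment = [e for e in events if e["endpoint"] == "/payment"]
failing = [e for e in payment if e["status"] >= 500]
print(f"{len(failing)} of {len(payment)} /payment requests failing")

# Q3: what do the failing ones have in common?
print(Counter(e["error"] for e in failing).most_common(5))
print(Counter(e["build"] for e in failing).most_common(5))
```

Each query only exists because of the answer to the one before it. That’s the trail of bread crumbs.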
The limitations of metrics and dashboards
Unfortunately, you cannot do that with metrics-based dashboards, because you stripped away the connective tissue of the event back when you wrote the metrics out to disk.
If you happened to notice while skimming through dashboards that your 404 errors spiked at 14:03, and your /payment and /import endpoints started erroring at 14:03, and your database started returning a bunch of mysql errors shortly after 14:00, you’ll probably assume that they’re all related and leap to find more evidence that confirms it.
But you cannot actually confirm that those events are the same ones, not with your metrics dashboards. You cannot drill down from errors to endpoints to error strings; for that, you’d need a wide structured data blob per request. Those might in fact be two or three separate outages or anomalies happening at the same time, or just the tip of the iceberg of a much larger event, and your hasty assumptions might extend the outage for much longer than was necessary.
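The difference in the underlying data is stark. Here’s a toy comparison (the field names and values are illustrative, not any particular vendor’s schema):

```python
# What a metrics store keeps: a name, a timestamp, and a number.
# The request that produced it is gone forever.
metric = ("http.errors.404", "2021-07-14T14:03:00Z", 312)

# What a wide structured event keeps: one record per request, carrying every
# dimension you might later want to correlate.
event = {
    "timestamp":   "2021-07-14T14:03:07.112Z",
    "endpoint":    "/payment",
    "status_code": 404,
    "error":       "mysql: connection refused",
    "db_host":     "mysql-primary-2",
    "build_id":    "v4.1.2",
    "trace_id":    "a9f3c1d0",
    "duration_ms": 1843,
}

# With events you can ask whether the 404 spike, the /payment errors, and the
# mysql errors share a db_host or a build_id. With the metric tuple, you can't.
```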
With metrics, you tend to find what you’re looking for. You have no way to correlate attributes between requests or ask “what are all of the dimensions these requests have in common?”, or to flip back and forth and look at the request as a trace. Dashboards can be fairly effective at surfacing the causes of problems you’ve seen before (raise your hand if you’ve ever been in an incident review where one of the follow up tasks was, “create a dashboard that will help us find this next time”), but they’re all but useless for novel problems, your unknown-unknowns.
Other complaints about dashboards:
They tend to have percentiles like 95th, 99th, 99.9th, 99.99th, etc., which can cover over a multitude of sins. You really want a tool that allows you to see MAX and MIN, and heatmap distributions. (There’s a tiny illustration of this just after the list.)
A lot of dashboards end up getting created that are overly specific to the incident you just had — naming specific hosts, etc — which just creates clutter and toil. This is how your dashboards become that graveyard of past outages.
The most useful approach to dashboards is to maintain a small set of them; cull regularly, and think of them as a list of starter queries for your investigations.
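On that first complaint, here is a tiny, made-up illustration of how percentile graphs can hide a genuinely bad experience that a MAX line or a heatmap would show you instantly:

```python
# 1,000 requests: 995 take ~100ms, five take 30 seconds.
latencies_ms = sorted([100] * 995 + [30_000] * 5)

p95 = latencies_ms[int(0.95 * len(latencies_ms)) - 1]
p99 = latencies_ms[int(0.99 * len(latencies_ms)) - 1]

print(f"p95 = {p95}ms, p99 = {p99}ms")   # both report 100ms -- everything looks fine
print(f"max = {max(latencies_ms)}ms")    # 30000ms -- five users are having a terrible time
```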
“I like to compare the dashboards to the big display in a hospital room: heartbeat, pressure, oxygenation, etc. Those can tell you when a thing is wrong, but the context around the patient chart (and the patient themselves) is what allows interpretation to be effective. If all we have is the display but none of the rest, we’re not getting anywhere close to an accurate picture. The risk with the dashboard is having the metrics but not seeing or knowing about the rest changing.”
In conclusion
Dashboards aren’t universally awful. The overuse of them just encourages sloppy thinking, and static ones make it impossible for you to follow the plot of an outage, or validate your hypotheses. 🤒 There’s too many of them, and not enough shared consensus. (It would help if, like, new dashboards expired within a month if nobody looked at them again.)
If what you have is “nothing”, even shitty dashboards are far better than no dashboards. But shitty dashboards have been the only game in town for far too long. We need more vendors to think about building for queryability, explorability, and the ability to follow a trail of breadcrumbs. Modern systems are going to demand more and more of this approach.
Nothing < Dashboards < a Queryable, Exploratory Interface
If everyone out there who slaps “observability” on their web page also felt the responsibility to add an observability-enabling interface to their tool, one that would let users explore and identify unknown-unknowns, we would all be in a far better place. 🙂