Questionable Advice: “My boss says we don’t need any engineering managers. Is he right?”

January 5, 2024January 10, 2024 mipsytipsyadvice, culture, engineering management, hierarchy, organizations, sociotechnical18 Comments

I recently joined a startup to run an engineering org of about 40 engineers. My title is VP Engineering. However, I have been having lots of ongoing conflict with the CEO (a former engineer) around whether or not I am allowed to have or hire any dedicated engineering managers. Right now, the engineers are clustered into small teams of 3-4, each of which has a lead engineer — someone who leads the group, but whose primary responsibility is still writing code and shipping product.

I have headcount to hire more engineers in the coming year, but no managers. My boss says we are a startup and can’t afford such luxuries. It seems obvious to me that we need engineering managers, but to him, it seems just as obvious that managers are unnecessary overhead and that all hands should be on deck writing code at our stage.

I don’t know how to make that argument. It seems so obvious to me that I actually struggle to put it into words or make the case for why we should hire EMs. Help?

— Unnecessary Overhead(?!?)

Oh boy, there’s a lot to unpack here.

It is unsurprising to me that your CEO does not understand why managers exist, given that he does not seem to understand why organizational structures exist. 🙈 Why is he micromanaging how you are structuring your org or what roles you are allowed to fill? He hired you to do a job, and he’s not letting you do it. He can’t even explain why he isn’t letting you do it. This does not bode well.

But I do think it’s an interesting question. So let’s pretend he isn’t holding your ability to do your damn job hostage until you defend yourself to his satisfaction. 😒

I can think of two ways to make the case for engineering managers: one is rather complicated, from first principles, and the other very simple, but perhaps unsatisfying.

I personally have a … vigorous … knee-jerk response to authority; I hate being told what to do. It’s only recently that I’ve found my way to an understanding of hierarchy that feels healthy and practical, and that was by looking at it through the lens of systems theory.

Why does hierarchy exist in organizations?

It makes sense that hierarchy comes with a lot of baggage. Many of us have had bad experience with managers — indeed, entire organizations — where hierarchy was used as a tool of oppression, where people rose up the leadership ranks by hoarding information and playing dominance games, and decisions got made by pulling rank.

Working at a place like that fucking sucks. Who wants to invest their creativity and life force into a place that feels like a Dilbert cartoon, knowing how little it will be valued or reciprocated, and that it will slowly but surely get crushed out of you?

But hierarchy is not intrinsically authoritarian. Hierarchy did not originate as a political structure that humans invented for controlling and dominating one another, it is in fact a property of self-organizing systems, and it emerges for the benefit of the subsystems. In fact, hierarchy is absolutely critical to the adaptability, resiliency, and scalability of complex systems.

Let’s start with few basic facts about systems, for anyone that may be unfamiliar.

Hierarchy is a property of self-organizing systems

A system is “a network of interdependent components that work together to try to accomplish a common aim” (W. Edward Deming). A pile of sand is not a system, but a car is a system; if you take out its gas tank, the car cannot achieve its aim.

A subsystem is a collection of elements with a smaller aim inside a larger system. There can be many levels of subsystems that operate interdependently. The subsystems always work to support the needs of the larger system; if the subsystem instead optimizes for its own best interests, the whole system can fail (this is where the term “suboptimize” comes from 😄).

A system is self-organizing if it has the ability to make itself more complex, by diversifying, adapting, and improving itself. As systems self-organize and their complexity increases, they tend to generate hierarchy — an arrangement of systems and subsystems. In a stable, resilient and efficient system, subsystems can largely take care of themselves, regulate themselves, and serve the needs of the larger system, while the larger system coordinates between subsystems and helps them perform better.

Hierarchy minimizes the costs of coordination and reduces the amount of information that any given part of the system has to keep track of, preventing information overload. Information transfer and relationships within a subsystem are much more dense and have fewer delays than information transfer or relationships between subsystems.

(This should all sound pretty familiar to any software engineer. Modularization, amirite?? 😍)

Applying this definition, we can say that a manager’s job is to coordinate between teams and help their team perform better.

The false binary of sociotechnical systems

You’ve probably heard this canard: “Engineers do the technical work, managers do the people work.” I hate it. ☺️ I think it misconstrues the fundamental nature of sociotechnical systems. The “socio” and “technical” of sociotechnical systems are not neatly separable, they are interwoven and interdependent. There is actually precious little that is purely technical work or purely people work; there is a metric shitload of glue work that draws upon both skill sets.

Consider a very partial list of tasks done by any functional engineering org, besides writing code:

Recruiting, networking, interviewing, training interviewers, synthesizing feedback, writing job descriptions and career ladders
Project management for each project or commitment, prioritizing backlog, managing stakeholders and resolving conflicts, estimating size and scope, running retrospectives
Running team meetings, having 1x1s, giving continuous growth feedback, writing reviews, representing the team’s needs
Architecture, code review, refactoring; capturing DORA and productivity metrics, managing alert volume to prevent burnout

A lot of this work can be done by engineers, and often is. Every company distributes the load somewhat differently. This is a good thing! You don’t WANT an org where this work is only done by managers. You want individual contributors to help co-create the org and have a stake in how it gets run. Almost all of this work would be done more effectively by someone with an engineering background.

So you can understand why someone might hesitate to spend valuable headcount on engineering managers. Why wouldn’t you want everyone in engineering to be writing and shipping code as their primary job? Isn’t that by definition the best way to maximize productivity?

Ehhh… 😉

Engineering managers are a useful abstraction

In theory, you could make a list of all the tasks that need to be done to coordinate with other teams and have each item be picked up by a different person. In practice, this is impractical because then everybody would need to know about everything. One of the primary benefits of hierarchy, remember, is to reduce information overload. Intra-team communication should be high-bandwidth and fast, inter-team communication should be more sparse.

As the company scales, you can’t expect everybody to know everyone else; we need abstractions in order to function. A manager is the point of contact and representative for their team, and they serve as routers for important information.

Graph theory: for each of n engineers to be connected ("aligned") w each other you need n(n-1)/2 links, i.e. you need 780 interactions just to keep consistency. Without a few engineering managers (and arks) he is allowing velocity to build up tech debt at blazing speed.
— HurtingBrain (@HurtingBrain) January 5, 2024

I sometimes imagine managers as the nervous system of the company body, carrying around messages from one limb to another to coordinate actions. Centralizing many or most of these functions into one person lets you take advantage of specialization, as a manager builds relationships and context and improves at their role, and this massively reduces context switching for everyone else.

Manager calendars vs maker calendars

Engineering labor takes concentration and focus. Context switching is expensive, and too many interrupts can be fatal. Management labor consists of context switching every hour or so, and being available for interruptions throughout the day. These are two very different modes of being, headspaces, and calendar schedules, and do not coexist well.

In general, you want people to be able to spend most of their time working on things that contribute to the success of the outcomes they are directly responsible for. Engineers can only do so much glue work before their calendar turns into Swiss cheese and they can no longer deliver on their commitments. Since managers’ calendars are already Swiss cheese, it’s typically less disruptive for them to take on a larger share of glue labor.

It isn’t up to managers to do all the glue work, but it is a manager’s job to make sure that everything that needs to get done, does gets done. It is a manager’s job to try to line up every engineer with work that is interesting and challenging, but not overwhelming, and to ensure that unpleasant labor gets equitably distributed. It’s also a manager’s job to make sure that if we are asking someone to do a job, they are equipped with the resources they need to succeed at that job. Including time to focus.

Management is a tool for accountability

When you’re an engineer, you are responsible for the software you develop, deploy, and maintain. When you’re a manager, you are responsible for your team and the organization as a whole.

Management is one way of holding people accountable for specific outcomes (building teams with the right skills, relationships, and processes to make good decisions and build value for the company), and equipping them with the resources (budget, tools, headcount) to achieve those outcomes. If you aren’t making building the organization someone’s number one job, it won’t be anyone’s number one job, which means it probably won’t get done very well. And whose responsibility will that be, Mr. CEO?

There’s a real upper limit to what you can reasonably expect tech leads, or engineers, or anyone whose actual job is shipping software to do in their “spare time”. If you’re trying to hold your tech leads responsible for building healthy engineering teams, tools, and processes, you are asking them to do two calendarily incompatible jobs with only one calendar. The likeliest scenario is that they will focus on the outcomes they feel comfortable owning (the technical ones), while you pile up organizational debt in the background.

In natural hierarchies, we look up for purpose and down for function. That, in a nutshell, is the more complicated argument for why we need engineering managers.

Choose Boring technology Culture

The simpler argument is this: most engineering orgs have engineering managers. That’s the default. Lots of people much smarter than you or me have spent lots of time thinking and tinkering with org structures over the years, and this is what we’ve got.

As Dan McKinley famously said, we should “choose boring technology“. Boring doesn’t mean bad, it means the capabilities and failure conditions are well understood. You only ever get a few innovation tokens, so you should spend those wisely on core differentiators that could make or break your business. The same goes for culture. Do you really want to spend one of your tokens on org structure? Why??

For better or for worse, the hierarchical org structure is well understood. There are plenty of people on the job market who are proficient at managing or working with managers, and you can hire them. You can get training, coaching, or read a lot of self-help books. There are various management philosophies you can coalesce around or use to rule people out. On the other hand, the manager-free experiments I’m aware of (e.g. holacracy at Medium and GitHub, or “Choose Your Own Work” at Linden Lab) have all been quietly abandoned or outgrown. Not, in my experience, because leaders went mad for power, but due to chaos, lack of focus, and poor execution.

When there is no explicit structure or hierarchy, the result is not freedom and egalitarianism, it’s “informal, unacknowledged, and unaccountable leadership”, as famously detailed in “The Tyranny of Structureless“. In reality, sadly, these teams tend to be chaotic, fragile, and frustrating. I know! I’m pissed too! 😭

This argument doesn’t necessarily prove your CEO is wrong, but I should think his bar for proof is much higher than yours. “I don’t want any of my engineers to stop writing code” is not an argument. But I’m also feeling like I haven’t quite addressed the core question of productivity, so let’s pick that up again once more.

More lines of code != more productivity

To briefly recap: we were talking about an org with ~40 engineers, broken up into 10 small clusters of 3-4 engineers, each with a tech lead. Your CEO is arguing that you can’t afford to lose any velocity, which he thinks is what would happen if anyone stops writing code full time.

Maybe. But everything I have ever experienced leads me to believe that a fewer number of larger teams, each helmed by an experienced engineering manager, should way outperform this gaggle of tiny groups. It’s not even close. And they can do so in a way that’s more efficient, sustainable, and humane than this scrappy death march.

And systems thinking shows us why! With fewer groups, but larger ones, you have less overall management overhead, and much less of the slow and costly intra-group coordination. You unlock rich, dense knowledge transfer within groups, which gives you more shared coverage of the surface area. With 7-9 engineers per group you can build a real on call rotation, which means fewer heroics and less burnout. The coordination that you do need to do can be more strategic, less tactical, and much more forward-looking.

Would five big teams ship as many lines of code as 10 small teams, even if five engineers become managers and stop writing code? Probably, but who cares? Your customers give zero fucks how many lines of code you write. They care about whether you are building the right things and solving problems that matter to them. What matters is moving the business forward, not churning out code. Don’t forget, the act of churning out code creates costs and externalities in and of itself.

What defines your velocity is that you spend your time on the right things. Learning to make good decisions about what to build is something every organization has to work out for itself, and it is always an ongoing work in progress. Engineering managers don’t do all the work or make all the decisions, but they are absolutely fucking vital, in my experience, to ensuring that work happens and is done well. As I wrote in my last piece, engineering managers are the embodiment of the feedback loops that systems use to learn and improve.

Are managers ever unnecessary overhead?

Sure, absolutely. Management is about coordinating between teams and helping teams run more optimally, so anything that decreases your need for coordination also decreases your need for management. If you are a small company, or if you have really senior folks who are used to working together, you need a lot less coordination. The next most relevant factor is probably the rate of change; if you’re growing fast or have a lot of turnover, or if there’s a lot of time pressure or frequent shifts in strategy, your need for managers goes up. But there are plenty of smaller orgs out there that are doing just fine without a lot of formal management.

Look, I’m not a fan of the word “overhead”, because a) it’s kind of rude and b) people who call managers “overhead” are typically people who disrespect or do not value the craft of management.

But management is, in fact, overhead. 😅 So is a lot of other glue work! By which I mean the work is important, but does not itself move the business forward; we should do as much of it as absolutely necessary and no more. The nature of glue work is such that it too-easily expands to consume all available time and space (and then some). Constraints are good. Feeling a bit underresourced is good, and should be the norm. It is incredibly easy for management to get a bit bloated, and managers can be very loath to acknowledge this, because it’s not like they ever feel any less stressed or stretched.[*]

Management is also very much like operations work in that when it’s being done well, it’s invisible. Evaluating managers can be very hard, especially in the near term, and making decisions about when it’s time to create or pay down organizational debt is a whole nother ball of wax, and way outside the scope of this post.

But yes, managers can absolutely be unnecessary overhead.

However, if you have 40 engineers all reporting to one VP, and nobody else whose number one job is the outcomes related to people, teams and org, I feel pretty safe in saying this is not a risk for you at this time.

<3 charity

[*] In fact, the reverse can be true; bloated management can create MORE work for managers, and they may counterintuitively feel LESS stretched or stressed with a leaner org chart. Bureaucracies do tend to pick up a momentum all their own. Especially when management gets all wrapped up in promotions and egos. Which is yet another good reason to ensure that management is not a promotion or a form of domination.

Why On-Call Pain Is A Sociotechnical Problem

June 30, 2022January 10, 2024 mipsytipsycrossposted, devops, oncall, sociotechnicalLeave a comment

Cross-posted from leaddev.com

Most people hate being on call, because most on-call rotations are terrible.

Pager bombs, flappy alerts, false positives going off night and day, sleepless nights… Who can blame them? Small wonder that so many people develop a Pavlovian response to the sound of their Pagerduty ringtone. Alert goes off; adrenaline soars.

Conventional wisdom tells us that being on call means you put your whole life on hold, then spend all week lurching between firefighting and false alarms as you get progressively more sleep-deprived. It sucks, but that’s just what you get when you own your code in production. Right?

Noooooo. Wrong wrongy wrong wrong. Being on call should not be a constant cycle of things breaking down and firefighting, or alerts going off at all hours. This is not ‘normal.’ These are telltale signs of a fragile system and lack of alert discipline.

If on-call pain is a constant source of pain at your organization, that is a PROBLEM. It’s a five-alarm fire. You should drop what you’re doing and fix it with urgency.

An eternally miserable on-call rotation is a violation of the pact we make to support these systems:

It is engineering’s job to own their code in production.
It is management’s job to make sure it doesn’t suck.

This is a two-way handshake. If management isn’t holding up their end, if they don’t allocate enough time to fix the underlying problems – if they run a feature factory that never stops to refactor or invest in reliability work – then on-call will never get better, and you should leave.

On-call rotations are sociotechnical systems

On-call rotations are a classic example of a sociotechnical problem. A sociotechnical system consists of three elements: in this case that’s your production system, the people who operate it, and the tools they use to enact change on it.

You cannot solve sociotechnical problems with purely people solutions or with purely technical solutions. You need to use both.

The technical problems are usually easier to diagnose. You need to automate failovers, instrument your code, build and test repairing code, audit your indexes, etc. The social problems can be trickier to spot, but here’s a tip: they usually manifest as organizational problems.

Some engineers spend their entire career actively avoiding roles where they would have to be on call. Other engineers cling to the safety buffer of ops teams on call for their code, so that only manual escalations reach them.

Responsibility for your code is increasingly non-optional

This is becoming a harder line to hold, as the consensus has shifted decisively towards engineers owning their own code in production. Our systems are becoming exponentially more complex, and feedback loops are tightening. The people best equipped to own software in production are the people who built it. And in order to own it effectively, they need to close the loop by receiving the signal when something breaks.

But the point is not to invite software engineers into the same circle of hell that ops engineers have traditionally inhabited. This isn’t an act of vengeance. The point is that tightening these feedback loops is how we make systems better. Being on call shouldn’t have to destroy your social life or your sleep schedule.

Yes, engineering owns their software. But ensuring that engineering’s time is respected and their rest time valued is on management. It’s management’s job to make sure time is allocated to fixing recurring or known issues – and that they don’t kick the proverbial can down the road to later turn into tech debt. If reliability or productivity is suffering, managers need to reassign engineering cycles away from feature work. Managers’ performance should be evaluated by the four DORA metrics, as well as a fifth; how often is their team alerted outside of working hours?

It’s reasonable to be woken up two to three times a year when on call. But more than that is not okay. It’s management’s responsibility to ensure enough resources are dedicated to maintaining system stability, and they should be held accountable – not the on-call engineers.

Humans doing human things

We all have lives outside of work – families, doctor appointments, dentist visits, and so on. Instead of being surprised when things come up, we can predict the ways people’s lives will conflict with on-call duty and come up with ways to ease the burden.

Kids. I would never ask a new parent to be on call. Being woken up by ONE instrument of chaos is all anyone should ever have to cope with at any given time.
Sleepy brain. People are never going to be at their best when they are woken up in the middle of the night. We should make sure alert text, documentation, and steps are all clear, simple, and otherwise tuned for 2 a.m. brain fog.
Getting sleep. Sometimes people struggle getting back to sleep, or they were up all night dealing with something. Establish that 1) no one is EVER to be on call two nights in a row after a bad night, and 2) they are entitled to sleep in, come in late, leave early – whatever works best to help them catch up on their sleep.
Anxiety. I’ve managed people before who had high anxiety about being on call. They were perfectly willing, but it didn’t matter how quiet the pager was – their anxiety knowing it was on made it impossible to sleep. We tried it for a while, and it wasn’t getting better, so we ultimately found other ways for them to pull their weight.

If someone is absolutely unable to participate in on-call rotations, well, it happens. If it’s a temporary situation, you might want to let it go. But if it’s a permanent thing, like in the ‘anxiety’ example above, the team should address this by finding other ways for that person to do their share of maintenance work.

For example, they could be in charge of failed builds or maintain the dev environment. What matters is that 1) the team as a whole feels like it’s a fair distribution of labor, and 2) there are enough people left in the on-call rotation that no one is overly burdened.

Technical stumbling blocks

Un-owned code. Everything you support, and every alert that can fire, should have a team that owns it.
Conversely, you may have architectural issues that make it impossible to isolate and alert only the owners. If you have ten different on-call rotations for various areas of the code base, but any time the database gets slow all ten of you get paged, this is a bad situation.
SLOs. As you scale up, there will come a point where you can no longer alert on individual services or symptoms. They will simply drown you. At this point, you need to migrate your alerts over to Service Level Objectives. SLOs align your engineering pain directly with user pain.
Paging too early. Ah, this always sounds like such a great idea. ‘Wouldn’t it be great if we could catch it and alert someone before the users are impacted?’ But it’s not. It’s a recipe for flappy alerts and aggravation. Alert when users are impacted, not before.
Two lanes. You need two types of alerts: ‘WAKE ME UP’ and ‘Deal with this later.’ No more, no less. Keep the list of ‘wake me up’ alerts as short, crisp, and carefully curated. Put everything that needs to be dealt with ‘soon’ in the second lane, and have your on-call engineer sweep through it at the start of the day and the end of the day. If it doesn’t need to be acted upon in the next day, it probably shouldn’t be an alert.

On-call problems are often organizational problems

Sometimes people don’t want to be on call, and it’s not due to life events. This is a bit trickier to address because they are actually the result of organizational problems that present themselves as on-call problems. For example:

Tribal knowledge, or the ‘bus factor.’ You’re the debugger of last resort because you’ve been responsible for a mission-critical component of the system from the very beginning. The team tried training new people, but you still get called every time something goes wrong, and it’s not clear if the issue would be fixed if you weren’t available (or how long it would take them if they did).
Individual ownership vs. team ownership. Software is owned by teams, not by individuals. In an ideal world, this means everyone on the team is capable of debugging and maintaining all the systems they collectively own. In the real world, this means everything is at least understood by more than one engineer.
Too little – or too much – coverage. If you have three to four people on call, that’s too much of your life spent lugging around a laptop. Tossing all 20-30 engineers into a single rotation is also the wrong way to go; engineers won’t be on call often enough to stay familiar with the systems. The ideal on-call rotation has seven to eight people; five people is a bare minimum. With eight people, you are on call for a highly sustainable one week out of every two months.
Lower the barriers to asking for help, swapping times, covering for each other, etc. When someone asks for help with their on-call shift, thank them for asking. If the on-call shift isn’t that arduous, it’s really no big deal to back someone up for the duration of a movie.
Appointing primary/secondary on-call engineers can be really helpful here. Only the primary needs to get alerted and lug their laptop around, but they have a designated point person to tag if they need to run to the grocery store, drive through the boonies, or otherwise go offline for a while.
Put managers on call. I’m not generally a fan of putting managers in the rotation, but they really are the ideal backup situation. Especially when it comes to picking up the pager the day after someone has had a rough night. This serves multiple purposes: it helps keep the manager fresh, it exposes them to the reality of what on-call is currently like, and their time doesn’t have to be swapped for someone else’s.

The next time someone doesn’t want to be on call, it may be time to take a closer look at your organization as a whole to see whether the problem really is resource allocation, risk mitigation, or something else.

Making on-call costs tangible

On the topic of paying people more to be on call: there are loads of opinions here – it’s a very fraught topic. I generally come down on the side of ‘no, it’s part of the job,’ just like it is for doctors. With one big exception.

If you’re having a hard time getting upper management to understand the value of spending engineering cycles on the infrastructure and reliability work that needs to be done, instead of just cranking features… by all means, pay people for being on call.

Pay them for every event they have to respond to.

Pay them well.

Pay them so goddamn well the finance team starts squawking about the need to pay down that reliability debt.

If that’s the only way you can make it real for them, well, use the tools you’ve got. Engineers should never have to quietly suffer the pain of flaky software and unhappy users alone. Give management pain too until they take their jobs seriously enough to see that reliability issues get fixed.

charity.wtf

charity wtf's about technology, databases, startups, engineering management, and whiskey.

sociotechnical