Why On-Call Pain Is A Sociotechnical Problem

Cross-posted from leaddev.com

Most people hate being on call, because most on-call rotations are terrible.

Pager bombs, flappy alerts, false positives going off night and day, sleepless nights… Who can blame them? Small wonder that so many people develop a Pavlovian response to the sound of their Pagerduty ringtone. Alert goes off; adrenaline soars.

Conventional wisdom tells us that being on call means you put your whole life on hold, then spend all week lurching between firefighting and false alarms as you get progressively more sleep-deprived. It sucks, but that’s just what you get when you own your code in production. Right?

Noooooo. Wrong wrongy wrong wrong. Being on call should not be a constant cycle of things breaking down and firefighting, or alerts going off at all hours. This is not ‘normal.’ These are telltale signs of a fragile system and lack of alert discipline.

If on-call pain is a constant source of pain at your organization, that is a PROBLEM. It’s a five-alarm fire. You should drop what you’re doing and fix it with urgency.

An eternally miserable on-call rotation is a violation of the pact we make to support these systems:

  1. It is engineering’s job to own their code in production.
  2. It is management’s job to make sure it doesn’t suck.

This is a two-way handshake. If management isn’t holding up their end, if they don’t allocate enough time to fix the underlying problems – if they run a feature factory that never stops to refactor or invest in reliability work – then on-call will never get better, and you should leave.

On-call rotations are sociotechnical systems

On-call rotations are a classic example of a sociotechnical problem. A sociotechnical system consists of three elements: in this case that’s your production system, the people who operate it, and the tools they use to enact change on it.

You cannot solve sociotechnical problems with purely people solutions or with purely technical solutions. You need to use both.

The technical problems are usually easier to diagnose. You need to automate failovers, instrument your code, build and test repairing code, audit your indexes, etc. The social problems can be trickier to spot, but here’s a tip: they usually manifest as organizational problems.

Some engineers spend their entire career actively avoiding roles where they would have to be on call. Other engineers cling to the safety buffer of ops teams on call for their code, so that only manual escalations reach them.

Responsibility for your code is increasingly non-optional

This is becoming a harder line to hold, as the consensus has shifted decisively towards engineers owning their own code in production. Our systems are becoming exponentially more complex, and feedback loops are tightening. The people best equipped to own software in production are the people who built it. And in order to own it effectively, they need to close the loop by receiving the signal when something breaks.

But the point is not to invite software engineers into the same circle of hell that ops engineers have traditionally inhabited. This isn’t an act of vengeance. The point is that tightening these feedback loops is how we make systems better. Being on call shouldn’t have to destroy your social life or your sleep schedule.

Yes, engineering owns their software. But ensuring that engineering’s time is respected and their rest time valued is on management. It’s management’s job to make sure time is allocated to fixing recurring or known issues – and that they don’t kick the proverbial can down the road to later turn into tech debt. If reliability or productivity is suffering, managers need to reassign engineering cycles away from feature work. Managers’ performance should be evaluated by the four DORA metrics, as well as a fifth; how often is their team alerted outside of working hours?

It’s reasonable to be woken up two to three times a year when on call. But more than that is not okay. It’s management’s responsibility to ensure enough resources are dedicated to maintaining system stability, and they should be held accountable – not the on-call engineers.

Humans doing human things

We all have lives outside of work – families, doctor appointments, dentist visits, and so on. Instead of being surprised when things come up, we can predict the ways people’s lives will conflict with on-call duty and come up with ways to ease the burden.

  • Kids. I would never ask a new parent to be on call. Being woken up by ONE instrument of chaos is all anyone should ever have to cope with at any given time.
  • Sleepy brain. People are never going to be at their best when they are woken up in the middle of the night. We should make sure alert text, documentation, and steps are all clear, simple, and otherwise tuned for 2 a.m. brain fog.
  • Getting sleep. Sometimes people struggle getting back to sleep, or they were up all night dealing with something. Establish that 1) no one is EVER to be on call two nights in a row after a bad night, and 2) they are entitled to sleep in, come in late, leave early – whatever works best to help them catch up on their sleep.
  • Anxiety. I’ve managed people before who had high anxiety about being on call. They were perfectly willing, but it didn’t matter how quiet the pager was – their anxiety knowing it was on made it impossible to sleep. We tried it for a while, and it wasn’t getting better, so we ultimately found other ways for them to pull their weight.

If someone is absolutely unable to participate in on-call rotations, well, it happens. If it’s a temporary situation, you might want to let it go. But if it’s a permanent thing, like in the ‘anxiety’ example above, the team should address this by finding other ways for that person to do their share of maintenance work.

For example, they could be in charge of failed builds or maintain the dev environment. What matters is that 1) the team as a whole feels like it’s a fair distribution of labor, and 2) there are enough people left in the on-call rotation that no one is overly burdened.

Technical stumbling blocks

  • Un-owned code. Everything you support, and every alert that can fire, should have a team that owns it.
  • Conversely, you may have architectural issues that make it impossible to isolate and alert only the owners. If you have ten different on-call rotations for various areas of the code base, but any time the database gets slow all ten of you get paged, this is a bad situation.
  • SLOs. As you scale up, there will come a point where you can no longer alert on individual services or symptoms. They will simply drown you. At this point, you need to migrate your alerts over to Service Level Objectives. SLOs align your engineering pain directly with user pain.
  • Paging too early. Ah, this always sounds like such a great idea. ‘Wouldn’t it be great if we could catch it and alert someone before the users are impacted?’ But it’s not. It’s a recipe for flappy alerts and aggravation. Alert when users are impacted, not before.
  • Two lanes. You need two types of alerts: ‘WAKE ME UP’ and ‘Deal with this later.’ No more, no less. Keep the list of ‘wake me up’ alerts as short, crisp, and carefully curated. Put everything that needs to be dealt with ‘soon’ in the second lane, and have your on-call engineer sweep through it at the start of the day and the end of the day. If it doesn’t need to be acted upon in the next day, it probably shouldn’t be an alert.

On-call problems are often organizational problems

Sometimes people don’t want to be on call, and it’s not due to life events. This is a bit trickier to address because they are actually the result of organizational problems that present themselves as on-call problems. For example:

  • Tribal knowledge, or the ‘bus factor.’ You’re the debugger of last resort because you’ve been responsible for a mission-critical component of the system from the very beginning. The team tried training new people, but you still get called every time something goes wrong, and it’s not clear if the issue would be fixed if you weren’t available (or how long it would take them if they did).
  • Individual ownership vs. team ownership. Software is owned by teams, not by individuals. In an ideal world, this means everyone on the team is capable of debugging and maintaining all the systems they collectively own. In the real world, this means everything is at least understood by more than one engineer.
  • Too little – or too much – coverage. If you have three to four people on call, that’s too much of your life spent lugging around a laptop. Tossing all 20-30 engineers into a single rotation is also the wrong way to go; engineers won’t be on call often enough to stay familiar with the systems. The ideal on-call rotation has seven to eight people; five people is a bare minimum. With eight people, you are on call for a highly sustainable one week out of every two months.
  • Lower the barriers to asking for help, swapping times, covering for each other, etc. When someone asks for help with their on-call shift, thank them for asking. If the on-call shift isn’t that arduous, it’s really no big deal to back someone up for the duration of a movie.
  • Appointing primary/secondary on-call engineers can be really helpful here. Only the primary needs to get alerted and lug their laptop around, but they have a designated point person to tag if they need to run to the grocery store, drive through the boonies, or otherwise go offline for a while.
  • Put managers on call. I’m not generally a fan of putting managers in the rotation, but they really are the ideal backup situation. Especially when it comes to picking up the pager the day after someone has had a rough night. This serves multiple purposes: it helps keep the manager fresh, it exposes them to the reality of what on-call is currently like, and their time doesn’t have to be swapped for someone else’s.

The next time someone doesn’t want to be on call, it may be time to take a closer look at your organization as a whole to see whether the problem really is resource allocation, risk mitigation, or something else.

Making on-call costs tangible

On the topic of paying people more to be on call: there are loads of opinions here – it’s a very fraught topic. I generally come down on the side of ‘no, it’s part of the job,’ just like it is for doctors. With one big exception.

If you’re having a hard time getting upper management to understand the value of spending engineering cycles on the infrastructure and reliability work that needs to be done, instead of just cranking features… by all means, pay people for being on call.

Pay them for every event they have to respond to.

Pay them well.

Pay them so goddamn well the finance team starts squawking about the need to pay down that reliability debt.

If that’s the only way you can make it real for them, well, use the tools you’ve got. Engineers should never have to quietly suffer the pain of flaky software and unhappy users alone. Give management pain too until they take their jobs seriously enough to see that reliability issues get fixed.

Why On-Call Pain Is A Sociotechnical Problem

Advice for Engineering Managers Who Want to Climb the Ladder

We have been interviewing and hiring a pile of engineering directors at Honeycomb lately. In so doing, I’ve had some fascinating conversations with engineering managers who have been trying unsuccessfully to make the leap to director.

Here is a roundup of some of the ideas and advice I shared with them, and the original twitter thread that spawned this post.

What is an engineering director?

Given all the title inflation and general inconsistency out there, it seems worth describing what I have in mind when I say or hear “Engineering Director.”

In a traditional org chart, an engineering manager usually manages about 5-8 engineers, an engineering director manages 2-5 engineering managers, and a VP of engineering manages the directors. (At big companies, you may see managers and directors reporting to other managers and directors, and/or you may find a bunch of ‘title padding’ roles like Senior Manager, Senior Director etc.)

In smaller companies, it’s common to have a “Head of Engineering” (this is an appropriately weaselly title that commands just the right amount of respect while leaving plenty of space to hire additional people below or above them). Or all of engineering might roll up to a director or VP or CTO. It varies a lot.

When it comes to the work a director is expected to do, though, there’s a fair bit of consistency: we expect managers to manage ICs, and directors to manage managers.  Directors sit between the line managers and the strategic leadership roles. (More on this later.)

So if you’re an engineering manager, and you want to try being a director, the first thing you’ll want to understand is this: it is generally better to get there by being promoted than by getting offered a director title at a different company.

How to level up

Lots of engineers get tapped by their management to become managers, but not many become directors without a conversation and some intentional growth first. This means that for many of you, trying to become a director may be the first time you have ever consciously solicited a role outside the interview process. This can bring up feelings of awkwardness, even shamefulness or inappropriateness. You’ll just have to push through those.

If you ever want a job in upper leadership, you are going to have to learn how to shamelessly state your career goals. We want people in senior leadership who want to be there and are honing their skills in anticipation of an opportunity. Not “oops, I accidentally a VP.”

It is better to get promoted than hired up a level

There are a few reasons for this. It’s usually easier to get promoted than to get hired straight into a job you’ve never held before (at a org with high standards), and it also tends to be more sustainable/more likely to succeed if you get promoted in as well. Being a director is NOT just being a super-duper manager, it’s a different role and function entirely.

A lot of your ability to be successful as a director (or any kind of manager) comes from knowing the landscape, the product and the people, and having good relationships internally. When you are internally promoted, you already know the company and the people, so you get a leg up towards being successful. Whereas if you’ve just joined the company and are trying to learn the tech, the people, the relationships, and how the company works all at once, on top of trying to perform a new role for the first time.. well, that is a lot to take on at once.

There are exceptions, sure! Oodles of them[1]. But I would frankly look sideways at a place that wanted to hire me as a director if I haven’t been one, or hadn’t at least managed managers before. It’s at least a yellow flag. It tells me they are probably either a) very desperate or b) very sloppy with handing out titles.

If you want a promotion…

The obvious first step involves asking your manager, “what is the skill gap for me between the job I am doing right now and a director role?” Unlike in the movies, promotions don’t usually get surprise-dropped on people’s heads; people are usually cultivated for them. Registering your interest makes it more likely they will consider you, or help you develop skills in that direction as time moves on.

If you have a good manager who believes in you, and the opportunities exist at your company, that might even be all you have to do.(!)

And if so, lucky you. But as for the remaining ~80-90% of us (ha!) … well, we’ll need a bit more hustle.

Take inventory of your opportunities

Lots of companies aren’t large enough to need directors, or growing fast enough to create new opportunities. This can actually be the most challenging part of the equation, because there are generally a lot more managers who want to be directors than there are available openings.

If you do need to find a new job to reach your career goals, I would target fast-growing companies with at least 100 engineers. If you’re evaluating prospective employers based on your chance of advancement, consider the following::

  • Ask about their policies on internal vs external hires. Do they give preference to existing employees? How do they decide when to recruit vs grow from within?
  • Ask about the last time that someone was promoted into a similar role.
  • Tell the recruiter and hiring manager about your career goals. Don’t be shy. “My next career goal is to gain some experience managing managers” is fine. (That shouldn’t be the only reason you’re interested, of course.)
  • Size up the playing field. Is there oxygen at that level? Or are there a dozen other managers senior to you lined up for the same shot?

There are no sure bets. But you can do a lot to put yourself in the right place at the right time, signal your interest, and be prepared for the opportunity when it strikes.

a director is not a ‘super-senior manager’

A director is not just a manager on steroids: it is an entirely different job. It helps to have been a good manager before becoming a director, because many management skills will translate, but others will be entirely new to you. Expect this.

How being a good director is different from being a good manager

Let’s look at some of the ways that being a good engineering manager is different than being a good director.

  1. You can be a great EM, beloved by your team, without giving much thought to managing out or up. Directors cannot. If anything, it’s the opposite. You may get away with not coddling your EMs, but you must pull your weight for your peers and upper management.
  2. You can have a bit of a reputation for being stubborn or difficult as an EM, and that can be just fine. But having such a rep will probably sabotage your attempt at being promoted to director.
  3. You can be a powerful technical EM who sometimes jumps in to train engineers, be on call, or course correct technical and architectural decisions. This can even burnish your value and reputation as an EM. But this would all be a solid knock against you as a director.

Managers can get away with being opinionated and attached to technology, to some extent, while directors absolutely must balance lots of different stakeholders to achieve healthy business outcomes.

This difference of perspective is why managers will sometimes sniff about directors having sold out, or being “all about politics.”

(Blaming something on “politics” is usually a way of accidentally confessing that you don’t actually understand the constraints someone is operating under, IMO.)

A director’s job is running the business

Here’s the key fact: ✨directors run the business✨.

Managers should be focused on high-performing engineering teams. VPs should be focused on strategy and the longer term. Directors are the execution machines that knit technology with business objectives. (I like this piece, although the lede is a little buried. Key graf:)

managers, directors, VPs

Directors run the business. They are accountable for results. You can’t be bopping in and writing or reviewing code, or tossing off technical opinions. That’s not your job anymore.

Managing managers is a whole new skill set

The distance between managing engineers and managing managers is nearly as vast as the gulf between being an engineer and being a manager.

But it’s sneakier, because you don’t feel out of your depth as much as you did when you became a manager. 😁

As a manager, each of us instinctively draws on our own unique blend of strength and charisma — whatever it is that makes people look up to you and willing to accept your influence. Most of us can’t explain how we do it, because we run on instinct.

But as a director, you have to figure it out. Because you need to be able to debug it when the magic breaks down. You need to help your managers influence and lead using *their* unique strengths. What works for you won’t work for them. You have to learn how to unpack different leadership styles and support them in the way they need.

If you’re working towards a director role:

There are lots of areas where you can improve your director skills and increase your chances of being viewed as director material without any help whatsoever from your manager.

You ✨can not✨ be a blocker

Directors run the business … so you CANNOT be seen as a blocker. People must come to you of their own accord to get shit done and break through the blockers.

If they are going to other people for advice on how to break through YOU, you are not a good candidate for director. Figure out how to fix this before you do anything else.

Demonstrate impact beyond your team(s)

Another way to make yourself an attractive prospect for director is to work on systemic problems, driving impact at the org or company level. You could:

  • work to substantially increase the diversity of your teams or your candidate pipeline, and offer to work with recruiting and other managers to help them do the same (becoming BFFs with recruiting is often a canny move)
  • drive some cross-platform initiative to consolidate dozens of snowflake deploy processes and significantly reduce CI/CD build/deploy times, set an internal SLO for artifact build times, or successfully champion auto-deployment
  • champion an internal tools team with a mandate to increase developer productivity, and quantify the hell out of it
  • lead a revamp of the new hire onboarding process. Or add training and structure to the interview process and set an SLO of responding to every candidate within one week

I dunno — it all depends on what’s broken at your company. 🙃 Identify something causing widespread pain and frustration at the organizational level and fix it. 

Managing ‘up’ is not a ‘nice-to-have’…

If there’s a problem, make sure you are the one to bring it to your manager (and swiftly), along with “Here is the context, here’s where I went wrong, and this is what I’m planning to do about it.” No surprises.

At this point in your career, you should have mastered the art of not being a giant pain in the ass to your manager. Nobody wants a high-maintenance director. Do you reliably make problems go away, or do they boomerang back five times worse after you “fix” them?

…Neither is managing ‘out’

Managing “out” is important too. (Not “managing out”, which means terminating people from the company, but managing “out” as in horizontally, meaning your relationship with your peers.)

What do your peers think of you? Do you invest in those relationships? Do they see you as an ally and a source of wise counsel, or a source of chaos, gossip and instability, or a competitor with turf to protect? If you’re the manager that other managers seek out for a peer check, you might be a good candidate for director.

psst.. People are watching you

One of the most uncomfortable things to internalize if you climb the ladder is how much people will make snap judgments about you based on the tiniest fragments of information about you, and how those judgments may forever color the way they think of you or interact with you.

First impressions might be made by ten minutes together on the same zoom call…a few overheard fragments of people talking about you…even the expressions on your face as they pass you in the hallway. People will extrapolate a lot from a very little, and changing their impression of you later is hard work.

(Yes it’s frustrating, but you can’t really get upset about it, because you and I do it too. It’s part of being human. )

Because of that, you really do have to guard against being too cranky, too tired, or out of spoons. People WILL take it personally. It WILL come back to hurt you.

Remember, you don’t hear most feedback. If you visibly disagree with someone, assume 10x as many silently agree with them. If one person gives you a piece of hard feedback, assume 10x as many will never tell you. Be grateful. The more power you are perceived to have, the less feedback you will ever hear.

Pro tip

You can infer a surprising amount about how good a director candidate may be at their job, simply by listening closely to how they talk about their colleagues. Do they complain about being misunderstood or mistreated, do they minimize the difficulty or quality of others’ work, do they humblebrag, or do they take full responsibility for outcomes? And does their empathy fully extend to their peers in other departments, like sales and marketing?

Does it sound like they enjoy their work, and look forward to beginning it every day? Does it sound like they are all in the same little tugboat, all pulling in the same direction, or is there a baseline disconnect and lack of trust?

In conclusion…

Be approachable, be a drama dampener, project warmth. Control your calendar and carve out regular focus time. Guard your energy — never run your engine under 30%, and always leave something in the tank.

There are a lot more great responses and advice in the replies to my thread, btw. Go check them out if you’re interested.. and if you have something to say, contribute!.☺️

charity

Footnotes:

[1] Occasionally, it may work out to your benefit to jump into a new, higher title at a new company. This can happen when someone is already well qualified for the higher role, but is finding it difficult to get promoted (possibly due to insufficient opportunity or systemic biases). Just be aware that the job you were hired into is likely to be one where the titles are meaningless and/or the roles are chaotic. You may want to stay just long enough to get the title, then bounce to a healthier org.

Advice for Engineering Managers Who Want to Climb the Ladder