Questionable Advice: “How can I drive change and influence teams…without power?”

Last month I got to attend GOTO Chicago and give a talk about continuous deployment and high-performing teams. Honestly I did a terrible job, and I’m not being modest. I had just rolled off a delayed redeye flight; I realized partway through that I had the wrong slides loaded, and my laptop screen was flashing throughout the talk, which was horribly distracting and meant I couldn’t read the speaker notes or see which slide was next. 😵 Argh!

Anyway, shit happens. BUT! I got to meet some longstanding online friends and acquaintances (hi JJ, Avdi, Matt!) and got to eat some of Hillel Wayne’s homemade chocolates, and the Q&A session afterwards was actually super fun.

My talk was about what high-performing teams look like and why it’s so important to be on one (spoiler: because this is the #1 way to become a radically better engineer!!). Most of the Q&A topics therefore came down to some version of “okay, so how can I help my team get there?” These are GREAT questions, so I thought I’d capture a few of them for posterity.

But first… just a reminder that the actual best way to persuade people to listen to you is to make good decisions and display good judgment. Each of us has an implicit reputation score, which formal power can only overcome to an extent. Even the most junior engineer can work up a respectable reputation over time, and even principal engineers can fritter theirs away by shooting off at the mouth. 🥰

“how can I drive change when I have no power or influence?”

This first question came from someone who had just landed their first real software engineering job (congrats!!!):

“This is my first real job as a software engineer. One other junior person and myself just formed a new team with one super-senior guy who has been there forever. He built the system from scratch and knows everything about it. We keep trying to suggest ideas like the things you talked about in your talk, but he always shoots us down. How can we convince him to give it a shot?”

Well, you probably can’t. ☺️ Which isn’t the end of the world.

If you’re just starting to write software every day, you are facing a healthy learning curve for the next 3-5 years. Your one and only job is to learn and practice as much as you possibly can. Pour your heart and soul into basic skills acquisition, because there really are no shortcuts. (Please don’t get hooked on chatGPT!!)

I know that I came down hard in my talk on the idea that great engineers are made by great teams, and that the best thing most people can do for their career is to join a high-performing, fast-moving team. There will come a time when this is true for you too, but by then you will have skills and experience, and it will be much easier for you to find a new job, one with a better culture of learning.

It is hard to land your first job as a software engineer. Few can afford to be picky. But as long as you are a) writing code every day, b) debugging code every day, and c) getting good feedback via code reviews, this job will get you where you need to go. When you’re fluent and starting to mentor others, or getting into higher level architecture work, or when you’re starting to get bored … then it’s time to start looking for roles with better teachers and a more collaborative team, so your growth doesn’t stall. (Please don’t fall into the Trap of the Premature Senior.)

This is an apprenticeship industry. You’re like a med student right now, who is just starting to do rounds under the supervision of an attending physician (your super-senior engineer). You can kinda understand why he isn’t inclined to listen to your opinions on his choice of stethoscope or how he fills out a patient chart. A better teacher would take time to listen and explain, but you already know he isn’t one. 🤷

I only have one piece of advice. If there’s something you want to try, and it involves doing engineering work, consider tinkering around and building it after hours. It’s real hard to say no to someone who cares enough to invest their own time into something.

“how can I drive change when I am a tech lead on a new team?”

“I have the same question! — except I’m a tech lead, so in theory I DO have some power and influence. But I just joined a new team, and I’m wondering what the best way is to introduce changes or roll them out, given that there are soooo many changes I’d like to make.”

(I wrote a somewhat scattered post a few years ago on engineers and influence, or influence without authority, which covers some related territory.)

As a tech lead who is new to a team, bursting at the seams with changes I want to make, here’s where I’d start:

  1. Understand why things are the way they are and get to know the personalities on your team a bit before you start pitching changes. (UNLESS they are coming to you with arms outstretched, pleading desperately for changes ~fast~ because everything is on fire and they know they need help. This does happen!)
  2. Spend some time working with the old systems, even if you think you already understand. It’s not enough for you to know; you need to take the team on this journey with you. If you expect your changes to be at all controversial, you need to show that you respect their work and are giving it a chance.
  3. Change one thing at a time, and go for the developer experience wins first. Address things that will visibly pay off for your team in terms of shipping faster, saving time, and reducing frustration. You have no credibility in the beginning, so you want to start racking up wins before you take on the really hard stuff.
  4. Roll up your sleeves. Nothing buys a leader more goodwill than being willing to do the scut work. Got a flaky test suite that everybody has been dreading trying to fix? I smell opportunity…
  5. Pitch it as an experiment. If people aren’t sold on your idea for e.g. code review SLAs, ask if they’d be willing to try it out for three weeks just as an experiment.
  6. Strategically shop it around to the rest of the team, if you sense there will be resistance…

At this point in my answer 👆 I outlined a technique for persuading a team and building support for a plan or an idea, especially when you already know it’s gonna be an uphill battle. Hillel Wayne said I should write it up in a blog post, so here it is! (I’ll do anything for free chocolate 😍)

“How can I get people on board with my controversial plan?”

So you have a great idea, and you’re eager to get started. Awesome!!! You believe it’s going to make people’s lives better, even though you know you are going to have to fight tooth and nail to make it happen.

What NOT to do:

Walk into the team meeting and drop your bomb idea on everyone cold:

“Hey, I think we should stop shipping product changes until we fix our build pipeline to the point where we can auto-deploy each merge set to production, one at a time, in under an hour.” ~ (for example)

…. then spend the rest of the hour grappling with everybody’s thoughts, feelings, and intense emotional reactions, before getting discouraged and slinking away, vowing to never have another idea, ever again.

What to do instead:

Suss out your audience. Who will be there? How are they likely to react? Are any of them likely to feel especially invested in the existing solution, maybe because they built it? Are any of them known for their strong opinions or being combative?

Great!!! Your first move is to have a conversation with each of them. Approach them in the spirit of curiosity, and ask what they think of your idea. Talking with them will also help you hash out the details and figure out if it is actually a good idea or not.

Your goal is to make the rounds, ask for advice, identify any allies, and talk your idea through with anybody who is likely to oppose you…before the meeting where you intend to unveil your plan. So that when that happens, you have:

  1. given people the chance to process their reactions and ask questions in private
  2. ensured that key people will not feel surprised, threatened, or out of the loop
  3. already heard and discussed any objections
  4. ideally, you have earned their support!

Even if you didn’t manage to convince every person, this was still a valuable exercise. By approaching people in advance, you are signaling that you respect them and their voice matters. You are always going to get people’s absolute worst reactions when you spring something on them in a group setting; any anxiety or dismay will be amplified tenfold. By letting them reflect and ask questions in private, you’re giving time for their better selves to emerge.

What to do instead…if you’re a manager:

As an engineer or a tech lead, you sometimes end up out front and visible as the owner of a change you are trying to drive. This is normal. But as a manager, there are far more times when you need to influence the group but not be the leader of the change, or when you need to be wary of sounding like you are telling people what to do. These are just a few of the many reasons it can be highly effective to have other people arguing on your behalf.

In the ideal scenario, particularly on technical topics, you don’t have to push for anything. All you do is pose the question, then sit back and listen as vigorous debate ensues, with key stakeholders and influential engineers arguing for your intended outcome. That’s a good sign that not only are they convinced, they feel ownership over the decision and its execution. This is the goal! 🌈

It’s not just about persuading people to agree with you, either. Instead of having a shitty dynamic where engineers are attached to the old way of doing things and you are “dragging them” into the newer ways against their will, you are inviting them to partner with you. You are offering them the opportunity to lead the team into the brave new world, by getting on board early.

(It probably goes without saying, but always start with the smallest relevant group of stakeholders, and not, say, all of engineering, or a group that has no ownership over the given area. 🙃 And … even this strategy will stop working rather quickly, if your controversial ideas all turn out to be disastrous. 😉)

“How do I know where to even start?!? 😱”

Before I wrap up, I want to circle back to the question from the tech lead about how to drive change on a team when you do have some influence or power. He went on to say (or maybe this was from a third questioner?*):

“There is SO MUCH I’d like to do or change with our culture and our tech stack. Where can I even start??”

Yeah, it can be pretty overwhelming. And there are no universal answers… as you know perfectly well, the answer is always “it depends.” ☺️ But in most cases you can reduce the solution space substantially to one of the two following starting points.

1. Can you understand what’s going on in your systems? If not, start with observability.

It doesn’t have to be elegant or beautiful; grepping through shitty text logs is fine, if it does the trick. But do any of the following make you shudder in recognition?:

  • If I get paged, I might lose the rest of the afternoon trying to figure out what happened
  • Our biggest problem is performance and we don’t know where the time is going
  • We have a lot of flaky, flappy alerts, and unexplained outages that simply resolve themselves without our ever truly understanding what happened.

If you can’t understand what’s going on in your system, you have to start with instrumentation and observability. It’s just too deadly, and too risky, not to. You’re going to waste a ton of time stabbing around in the dark trying to do anything else without visibility. Put your glasses on before you start driving down the freeway, please.
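If “start with instrumentation” feels abstract, here is a minimal sketch of one way in: emit a single wide, structured event per request, with everything you might want to ask about later packed into it. This is illustrative only; the field names (user_id, build_id, rows_exported) are made up, and in practice you would ship these events to whatever observability tool you use instead of stdout.

```python
import json
import sys
import time
import uuid
from contextlib import contextmanager

@contextmanager
def request_event(route, **fields):
    """Emit one wide, structured event per request: every field you might
    want to query later goes into the same JSON blob."""
    event = {"request_id": str(uuid.uuid4()), "route": route, **fields}
    start = time.monotonic()
    try:
        yield event                      # handler code can add more context
        event.setdefault("status", 200)
    except Exception as exc:
        event["status"] = 500
        event["error"] = repr(exc)
        raise
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
        print(json.dumps(event), file=sys.stdout)   # or ship to your event store

# Usage: wrap the work a request does, enriching the event as you go.
with request_event("/export", user_id="u_123", build_id="2024.06.01") as ev:
    rows = list(range(25_000))           # stand-in for the real work
    ev["rows_exported"] = len(rows)
```

One rich event per request beats a pile of disconnected log lines, because you can slice it by any field later instead of grepping for whatever you happened to print.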

2. Can you build, test and deploy software in under an hour? If not, start with your deploy pipeline.

Specifically, the interval of time between when the code is written and when it’s being used in production. Make it shorter, less flaky, more reliable, more automated. This is the feedback loop at the heart of software engineering, which means that it’s upstream from a whole pile of pathologies and bullshit that creep in as a consequence of long, painful, batched-up deploys.
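To make that interval concrete, measure it. Here is a rough sketch (not anyone’s official tooling) that computes commit-to-production lead time from a few hypothetical deploy records; in real life you would pull the timestamps from git and from your deploy system.

```python
from datetime import datetime
from statistics import median

# Hypothetical deploy records: when each change was committed vs. when it
# went live. These timestamps are invented for illustration.
deploys = [
    {"sha": "a1b2c3d", "committed": "2023-06-01T14:02:00+00:00", "live": "2023-06-01T14:41:00+00:00"},
    {"sha": "d4e5f6a", "committed": "2023-06-01T16:30:00+00:00", "live": "2023-06-02T09:12:00+00:00"},
    {"sha": "b7c8d9e", "committed": "2023-06-02T10:05:00+00:00", "live": "2023-06-02T10:47:00+00:00"},
]

def lead_time_minutes(record):
    committed = datetime.fromisoformat(record["committed"])
    live = datetime.fromisoformat(record["live"])
    return (live - committed).total_seconds() / 60

times = sorted(lead_time_minutes(d) for d in deploys)
print(f"median lead time: {median(times):.0f} min")
print(f"worst lead time:  {times[-1]:.0f} min")
print("every deploy under an hour?", all(t <= 60 for t in times))
```

If the worst-case number makes you wince, that is exactly the feedback loop to shorten first.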

Here’s a talk I’ve given a few times on why this matters so much:

You pretty much can’t go wrong with either of those two; your lives will materially improve as you make progress. And the iterative process of doing them will uncover a great deal of shit you should probably know about.

Cheers! 🥂

charity.

* My apologies if I remembered anyone’s question inaccurately!


Architects, Anti-Patterns, and Organizational Fuckery

I recently wrote a twitter thread on the proper role of architects, or as I put it, tongue-in-cheek-ily, whether or not architect is a “bullshit role”.

It got a LOT of reactions (2.5 weeks later, the thread is still going!!), which I would sort into roughly three camps:

  1. “OMG this resonates; this matches my experiences working with architects SO MUCH”,
  2. “I’m an architect, and you’re not wrong”, and
  3. “I’m an architect and I hate you.”

Some of your responses (in all three categories!) were truly excellent and thought-provoking. THANK YOU — I learned a ton. I figured I should write up a longer, more readable, somewhat less bombastic version of my original thread, featuring some of my favorite responses.

Where I’m Coming From

Just to be clear, I don’t hate architects! Many of the most brilliant engineers I have ever met are architects.

Nor do I categorically believe that architects should not exist, especially after reading all of your replies. I received some interesting and compelling arguments for the architect role at larger enterprises, and I have no reason to believe they are not true.

Also, please note that I personally have never worked at a company with “architect” as a role. I have also never worked anywhere but Silicon Valley, or at any company larger than Facebook. My experiences are far from universal. I know this.

Let me get suuuuuper specific here about what I’m reacting to:

  • When I meet a new “architect”, they tend toward the extremes: either world class and amazing or useless and out of touch, with precious little middle ground.
  • When I am interviewing someone whose last job title was “architect”, they often come from long tenured positions, and their engineering skills are usually very, very rusty. They often have a lot of detailed expertise about how their last company worked, but not a lot of relevant, up-to-date experience.
  • Because of 👆, when I see “architect” on a job ladder, I tend to feel dubious about that org in a way I do not when I see “staff engineer” or “principal engineer” on the ladder.

What I have observed is that the architect role tends to be the locus of a whole mess of antipatterns and organizational fuckery. The role itself can also be one that does not set up the people who hold it for a successful career in the long run, if they are not careful. It can be a one-way street to being obsolete.

I think that a lot of companies are using some of their best, most brilliant senior engineers as glorified project manager/politicians to paper over a huge amount of organizational dysfunction, while bribing them with money and prestige, and that honestly makes me pretty angry. 😡

But title is not destiny. And if you are feeling mad because none of what I’ve written applies to you, then I’m not writing about you! Live long and prosper. 🖖

Architect Anti-patterns and fuckery

There is no one right way to structure your org and configure your titles, any more than there is any one right way to architect your systems and deploy your services. And there is an eternal tension between centralization and specialization, in roles as well as in systems.

Most of the pathologies associated with architects seem to flow from one of two originating causes:

  1. unbundling decision-making authority from responsibility for results, and
  2. design becoming too untethered from execution (the “Frank Gehry” syndrome)

But it’s only when being an architect brings more money and prestige than engineering that these problems really tend to solidify and become entrenched.

Skin In The Game

When that happens, you often run into the same fucking problem with architects and devs as we have traditionally seen with devs and ops. Only instead of “No, I can’t be on call or get woken up, my time is far too valuable, too busy writing important software”, the refrain is, “No, I can’t write software or review code, my time is far too valuable, I’m much too busy telling other people how to do their jobs.”

This is also why I think calling the role “architect” instead of “staff engineer” or “principal engineer” may itself be kind of an anti-pattern. A completely different title implies that it’s a completely different job, when what you really want, at least most of the time, is an engineer performing a slightly different (but substantially overlapping) set of functions than a senior engineer.

My core principle here is simple: only the people responsible for building software systems get to make decisions about how those systems get built. I can opine all I want on your architecture or ours, but if I’m not carrying a pager for you, you should probably just smile politely and move along.

Technical decisions should ultimately be made by the people who have to live with the consequences. But good architects will listen to those people, and help co-create architectural decisions that take into account local, domain, and enterprise perspectives (a Katy Allred quote).

Architecture is a core engineering skill

When you make architecture “someone else’s problem” and scrap the expectation that it is a core skill, you get weaker engineers and worse systems.

Learning to see the forest as well as the trees, and to factor in security, maintainability, data integrity, scale, performance, etc., is a *critical* part of growing into senior roles as an engineer.

The story of QA is relevant here. Once upon a time, every technical company had a QA department to test their code and ensure quality. Software engineers weren’t expected to write tests for their code — that was QA’s job. Eventually we realized that we wrote better software when engineers were held responsible for writing their own tests and testing their own code.

Developers howled and complained: they didn’t have time! they would never get anything built! But it gradually became clear that while it may take more time up front to write and test code, it saved immensely more time and pain in the longer run because the code got so much better and problems got found so much earlier.

It’s not like we got rid of QA  — QA departments still exist, especially in some industries, but they are more like consulting experts. They write test suites and test software, but more importantly they are a resource to make sure that everybody is writing good tests and shipping quality software.

This was long enough ago that most people writing code today probably don’t remember this. (It was mostly before my own time as well.) But you hear echoes of the same arguments today when engineers are complaining about having to be on call for their code, or write instrumentation and operate their code in production.

The point is not that every engineer has to do everything. It’s that there are elements of testing, operations, and architecture that every software engineer needs to know in order to write quality code — in order to not make mistakes that will cost you dearly down the line.

Specialists are not here to do the job for you; they’re here to help you do the job better.

“Architect” Done Right

If you must have architects at all, I suggest:

  1. Grow your architects from within. The best high-level thinkers are the ones with a thorough grounding in the context and the particulars.
  2. Be clear about who gets to have opinions vs who gets to make decisions. Having architects who consult, educate, and support is terrific. Having “pigeon architects” who “swoop and poop” — er, make technical decisions for engineers to implement — is a recipe for resentment and weak architectures.
  3. Pay them the same as your staff or principal engineers, not dramatically more. Create an org structure that encourages pendulum swings between (eng, mgr, arch) roles, not one with major barriers in form of pay or level disparities.
  4. Consider adopting one of the following patterns, which do a decent job of evading the two main traps we described above.

If your architects don’t have the technical skills, street cred, or time to spend growing baby engineers into great engineers, or mentoring senior engineers in architecture, they are probably also crappy architects. (another Katy Allred quote)

The “Embedded Architect” (aka Staff+ Engineer)

The most reliable way I know to align architecture and engineering goals is for them to be done by the same team. When one team is responsible for designing, developing, maintaining, and operating a service, you tend to have short, tight, feedback loops that let you ship products and iterate swiftly.

Here is one useful measure of your system’s complexity and the overhead involved in making changes:

“How long does it take you to ship a one-character fix?”

There are many other measures, of course, but this is one of the most important. It gets to the heart of why so many engineers get fed up with working at big companies, where the overhead for change is SO high, and the threshold for having an impact is SO long and laborious.

https://twitter.com/jetpack/status/1633005928399384576

The more teams have to be involved in designing, reviewing, and making changes, the slower you will grind. People seem to accept this as an inevitability of working in large and complex systems far more than I think they should.

Embedding architecture and operations expertise in every engineering team is a good way to show that these are skills and responsibilities we expect every engineer to develop.

This is the model that Facebook had. It is often paired with,

The “Architecture Group” of Practicing Engineers

Every company eventually needs a certain amount of standardization and coordination work. Sometimes this means building out a “Golden Path” of supported software for the organization. Sometimes this looks like a platform engineering team. Sometimes it looks like capacity planning years’ worth of hardware requirements across hundreds of teams.

I’ve seen this function fulfilled by super-senior engineers who come together informally to discuss upcoming projects at a very high level. I’ve seen it fulfilled by teams that are spun up by leadership to address a specific problem, then spun down again. I’ve seen it fulfilled by guilds and other formal meetings.

These conversations need to happen, absolutely no question about it. The question is whether it’s some people’s full time job, or one of many part-time roles played by your most senior engineers.

I’m more accustomed to the latter. Pro: it keeps the conversations grounded in reality. Con: engineers don’t have a lot of time to spend interfacing with other groups and doing “project management” or “stakeholder management”, which may be a sizable amount of work at some companies.

The “architect-engineer” pendulum

The architect-to-engineer pendulum is the only strategy, short of embedded architects / shared ownership, that seems likely to yield consistently good results, in my opinion.

The reasoning behind this is similar to the reasons for saying that engineering managers should probably spend some time doing hands-on work every few years. You need to be a pretty good engineer before you can be a good engineering manager or a good architect, and 5+ years after doing any hands-on work, you probably aren’t one anymore.

If you’re the type of architect that is part of an engineering team, partly responsible for a product, shipping code for that product, or on call for that product, this may not apply to you. But if you’re the type of architect that spends little if any time debugging/understanding or building the systems you architect, you should probably make a point of swinging back and forth every few years.

The “Time-Share Architect”

This one has aspects of both the “Architecture Working Group” and the “Architect-Engineer Pendulum”. It treats architecture as a job to be done, not a role to be occupied. Thinking of it like a “really extended pager rotation” is an interesting idea.

Somewhat relatedly — at Honeycomb, “lead engineer” is a title attached to a particular project, and refers to a set of actions and responsibilities for that project. It isn’t a title that’s attached to a particular person. Every engineer gets the opportunity to lead projects (if they want to), and everybody gets a break from doing the project management stuff from time to time. The beautiful thing about this is that everybody develops key leadership skills, instead of embodying them in a single person.

The important thing is that someone is performing the coordination activities, but the people building the system have final say on architecture decisions.

The “Advisor Architect”

I honestly have no problem with architects who are not seen as senior to, and do not have opinions overriding those of, the senior engineers who are building and maintaining the system.

Engineers who are making architectural decisions should consult lots of sources and get lots of opinions. If architects provide educated opinions and a high-level view of the systems, and the engineers make use of their expertise, well, that’s fan fucking tastic.

If architects are handing them assignments, or overriding their technical decisions and walking off, leaving a mess behind … fuck that shit. That’s the opposite of empowerment and ownership.

The “skin in the game” rule of thumb still holds, though. The less an architect is exposed to the maintenance and operational consequences of decisions, the less sway their opinion should hold with the group. That doesn’t mean their opinion brings no value, but the limitations of opinions formed at a distance should be made clear.

The Threat to Architects’ Careers

It’s super flattering to be told you are just too important, your time is too valuable for you to fritter it away on the mundane acts of debugging and reviewing PRs. (I know! It feels great!!!) But I don’t think it serves you well. Not you, or your team, your company, customers, or the tech itself.

And not *every* architect role falls into this trap. But there’s a definite correlation between orgs that stop calling you “engineers” and orgs that encourage (or outright expect) you to stop engineering at that level. In my experience.

But your credibility, your expertise, your moral authority to impose costs on the team are all grounded in your fluency and expertise with this codebase and this production system — and your willingness to shoulder those costs alongside them. (All the baby engineers want to grow up to be a principal engineer like this.)

But if you aren’t grounded in the tech, if you don’t share the burden, your direction is going to be received with some (or a LOT of) cynicism and resentment. Your technical work will also be lower quality.

Furthermore, you’re only hurting yourself in the long run. Some of the most useless people I’ve ever met were engineers who were “promoted” to architect many, many years ago, and have barely touched an editor or production shell since. They can’t get a job anywhere else, certainly not with comparable status or pay, and they know it. 🤒

They may know EVERYTHING about the company where they work, but those aren’t transferable skills. They have become super highly paid project managers.

And as a result … they often become the single biggest obstacle to progress. They are just plain terrified of being automated out of a job. It is frustrating to work with, and heartbreaking to watch. 💔

Don’t become that sad architect. Be an engineer. Own your own code in production. This is the way.

Coda: On “Solutions Architects”

You might note that I didn’t include solutions architects in this thread. There is absolutely a real and vibrant use for architects who advise. The distinction in my mind is: who has the last word, the engineers or the architect? Good engineering teams will seek advice from all kinds of expert sources, be they managers or architects or vendors.

My complaint is only with “architects” who are perceived to be superior to, and are capable of overruling the judgments of, the engineering team.

Exceptions abound; the title is not the person. My observations do not negate your existence as a skilled technologist. You obviously know your own role better than I do. 🙃

charity


Why On-Call Pain Is A Sociotechnical Problem

Cross-posted from leaddev.com

Most people hate being on call, because most on-call rotations are terrible.

Pager bombs, flappy alerts, false positives going off night and day, sleepless nights… Who can blame them? Small wonder that so many people develop a Pavlovian response to the sound of their Pagerduty ringtone. Alert goes off; adrenaline soars.

Conventional wisdom tells us that being on call means you put your whole life on hold, then spend all week lurching between firefighting and false alarms as you get progressively more sleep-deprived. It sucks, but that’s just what you get when you own your code in production. Right?

Noooooo. Wrong wrongy wrong wrong. Being on call should not be a constant cycle of things breaking down and firefighting, or alerts going off at all hours. This is not ‘normal.’ These are telltale signs of a fragile system and lack of alert discipline.

If on-call is a constant source of pain at your organization, that is a PROBLEM. It’s a five-alarm fire. You should drop what you’re doing and fix it with urgency.

An eternally miserable on-call rotation is a violation of the pact we make to support these systems:

  1. It is engineering’s job to own their code in production.
  2. It is management’s job to make sure it doesn’t suck.

This is a two-way handshake. If management isn’t holding up their end, if they don’t allocate enough time to fix the underlying problems – if they run a feature factory that never stops to refactor or invest in reliability work – then on-call will never get better, and you should leave.

On-call rotations are sociotechnical systems

On-call rotations are a classic example of a sociotechnical problem. A sociotechnical system consists of three elements: in this case that’s your production system, the people who operate it, and the tools they use to enact change on it.

You cannot solve sociotechnical problems with purely people solutions or with purely technical solutions. You need to use both.

The technical problems are usually easier to diagnose. You need to automate failovers, instrument your code, build and test your repair code, audit your indexes, etc. The social problems can be trickier to spot, but here’s a tip: they usually manifest as organizational problems.

Some engineers spend their entire career actively avoiding roles where they would have to be on call. Other engineers cling to the safety buffer of ops teams on call for their code, so that only manual escalations reach them.

Responsibility for your code is increasingly non-optional

This is becoming a harder line to hold, as the consensus has shifted decisively towards engineers owning their own code in production. Our systems are becoming exponentially more complex, and feedback loops are tightening. The people best equipped to own software in production are the people who built it. And in order to own it effectively, they need to close the loop by receiving the signal when something breaks.

But the point is not to invite software engineers into the same circle of hell that ops engineers have traditionally inhabited. This isn’t an act of vengeance. The point is that tightening these feedback loops is how we make systems better. Being on call shouldn’t have to destroy your social life or your sleep schedule.

Yes, engineering owns their software. But ensuring that engineering’s time is respected and their rest time valued is on management. It’s management’s job to make sure time is allocated to fixing recurring or known issues – and that they don’t kick the proverbial can down the road to later turn into tech debt. If reliability or productivity is suffering, managers need to reassign engineering cycles away from feature work. Managers’ performance should be evaluated by the four DORA metrics, as well as a fifth: how often is their team alerted outside of working hours?
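Here is a hedged sketch of what tracking that fifth metric could look like: count how many pages landed outside working hours. The alert timestamps and workday boundaries are invented; in reality you would export this data from your paging tool and respect each responder’s timezone.

```python
from datetime import datetime

# Invented alert timestamps; in practice, export these from your paging tool.
alerts = [
    "2023-06-01T03:14:00",   # 3am page :(
    "2023-06-01T14:30:00",
    "2023-06-03T23:45:00",   # Saturday night
    "2023-06-05T10:05:00",
]

WORKDAY_START, WORKDAY_END = 9, 18   # adjust for your team's hours

def out_of_hours(ts: str) -> bool:
    t = datetime.fromisoformat(ts)
    weekend = t.weekday() >= 5                        # Sat/Sun
    after_hours = not (WORKDAY_START <= t.hour < WORKDAY_END)
    return weekend or after_hours

bad = sum(out_of_hours(a) for a in alerts)
print(f"{bad} of {len(alerts)} alerts fired outside working hours")
```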

It’s reasonable to be woken up two to three times a year when on call. But more than that is not okay. It’s management’s responsibility to ensure enough resources are dedicated to maintaining system stability, and they should be held accountable – not the on-call engineers.

Humans doing human things

We all have lives outside of work – families, doctor appointments, dentist visits, and so on. Instead of being surprised when things come up, we can predict the ways people’s lives will conflict with on-call duty and come up with ways to ease the burden.

  • Kids. I would never ask a new parent to be on call. Being woken up by ONE instrument of chaos is all anyone should ever have to cope with at any given time.
  • Sleepy brain. People are never going to be at their best when they are woken up in the middle of the night. We should make sure alert text, documentation, and steps are all clear, simple, and otherwise tuned for 2 a.m. brain fog.
  • Getting sleep. Sometimes people struggle getting back to sleep, or they were up all night dealing with something. Establish that 1) no one is EVER to be on call two nights in a row after a bad night, and 2) they are entitled to sleep in, come in late, leave early – whatever works best to help them catch up on their sleep.
  • Anxiety. I’ve managed people before who had high anxiety about being on call. They were perfectly willing, but it didn’t matter how quiet the pager was – their anxiety knowing it was on made it impossible to sleep. We tried it for a while, and it wasn’t getting better, so we ultimately found other ways for them to pull their weight.

If someone is absolutely unable to participate in on-call rotations, well, it happens. If it’s a temporary situation, you might want to let it go. But if it’s a permanent thing, like in the ‘anxiety’ example above, the team should address this by finding other ways for that person to do their share of maintenance work.

For example, they could be in charge of failed builds or maintain the dev environment. What matters is that 1) the team as a whole feels like it’s a fair distribution of labor, and 2) there are enough people left in the on-call rotation that no one is overly burdened.

Technical stumbling blocks

  • Un-owned code. Everything you support, and every alert that can fire, should have a team that owns it.
  • Conversely, you may have architectural issues that make it impossible to isolate and alert only the owners. If you have ten different on-call rotations for various areas of the code base, but any time the database gets slow all ten of you get paged, this is a bad situation.
  • SLOs. As you scale up, there will come a point where you can no longer alert on individual services or symptoms. They will simply drown you. At this point, you need to migrate your alerts over to Service Level Objectives. SLOs align your engineering pain directly with user pain.
  • Paging too early. Ah, this always sounds like such a great idea. ‘Wouldn’t it be great if we could catch it and alert someone before the users are impacted?’ But it’s not. It’s a recipe for flappy alerts and aggravation. Alert when users are impacted, not before.
  • Two lanes. You need two types of alerts: ‘WAKE ME UP’ and ‘Deal with this later.’ No more, no less. Keep the list of ‘wake me up’ alerts as short, crisp, and carefully curated as possible. Put everything that needs to be dealt with ‘soon’ in the second lane, and have your on-call engineer sweep through it at the start and end of each day. If it doesn’t need to be acted upon within the next day, it probably shouldn’t be an alert. (See the sketch just after this list for one way these last two bullets can fit together.)
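For the curious, here is one way SLOs and the two lanes can fit together: alert on error-budget burn rate (the multi-window style popularized by the Google SRE workbook) and route the result into exactly two lanes. The SLO target, window sizes, and thresholds below are illustrative assumptions, not recommendations.

```python
# Illustrative multi-window burn-rate check feeding a two-lane alert policy.
SLO_TARGET = 0.999                 # e.g. 99.9% of requests succeed over 30 days
ERROR_BUDGET = 1 - SLO_TARGET

def burn_rate(errors: int, total: int) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0
    return (errors / total) / ERROR_BUDGET

def route_alert(errors_1h, total_1h, errors_6h, total_6h) -> str:
    fast = burn_rate(errors_1h, total_1h)
    slow = burn_rate(errors_6h, total_6h)
    if fast > 14 and slow > 14:     # burning ~a month of budget in a couple of days
        return "WAKE ME UP"
    if fast > 2 or slow > 2:
        return "deal with this later"   # swept at the start and end of the day
    return "no alert"

print(route_alert(errors_1h=90, total_1h=5000, errors_6h=400, total_6h=28000))
```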

On-call problems are often organizational problems

Sometimes people don’t want to be on call, and it’s not due to life events. This is a bit trickier to address, because these are actually organizational problems that present themselves as on-call problems. For example:

  • Tribal knowledge, or the ‘bus factor.’ You’re the debugger of last resort because you’ve been responsible for a mission-critical component of the system from the very beginning. The team tried training new people, but you still get called every time something goes wrong, and it’s not clear whether the issue would get fixed if you weren’t available (or how long it would take them without you).
  • Individual ownership vs. team ownership. Software is owned by teams, not by individuals. In an ideal world, this means everyone on the team is capable of debugging and maintaining all the systems they collectively own. In the real world, this means everything is at least understood by more than one engineer.
  • Too little – or too much – coverage. If only three or four people share the rotation, that’s too much of your life spent lugging around a laptop. Tossing all 20-30 engineers into a single rotation is also the wrong way to go; engineers won’t be on call often enough to stay familiar with the systems. The ideal on-call rotation has seven to eight people; five people is a bare minimum. With eight people, you are on call for a highly sustainable one week out of every two months.
  • Lower the barriers to asking for help, swapping times, covering for each other, etc. When someone asks for help with their on-call shift, thank them for asking. If the on-call shift isn’t that arduous, it’s really no big deal to back someone up for the duration of a movie.
  • Appointing primary/secondary on-call engineers can be really helpful here. Only the primary needs to get alerted and lug their laptop around, but they have a designated point person to tag if they need to run to the grocery store, drive through the boonies, or otherwise go offline for a while.
  • Put managers on call. I’m not generally a fan of putting managers in the rotation, but they really are the ideal backup situation. Especially when it comes to picking up the pager the day after someone has had a rough night. This serves multiple purposes: it helps keep the manager fresh, it exposes them to the reality of what on-call is currently like, and their time doesn’t have to be swapped for someone else’s.

The next time someone doesn’t want to be on call, it may be time to take a closer look at your organization as a whole to see whether the problem really is resource allocation, risk mitigation, or something else.

Making on-call costs tangible

On the topic of paying people more to be on call: there are loads of opinions here – it’s a very fraught topic. I generally come down on the side of ‘no, it’s part of the job,’ just like it is for doctors. With one big exception.

If you’re having a hard time getting upper management to understand the value of spending engineering cycles on the infrastructure and reliability work that needs to be done, instead of just cranking features… by all means, pay people for being on call.

Pay them for every event they have to respond to.

Pay them well.

Pay them so goddamn well the finance team starts squawking about the need to pay down that reliability debt.

If that’s the only way you can make it real for them, well, use the tools you’ve got. Engineers should never have to quietly suffer the pain of flaky software and unhappy users alone. Give management pain too until they take their jobs seriously enough to see that reliability issues get fixed.


Why I hate the phrase “breaking down silos”

We hear this phrase constantly: “I worked at breaking down silos.” “We need to break down silos.” “What did I do in my last role? I broke down silos.”

It sets my fucking teeth on edge.

What is a ‘silo’, anyway? What specifically wasn’t working well, and how did you solve it; or how was it solved, and what was your contribution to the solution? Did you just follow orders, or did you personally diagnose the problem, or did some of your suggestions pan out?

Solutions to complex problems rarely work on the first go, so … what else did you try? How did you know it wasn’t working, and how did you know when to abandon earlier ideas? It’s fiendishly hard to know whether you’ve given a solution enough time to bake, and given people enough time to adjust, so that you can even evaluate whether things are better or worse than before.

Communication is not magic pixie dust

Breaking down silos is supposed to be about increasing communication, removing barriers and roadblocks to collaboration.

But you can’t just blindly throw “more communication” at your teams. Too much communication can be just as much of a problem and a burden as too little. It can distract, and confuse, and create little eddies of information that is incorrect or harmful.

The quantity of the communication isn’t the issue, so much as the quality. Who is talking to whom, and when, and why? How does information flow throughout your company? Who gets left out? Whose input is sought, and when, and why? How can any given individual figure out who to talk to about any given responsibility?

(Image: an ecard that reads, “Every time you say ‘break down silos’, I want to ‘break down your face.’”)

When someone says they are “breaking down silos”, whether in an interview, a panel, or casual conversation, it tells me jack shit about what they actually did.

cliches are a substitute for critical thinking

It’s just like when people say “it’s a culture problem”, or “fix your culture”, or “everything is about people”. These phrases tell me nothing except that the speaker has gone to a lot of conferences and wants to sound cool.

If someone says “breaking down silos”, it immediately generates a zillion questions in my mind. I’m curious, because these problems are genuinely hard and people who solve them are incredibly rare.

Unfortunately, the people who use these phrases are almost never the ones who are out there in the muck and grind, struggling to solve real problems.

When asked, people who have done the hard labor of building better organizations with healthy communication flows, less inefficiency, and alignment around a single mission — people who have gotten all the people rowing in the same direction — tend to talk about the work.

People who haven’t, say they were “breaking down silos.”


Software deploys and cognitive biases

There exist some wonderful teams out there who have valid, well thought through, legitimate reasons for enforcing “NO FRIDAY DEPLOYS” week in and week out, for not hooking CI/CD up to autodeploy, and for not shipping one person’s changes at a time.

And then there are the reasons most people have.

Bad decisions, and the biases they came from

 

We’re humans. 💜 We leap to conclusions with the wetware we have, doing the best it can, based on heuristics that feel objectively true but are ultimately just emotional reactions rooted in past lived experience. And then we retroactively enshrine those goofy gut feelings with the language of noble motive and moral values.

“I tell people not to deploy to production … because I care so deeply about my team and their ability to have a quiet weekend.”

Barf. 🙄  That’s just like saying you tell your kid not to brush his teeth at night, because you care SO DEEPLY about him and his ability to go to bed calm and happy.

Once the retcon engine in your brain gets running, it comes up with all sorts of reasons. Plausible-sounding reasons! But every single one of them is materially false.

Deploy myths are never going away for good; they appeal to too many of our cognitive biases. But what if there was one simple thing you could do that would invert many of these cognitive biases and cause people to grapple with the question in a new way? What if you could kickstart a recalculation?

My next post will pick up right here. I’ll tell you all about the One Simple Trick you can do to fix your deploys and set you on the virtuous path of high-performing teams.

Til then, here’s what I’ve previously written on the topic.

 

Footnotes

 

Availability bias: The tendency to overestimate the likelihood of events with greater “availability” in memory, which can be influenced by how recent the memories are or how unusual or emotionally charged they may be.

Continued influence effect: The tendency to believe previously learned misinformation even after it has been corrected. Misinformation can still influence inferences one generates after a correction has occurred.

Conservatism bias: The tendency to revise one’s belief insufficiently when presented with new evidence.

Default effect: When given a choice between several options, the tendency to favor the default one.

Dread aversion: Just as losses yield double the emotional impact of gains, dread yields double the emotional impact of savouring

False-uniqueness bias: The tendency of people to see their projects and themselves as more singular than they actually are.

Functional fixedness: Limits a person to using an object only in the way it is traditionally used

Hyperbolic discounting: Discounting is the tendency for people to have a stronger preference for more immediate payoffs relative to later payoffs. Hyperbolic discounting leads to choices that are inconsistent over time – people make choices today that their future selves would prefer not to have made, despite using the same reasoning

IKEA effect: The tendency for people to place a disproportionately high value on objects that they partially assembled themselves, such as furniture from IKEA, regardless of the quality of the end product

Illusory truth effect: A tendency to believe that a statement is true if it is easier to process, or if it has been stated multiple times, regardless of its actual veracity.

Irrational escalation: The phenomenon where people justify increased investment in a decision, based on the cumulative prior investment, despite new evidence suggesting that the decision was probably wrong. Also known as the sunk cost fallacy

Law of the instrument: An over-reliance on a familiar tool or methods, ignoring or under-valuing alternative approaches. “If all you have is a hammer, everything looks like a nail”

Mere exposure effect: The tendency to express undue liking for things merely because of familiarity with them

Negativity bias: Psychological phenomenon by which humans have a greater recall of unpleasant memories compared with positive memories

Non-adaptive choice switching: After experiencing a bad outcome with a decision problem, the tendency to avoid the choice previously made when faced with the same decision problem again, even though the choice was optimal

Omission bias: The tendency to judge harmful actions (commissions) as worse, or less moral, than equally harmful inactions (omissions).

Ostrich effect: Ignoring an obvious (negative) situation

Plan continuation bias: Failure to recognize that the original plan of action is no longer appropriate for a changing situation or for a situation that is different than anticipated

Prevention bias: When investing money to protect against risks, decision makers perceive that a dollar spent on prevention buys more security than a dollar spent on timely detection and response, even when investing in either option is equally effective

Pseudocertainty effect: The tendency to make risk-averse choices if the expected outcome is positive, but make risk-seeking choices to avoid negative outcomes

Salience bias: The tendency to focus on items that are more prominent or emotionally striking and ignore those that are unremarkable, even though this difference is often irrelevant by objective standards

Selective perception bias: The tendency for expectations to affect perception

Status-quo bias: If no special action is taken, the default action that will happen is that the code will go live. You will need an especially compelling reason to override this bias and manually stop the code from going live, as it would by default.

Slow-motion bias: We feel certain that we are more careful and less risky when we slow down. This is precisely the opposite of the real world risk factors for shipping software. Slow is dangerous for software; speed is safety. The more frequently you ship code, the smaller the diffs you ship, the less dangerous each one actually becomes. This is the most powerful and difficult to overcome of all of our biases, because there is no readily available counter-metaphor for us to use. (Riding a bike is the best I’ve come up with. 😔)

Surrogation: Losing sight of the strategic construct that a measure is intended to represent, and subsequently acting as though the measure is the construct of interest

Time-saving bias: Underestimations of the time that could be saved (or lost) when increasing (or decreasing) from a relatively low speed and overestimations of the time that could be saved (or lost) when increasing (or decreasing) from a relatively high speed.

Zero-risk bias: Preference for reducing a small risk to zero over a greater reduction in a larger risk.


Why every software engineering interview should include ops questions

I’ve fallen way behind on my blog posts — my goal was to write one per month, and I haven’t published anything since MAY. Egads. So here I am dipping into the drafts archives! This one was written in April of 2016, when I was noodling over my CraftConf 2016 talk on “DevOps for Developers” (see slides).

So I got to the part in my talk where I’m talking about how to interview and hire software engineers who aren’t going to burn the fucking house down, and realized I could spend a solid hour on that question alone. That’s why I decided to turn it into a blog post instead.

Stop telling ops people to code better, start telling SWEs to ops better

Our industry has gotten very good at pressing operations engineers to get better at writing code, writing tests, and software engineering in general these past few years. Which is great! But we have not been nearly so good at pushing software engineers to level up their systems skills. Which is unfortunate, because it is just as important.

Most systems suffer from the syndrome of running too much software. Tossing more software onto the heap is as likely to cause problems as it is to solve them.

We see this play out at companies stacked with good software engineers who have built horrifying spaghetti messes of their infrastructure, and then commence paging themselves to death.

The only way to unwind this is to reset expectations, and make it clear that

  1. you are still responsible for your code after it’s been deployed to production, and 
  2. operational excellence is everyone’s job.

Operations is the constellation of tools, practices, policies, habits, and docs around shipping value to users, and every single one of us needs to participate in order to do this swiftly and safely.

Every software engineering interviewing loop should have an ops component.

Nobody interviews candidates for SRE or ops nowadays without asking some coding questions. You don’t have to be the greatest programmer in the world, but you can’t be functionally illiterate. The reverse is less common: asking software engineers basic, stupid questions about the lifecycle of their code, instrumentation best practices, etc. 

It’s common practice at lots of companies now to have a software engineer in the loop for hiring SREs to evaluate their coding abilities. It should be just as common to have an ops engineer in the loop for a SWE hire, especially for any SWE who is being considered for a key senior position. Those are the people you most rely on to be mentors and role models for junior hires. All engineers should embrace the ethos of owning their code in production, and nobody should be promoted or hired into a senior role if they don’t.

And yes, that means all engineers!  Even your iOS/Android engineers and website developers should be interested in what happens to their code after they hit deploy.  They should care about things like instrumentation, and what kind of data they may need later to debug their problems, and how their features may impact other infrastructure components.

You need to balance out your software engineers with engineers who don’t react to every problem by writing more code. You need engineers who write code begrudgingly, as a last resort. You’ll find these priceless gems in ops and SRE.

ops questions for software engineers

The best questions are broad and start off easy, with plenty of reasonable answers and pathways to explore. Even beginners can give a reasonable answer, while experts can go on talking for hours.

For example: give them the specs for a new feature, and ask them to talk through the infrastructure choices and dependencies to support that feature. Do they ask about things like which languages, databases, and frameworks are already supported by the team? Do they understand what kind of monitoring and observability tools to use? Do they ask about local instrumentation best practices?

Or design a full deployment pipeline together. Probe what they know about generating artifacts, versioning, rollbacks, branching vs master, canarying, rolling restarts, green/blue deploys, etc. How might they design a deploy tool? Talk through the tradeoffs.
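If you want something concrete to sketch on the whiteboard with a candidate, a deliberately simplified canary-promotion loop works well. Everything below is a placeholder (deploy, error_rate, and rollback stand in for whatever your real deploy tooling and metrics store provide); the point is the shape of the conversation, not the code.

```python
import random

def deploy(version: str, percent: int) -> None:
    print(f"routing {percent}% of traffic to {version}")

def error_rate(version: str) -> float:
    return random.uniform(0.0, 0.02)        # stand-in for a real metrics query

def rollback(version: str) -> None:
    print(f"rolling back {version}")

def canary_release(candidate: str, baseline: str, threshold: float = 0.005) -> bool:
    """Shift traffic in stages; bail out if the candidate errors more than baseline."""
    for percent in (1, 5, 25, 100):
        deploy(candidate, percent)
        if error_rate(candidate) > error_rate(baseline) + threshold:
            rollback(candidate)
            return False
    return True

print("promoted!" if canary_release("v2.3.1", baseline="v2.3.0") else "rolled back")
```

Where should the thresholds come from? What metrics besides error rate matter? What happens to in-flight requests on rollback? That’s the conversation you want to have.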

Some other good starting points:

  • “Tell me about the last time you caused a production outage. What happened, how did you find out, how was it resolved, and what did you learn?”
  • “What are some of your favorite tools for visibility, instrumentation, and debugging?”
  • “Latency seems to have doubled over the last 6 hours. Where do you start looking, how do you start debugging?”
  • And this chestnut: “What happens when you type ‘google.com’ into a web browser?” You would be fucking *astonished* how many senior software engineers don’t know a thing about DNS, HTTP, SSL/TLS, cookies, TCP/IP, routing, load balancers, web servers, proxies, and on and on.

Another question I really like is: “what’s your favorite API (or database, or language) and why?” followed up by “… and what are the worst things about it?” (True love doesn’t mean blind worship.)

Remember, you’re exploring someone’s experience and depth here, not giving them a pass-fail quiz. It’s okay if they don’t know it all. You’re also evaluating them on communication skills, which are severely underrated by most people but are actually a key technical skill.

Signals to look for

You’re not looking for perfection. You are teasing out signals for things like, how will this person perform on a team where software engineers are expected to own their code? How much do they know about the world outside the code they write themselves? Are they curious, eager, and willing to learn, or fearful, incurious and begrudging?

Do they expect networks to be reliable? Do they expect databases to respond, retries to succeed? Are they offended by the idea of being on call? Are they overly clever or do they look to simplify? (God, I hate clever software engineers 🙃.)

It’s valuable to get a feel for an engineer’s operational chops, but let’s be clear, you’re doing this for one big reason: to set expectations. By making ops questions part of the interview, you’re establishing from the start that you run an org where operations is valued, where ownership is non-optional. This is not an ivory tower where software engineers can merrily git push, go home for the day, and let other people handle the fallout.

It can be toxic when you have an engineer who thinks all ops work is toil and operations engineering is lesser-than. It tends to result in operations work being done very poorly. This is your best chance to let those people self-select out.

You know what, I’m actually feeling uncharacteristically optimistic right now. I’m remembering how controversial some of this stuff was when I first wrote it, five years ago in 2016. Nowadays it just sounds obvious. Like table stakes.

Hell yeah. 🤘


Notes on the Perfidy of Dashboards

The other day I said this on twitter —

… which stirred up some Feelings for many people. 🙃  So I would like to explain my opinions in more detail.

Static vs dynamic dashboards

First, let’s define the term. When I say “dashboard”, I mean STATIC dashboards, i.e. collections of metrics-based graphs that you cannot click on to dive deeper or break down or pivot. If your dashboard supports this sort of responsive querying and exploration, where you can click on any graph to drill down and slice and dice the data arbitrarily, then breathe easy — that’s not what I’m talking about. Those are great. (I don’t really consider them dashboards, but I have heard a few people refer to them as “dynamic dashboards”.)

Actually, I’m not even “against” static dashboards. Every company has them, including Honeycomb. They’re great for getting a high level sense of system functioning, and tracking important stats over long intervals. They are a good starting point for investigations. Every company should have a small, tractable number of these which are easily accessible and shared by everyone.

Debugging with dashboards: it’s a trap

What dashboards are NOT good at is debugging, or understanding or describing novel system states.

I can hear some of you now: “But I’ve debugged countless super-hard unknown problems using only static dashboards!” Yes, I’m sure you have. If all you have is a hammer, you CAN use it to drive screws into the wall, but that doesn’t mean it’s the best tool. And it takes an extraordinary amount of knowledge and experience to be able to piece together a narrative that translates low-level system statistics into bugs in your software and back. Most software engineers don’t have that kind of systems experience or intuition…and they shouldn’t have to.

Why are dashboards bad for debugging? Think of it this way: every dashboard is an answer to a question someone asked at some point. Your monitoring system is probably littered with dashboards, thousands and thousands of them, most of whose questions have been long forgotten and many of whose source data streams have long since gone silent.

So you come along trying to investigate something, and what do you do? You start skimming through dashboards, eyes scanning furiously, looking for visual patterns — e.g. any spikes that happened around the same time as your incident. That’s not debugging, that’s pattern-matching. That’s … eyeball racing.

If we did math like we do dashboards

Imagine you’re in a math competition, and you get handed a problem to solve. But instead of pulling out your pencil and solving the equation, step by step, you start hollering out guesses.

“27!”
“19992.41!”
“1/4325!”

That’s what flipping through dashboards feels like to me. You’re riffling through a bunch of graphs that were relevant to some long-ago situation, without context or history, without showing their work. Sometimes you’ll spot the exact scenario, and — huzzah! — the number you shout is correct! But when it comes to unknown scenarios, the odds are not in your favor.

Debugging looks and feels very different from flipping through answers. You ask a question, examine the answer, and ask another question based on the result. (“Which endpoints were erroring? Are all of the requests erroring, or only some? What did they have in common?”, etc.)

You methodically put one foot in front of the other, following the trail of bread crumbs, until the data itself leads you to the answer.
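
To make that concrete, here’s a toy Python sketch of that kind of iterative questioning over wide, per-request events. The events and field names are invented for the example:

```python
# Hypothesis-driven debugging as a series of questions over wide events.
# These events and fields are invented purely for illustration.
events = [
    {"endpoint": "/payment", "status": 500, "error": "mysql: lock wait timeout", "region": "us-east-1"},
    {"endpoint": "/import",  "status": 500, "error": "mysql: lock wait timeout", "region": "us-east-1"},
    {"endpoint": "/home",    "status": 200, "error": None,                       "region": "us-west-2"},
]

# Q1: which endpoints are erroring?
erroring = [e for e in events if e["status"] >= 500]
print("erroring endpoints:", {e["endpoint"] for e in erroring})

# Q2: what do the erroring requests have in common?
print("errors: ", {e["error"] for e in erroring})
print("regions:", {e["region"] for e in erroring})

# ...and so on: each answer suggests the next, narrower question.
```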

The limitations of metrics and dashboards

Unfortunately, you cannot do that with metrics-based dashboards, because you stripped away the connective tissue of the event back when you wrote the metrics out to disk.

If you happened to notice while skimming through dashboards that your 404 errors spiked at 14:03, and your /payment and /import endpoints started erroring at 14:03, and your database started returning a bunch of mysql errors shortly after 14:00, you’ll probably assume that they’re all related and leap to find more evidence that confirms it.

But you cannot actually confirm that those events are the same ones, not with your metrics dashboards. You cannot drill down from errors to endpoints to error strings; for that, you’d need a wide structured data blob per request. Those might in fact be two or three separate outages or anomalies happening at the same time, or just the tip of the iceberg of a much larger event, and your hasty assumptions might extend the outage for much longer than was necessary.

With metrics, you tend to find what you’re looking for. You have no way to correlate attributes between requests or ask “what are all of the dimensions these requests have in common?”, or to flip back and forth and look at the request as a trace. Dashboards can be fairly effective at surfacing the causes of problems you’ve seen before (raise your hand if you’ve ever been in an incident review where one of the follow up tasks was, “create a dashboard that will help us find this next time”), but they’re all but useless for novel problems, your unknown-unknowns.
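
Here’s a minimal Python sketch of that difference, with made-up field names. The counters throw away the connective tissue the instant they’re incremented; the wide per-request event keeps every attribute attached to the same request, so you can slice and correlate later:

```python
import json
import time

counters = {}  # stand-in for a real metrics client (hypothetical)

def increment(name):
    counters[name] = counters.get(name, 0) + 1

def record_request(path, user_id, status, error, start):
    # Option 1: metrics. Each counter is written out in isolation; afterwards
    # you can no longer ask "which users hit which errors on which endpoints?"
    increment(f"http.status.{status}")
    increment(f"endpoint.{path}.requests")

    # Option 2: one wide structured event per request. Every attribute stays
    # attached to the same request, so you can group by endpoint, user, error
    # string, duration, build id, ... long after the fact.
    print(json.dumps({
        "timestamp": start,
        "endpoint": path,
        "user_id": user_id,
        "status": status,
        "error": error,
        "duration_ms": round((time.time() - start) * 1000, 2),
    }))

record_request("/payment", "user_42", 500, "mysql: too many connections", time.time())
```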

Other complaints about dashboards:

They tend to show percentiles like the 95th, 99th, 99.9th, 99.99th, etc., which can cover over a multitude of sins. You really want a tool that allows you to see MAX and MIN, and heatmap distributions.
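
A toy example of how that plays out, using a crude nearest-rank percentile (all numbers invented for illustration):

```python
# 1000 requests: most are fast, ten are slow-ish, one took 30 seconds.
latencies_ms = sorted([20] * 989 + [25] * 10 + [30_000])

p99 = latencies_ms[int(len(latencies_ms) * 0.99) - 1]  # crude nearest-rank p99
print(f"p99: {p99} ms")                # 25 ms -- looks perfectly healthy
print(f"max: {max(latencies_ms)} ms")  # 30000 ms -- somebody had a very bad time
```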

A lot of dashboards end up getting created that are overly specific to the incident you just had — naming specific hosts, etc — which just creates clutter and toil. This is how your dashboards become that graveyard of past outages.

The most useful approach to dashboards is to maintain a small set of them; cull regularly, and think of them as a list of starter queries for your investigations.

Fred Hebert has this analogy, which I like:

“I like to compare the dashboards to the big display in a hospital room: heartbeat, pressure, oxygenation, etc. Those can tell you when a thing is wrong, but the context around the patient chart (and the patient themselves) is what allows interpretation to be effective. If all we have is the display but none of the rest, we’re not getting anywhere close to an accurate picture. The risk with the dashboard is having the metrics but not seeing or knowing about the rest changing.”

In conclusion

Dashboards aren’t universally awful. The overuse of them just encourages sloppy thinking, and static ones make it impossible for you to follow the plot of an outage, or validate your hypotheses. 🤒  There are too many of them, and not enough shared consensus. (It would help if, like, new dashboards expired within a month if nobody looked at them again.)

If what you have is “nothing”, even shitty dashboards are far better than no dashboards. But shitty dashboards have been the only game in town for far too long. We need more vendors to think about building for queryability, explorability, and the ability to follow a trail of breadcrumbs. Modern systems are going to demand more and more of this approach.

Nothing < Dashboards < a Queryable, Exploratory Interface

If everyone out there who slaps “observability” on their web page also felt the responsibility to add an observability-enabling interface to their tool, one that would let users explore and identify unknown-unknowns, we would all be in a far better place. 🙂


How much is your fear of continuous deployment costing you?

Most people aren’t doing true CI/CD. Most teams wait far too long to get their code into prod after writing it. Most painful of all are the teams who have done all the hard parts — wired up continuous integration, achieved test coverage, etc — but still deploy by hand, thus depriving themselves of the payoff for their hard work.

Any time an engineer merges a diff back to main, this ought to trigger a full run of your CI/CD pipeline, culminating with an automatic deploy to production. This should happen once per mergeset, never batching multiple engineers’ diffs in a run, and it should be over and done with in 15 minutes or less with no human intervention.

It’s 2021, and everyone should know this by now.

✨✨15 minutes or bust✨✨

Okay, but what if you don’t? How costly can it be, really?

Let’s do some back of the envelope math. First you’ll need to answer a couple questions about your org and deploy pipeline.

  • How many engineers do you have? ____________
  • How long typically elapses between when someone writes code and that code is live in production? _____________

Let (n) be the number of engineers it takes to efficiently build and run your product, assuming each set of changes will autodeploy individually in <15 min.

  • If changes typically ship on the order of hours, you need 2(n).
  • If changes ship on the order of days, you need 2(2(n)).
  • If changes ship on the order of weeks, you need 2(2(2(n))).
  • If changes ship on the order of months, you need 2(2(2(2(n)))).

Your 6-person team with a consistent autodeploy loop would take 24 people to do the same amount of work if it took days to deploy their changes. Your 10-person team that ships in weeks would need 80 people.

At a cost to the company of roughly $200k per engineer, that’s $3.6 million in the first example and $14 million in the second. That’s how much your neglect of internal tools and knee-jerk fear of autodeploy might be costing you.
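
If you want to plug in your own numbers, here’s the same back-of-the-envelope math as a few lines of Python (same hand-wavy rule of thumb; see the P.S. below for the rigorous methodology 🙃):

```python
# Rule of thumb from above: every step away from "<15 min autodeploy"
# roughly doubles the headcount needed to ship the same amount of work.
MULTIPLIERS = {"15min": 1, "hours": 2, "days": 4, "weeks": 8, "months": 16}
COST_PER_ENGINEER = 200_000  # the rough ~$200k-per-engineer figure from above

def cost_of_slow_deploys(n_engineers, cadence):
    needed = n_engineers * MULTIPLIERS[cadence]
    extra = needed - n_engineers
    return needed, extra * COST_PER_ENGINEER

print(cost_of_slow_deploys(6, "days"))    # (24, 3600000)  -> $3.6M
print(cost_of_slow_deploys(10, "weeks"))  # (80, 14000000) -> $14M
```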

It’s not just about engineers. The more delay you add into the process of building and shipping code, the more pathologies multiply, and you find yourselves needing to spend more and more time addressing those pathologies instead of making forward progress for the business. Longer diffs. Manual deploy processes. Bunching up multiple engineers’ diffs in a single deploy, then spending the rest of the day trying to figure out which one was at fault for the error.

Soon you need an SRE team to handle your reliability issues, build engineering specialists to build internal tools for all these engineers, managers to manage the teams, product folks to own the roadmap and project managers to coordinate all this blocking and waiting on each other…

You could have just fixed your build process. You could have just committed to continuous delivery. You would be moving more swiftly and confidently as a small, killer team than you ever could at your lumbering size.

✨✨15 minutes or bust✨✨

In 2021, how will *you* achieve the dream of CI/CD, and liberate your engineers from the shackles of pointless toil?

P.S. if you want to know my methodology for coming up with this equation, it’s called “pulled out of my ass because it sounded about right, then checked with about a dozen other technical folks to see if it aligned with their experience.”


On Call Shouldn’t Suck: A Guide For Managers

There are few engineering topics that provoke as much heated commentary as oncall. Everybody has a strong opinion. So let me say straight up that there are few if any absolutes when it comes to doing this well; context is everything. What’s appropriate for a startup may not suit a larger team. Rules are made to be broken.

That said, I do have some feelings on the matter. Especially when it comes to the compact between engineering and management. Which is simply this:

It is engineering’s responsibility to be on call and own their code. It is management’s responsibility to make sure that on call does not suck. This is a handshake, it goes both ways, and if you do not hold up your end they should quit and leave you.

As for engineers who write code for 24×7 highly available services, it is a core part of their job to support those services in production. (There are plenty of software jobs that do not involve building highly available services, for those who are offended by this.) Tossing it off to ops after tests pass is nothing but a thinly veiled form of engineering classism, and you can’t build high-performing systems by breaking up your feedback loops this way.

Someone needs to be responsible for your services in the off-hours. This cannot be an afterthought; it should play a prominent role in your hiring, team structure, and compensation decisions from the very start. These are decisions that define who you are and what you value as a team.

Some advice on how to organize your on call efforts, in no particular order.

  • It is easier to keep yourself from falling into an operational pit of doom than it is to claw your way out of one. Make good operational hygiene a priority from the start. Value good, clean, high-level abstractions that allow you to delegate large swaths of your infrastructure and operational burden to third parties who can do it better than you — serverless, AWS, *aaS, etc. Don’t fall into the trap of disrespecting operations engineering labor, it’s the only thing that can save you.
  • Invest in good release and deploy tooling. Make this part of your engineering roadmap, not something you find in the couch cushions. Get code into production within minutes after merging, and watch how many of your nightmares melt away or never happen.
  • Invest in good instrumentation and observability. Impress upon your engineers that their job is not done when tests pass; it is not done until they have watched users using their code in production. Promote an ownership mentality over the full software life cycle. This is how dev.to did it.
  • Construct your feedback loops thoughtfully. Try to alert the person who made the broken change directly. Never send an alert to someone who isn’t fully equipped and empowered to fix it.
  • When an engineer is on call, they are not responsible for normal project work — period. That time is sacred and devoted to fixing things, building tooling, and creating guard-rails to protect people from themselves. If nothing is on fire, the engineer can take the opportunity to fix whatever has been annoying them. Allow for plenty of agency and following one’s curiosity, wherever it may lead, and it will be a special treat.
  • Closely track how often your team gets alerted. Take ANY out-of-hours-alert seriously, and prioritize the work to fix it. Night time pages are heart attacks, not diabetes.
  • Consider joining the on call rotation yourself! If nothing else, generously pinch hit and be an eager and enthusiastic backup on the regular.
  • Reliability work and technical debt are not secondary to product work. Budget them into your roadmap, right alongside your features and fixes. Don’t plan so tightly that you have no flex for the unexpected. Don’t be afraid to push back on product and don’t neglect to sell it to your own bosses. People’s lives are in your hands; this is what you get paid to do.
  • Consider making after-hours on call fully-elective. Why not? What is keeping you from it? Fix those things. This is how Intercom did it.
  • Depending on your stage and available resources, consider compensating for it. This doesn’t have to be cash; it could be a Friday off the week after every on call rotation. The more established and well-funded your company, the more likely you should do this, in order to surface the right incentives up the org chart.
  • Once you’ve dug yourself out of firefighting mode, invest in SLOs (Service Level Objectives). SLOs and observability are the mature way to get out of reactive mode and plan your engineering work based on tradeoffs and user impact. (There’s a tiny error-budget sketch just after this list.)
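
To give a flavor of what an SLO buys you, here’s a minimal error-budget sketch in Python. The target and request volumes are invented for illustration:

```python
# A simple availability SLO over a 30-day window, and the error budget it implies.
SLO_TARGET = 0.999            # 99.9% of requests should succeed (illustrative)
WINDOW_REQUESTS = 50_000_000  # hypothetical request volume for the window

error_budget = WINDOW_REQUESTS * (1 - SLO_TARGET)  # failures you can "afford"
failed_so_far = 32_000                             # hypothetical failures observed

remaining = error_budget - failed_so_far
print(f"error budget: {error_budget:,.0f} failed requests")
print(f"remaining:    {remaining:,.0f} ({remaining / error_budget:.0%} of budget left)")
# Plenty of budget left? Ship features. Budget nearly burned? Prioritize reliability work.
```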

I believe it is thoroughly possible to construct an on call rotation that is 100% opt-in, a badge of pride and accomplishment, something that brings meaning and mastery to people’s engineering roles and ties them emotionally to their users. I believe that being on call is something that you can genuinely look forward to.

But every single company is a unique complex sociotechnical snowflake. Flipping the script on whether on call is a burden or a blessing will require a unique solution, crafted to meet your specific needs and drawing on your specific history. It will require tinkering. It will take maintenance.

Above all: ✨RAISE YOUR STANDARDS✨ for what you expect from yourselves. Your greatest enemy is how easily you accept the status quo, and then make up excuses for why it is necessarily this way. You can do better. I know you can.

There is lots and lots of prior art out there when it comes to making on call work for you, and you should research it deeply. Watch some talks, read some pieces, talk to some people. But then you’ll have to strike out on your own and try something. Cargo-culting someone else’s solution is always the wrong answer.

Any asshole can write some code; owning and tending complex systems for the long run is the hard part. How you choose to shoulder this burden will be a deep reflection of your values and who you are as a team.

And if your on call experience is mandatory and severely life-impacting, and if you don’t take this dead seriously and fix it ASAP? I hope your team will leave you, and go find a place that truly values their time and sleep.


Questionable Advice: War Rooms? Really?!?

My company has recently begun pushing for us to build and staff out what I can only describe as “command centers”. They’re picturing graphs, dashboards…people sitting around watching their monitors all day just to find out which apps or teams are having issues. With your experience in monitoring and observability, and your opinions on teams supporting their own applications…do you think this sounds like a bad idea? What are things to watch out for, or some ways this might all go sideways?

— Anonymous

Jesus motherfucking Christ on a stick. Is it 1995 where you work? That’s the only way I can read this plan and have it make any sense.

It’s a giant waste of money and no, it won’t work. This path leads into a death spiral where alarms go off constantly (yet somehow never actually catch the real problems), people get burned out, and anyone competent either a) leaves or b) refuses to be on call. Sideways enough for you yet?

Snark aside, there are two foundational flaws with this plan.

1) watching graphs is pointless. You can automate that shit, remember?  ✨Computers!✨ Furthermore, this whole monitoring-based approach will only ever help you find the known unknowns, the problems you already know to look for. But most of your actual problems will be unknown unknowns, the ones you don’t know about yet.

2) those people watching the graphs… When something goes wrong, what exactly can they do about it? The answer, unfortunately, is “not much”. The only people who can swiftly diagnose and fix complex systems issues are the people who build and maintain those systems, and those people are busy building and maintaining, not watching graphs.

That extra human layer is worse than useless; it is actively harmful. By insulating developers from the consequences of their actions, you are concealing from them the information they need to understand the consequences of their actions. You are interfering with the most basic of feedback loops and causing it to malfunction.

The best time to find a bug is as soon as possible after writing it, while it’s all fresh in your head. If you let it fester for days, weeks, or months, it will be exponentially more challenging to find and solve. And the best people to find those bugs are the people who wrote them.

Helpful? Hope so. Good luck. And if they implement this anyway — leave. You deserve to work for a company that won’t waste your fucking time.

with love, charity.
