A Manager’s Bill of Responsibilities (and Rights)

October 30, 2019 · July 14, 2023 · mipsytipsy · culture, engineering, management, tech culture · 7 Comments

Over a year and a half ago, I wrote up a post about the rights and responsibilities due any engineer at Honeycomb. At the time we were in the middle of a growth spurt, had just hired several new engineers, and I was in the process of turning day-to-day engineering management over to Emily. Writing things down helped me codify what I actually cared about, and helped keep us true to our principles as we grew.

Tacked on to the end of the post was a list of manager responsibilities, almost as an afterthought. Many people protested, “don’t managers get any rights??” (and naturally I snapped “NO! hahahahahha”)

I always intended to circle back and write a followup post with the rights and responsibilities for managers. But it wasn’t til recently, as we are gearing up for another hiring spurt and have expanded our managerial ranks, that it really felt like its time had come.

The time has come, the time is now, as marvin k. mooney once said. Added the bill of rights, and updated and expanded the list of responsibilities. Thanks Emily Nakashima for co-writing it with me.

Manager’s Bill of Rights

You shall receive honest, courageous, timely feedback about yourself and your team, from your reports, your peers, and your leaders. (No one is exempt from feeding the hungry hungry feedback hippo! NOO ONNEEEE!) 🦛🦛🦛🦛🦛🦛🦛
Management will be treated with the same respect and importance as individual work.
You have the final say over hiring, firing, and leveling decisions for your team. It is expected that you solicit feedback from your team and peers and drive consensus where possible. But in the end, the say is yours.
Management can be draining, difficult work, even at places that do it well. You will get tactical, strategic, and emotional support from other managers.
You cannot take care of others unless you first practice self-care. You damn well better take vacations. (Real ones.)
You have the right to personal development, career progression, and professional support. We will retain a leadership coach for you.
You do not have to be a manager if you do not want to. No one will ever pressure you.

Manager’s Responsibilities

Recruit and hire and train your team. Foster a sense of solidarity and “teaminess” as well as real emotional safety.
Cultivate an inclusive culture and redistribute opportunity. Fuck a pedigree. Resist monoculture.
Care for the people on your team. Support them in their career trajectory, personal goals, work/life balance, and inter- and intra-team dynamics.
Keep an eye out for people on other teams who aren’t getting the support they need, and work with your leadership and manager peers to fix the situation.
Give feedback early and often. Receive feedback gracefully. Always say the hard things, but say them with love.
Move us relentlessly forward, staying alert for rabbit-holing and work that doesn’t contribute to our goals. Ensure redundancy/coverage of critical areas.
Own the planning process for your team, be accountable for the goals you set. Allocate resources by communicating priorities and requesting support. Add focus or urgency where needed.
Own your time and attention. Be accessible. Actively manage your calendar. Try not to make your emotions everyone else’s problems (but do lean on your own manager and your peers for support).
Make your own personal growth and self-care a priority. Model the values and traits we want employees to pattern themselves after.
Stay vulnerable.

(Easier said than done, huh?)

<3 charity

Deploys: It’s Not Actually About Fridays

October 28, 2019 · July 14, 2023 · mipsytipsy · culture, deploys, friday deploys, management, operations, paging & alerting, sre, tech culture · 2 Comments

I just read this piece, which is basically a very long subtweet about my Friday deploy threads. Go on and read it: I’ll wait.

Here’s the thing. After getting over some of the personal gibes (smug optimism? literally no one has ever accused me of being an optimist, kind sir), you may be expecting me to issue a vigorous rebuttal. But I shan’t. Because we are actually in violent agreement, almost entirely.

I have repeatedly stressed the following points:

I want to make engineers’ lives better, by giving them more uninterrupted weekends and nights of sleep. This is the goal that underpins everything I do.
Anyone who ships code should develop and exercise good engineering judgment about when to deploy, every day of the week
Every team has to make their own determination about which policies and norms are right given their circumstances and risk tolerance
A policy of “no Friday deploys” may be reasonable for now but should be seen as a smell, a sign that your deploys are risky. It is also likely to make things WORSE for you, not better, by causing you to adopt other risky practices (e.g. elongating the interval between merge and deploy, batching changes up in a single deploy)

This has been the most frustrating thing about this conversation: that a) I am not in fact the absolutist y’all are arguing against, and b) MY number one priority is engineers and their work/life balance. Which makes this particularly aggravating:

Lastly there is some strange argument that choosing not to deploy on Friday “Shouldn’t be a source of glee and pride”. That one I haven’t figured out yet, because I have always had a lot of glee and pride in being extremely (overly?) protective of the work/life balance of the engineers who either work for me, or with me. I don’t expect that to change.

Hold up. Did you catch that clever little logic switcheroo? You defined “not deploying on Friday” as being a priori synonymous with “protecting the work/life balance of engineers”. This is how I know you haven’t actually grasped my point, and are arguing against a straw man. My entire point is that the behaviors and practices associated with blocking Friday deploys are in fact hurting your engineers.

I, too, take a lot of glee and pride in being extremely, massively, yes even OVERLY protective of the work/life balance of the engineers who either work for me, or with me.

AND THAT IS WHY WE DEPLOY ON FRIDAYS.

Because it is BETTER for them. Because it is part of a deploy ecosystem which results in them being woken up less and having fewer weekends interrupted overall than if I had blocked deploys on Fridays.

It’s not about Fridays. It’s about having a healthy ecosystem and feedback loop where you trust your deploys, where deploys aren’t a big deal, and they never cause engineers to have to work outside working hours. And part of how you get there is by not artificially blocking off a big bunch of the week and not deploying during that time, because that breaks up your virtuous feedback loop and causes your deploys to be much more likely to fail in terrible ways.

The other thing that annoys me is when people say, primly, “you can’t guarantee any deploy is safe, but you can guarantee people have plans for the weekend.”

Know what else you can guarantee? That people would like to sleep through the fucking night, even on weeknights.

When I hear people say this all I hear is that they don’t care enough to invest the time to actually fix their shit so it won’t wake people up or interrupt their off time, seven days a week. Enough with the virtue signaling already.

You cannot have it both ways, where you block off a bunch of undeployable time AND you have robust, resilient, swift deploys. Somehow I keep not getting this core point across to a substantial number of very intelligent people. So let me try a different way.

Let’s try telling a story.

A tale of two startups

Here are two case studies.

Company X

Company X is a three-year-old startup. It is a large, fast-growing multi-tenant platform on a large distributed system with spiky traffic, lots of user-submitted data, and a very green database. Company X deploys the API about once per day, and does a global deploy of all services every Tuesday. Deploys often involve some firefighting and a rollback or two, and Tuesdays often involve deploying and reverting all day (sigh).

Pager volume at Company X isn’t the worst, but usually involves getting woken up a couple times a week, and there are deploy-related alerts after maybe a third of deploys, which then need to be triaged to figure out whose diff was the cause.

Company Z

Company Z is a three-year-old startup. It is a large, fast-growing multi-tenant platform on a large distributed system with spiky traffic, lots of user-submitted data, and a very green house-built distributed storage engine. Company Z automatically triggers a deploy within 30 minutes of a merge to master, for all services impacted by that merge. Developers at company Z practice observability-driven deployment, where they instrument all changes, ask “how will I know if this change doesn’t work?” during code review, and have a muscle memory habit of checking to see if their changes are working as intended or not after they merge to master.

Deploys rarely result in the pager going off at Company Z; most problems are caught visually by the engineer and reverted or fixed before any paging alert can fire. Pager volume consists of roughly one alert per week outside of working hours, and no one is woken up more than a couple times per year.

Same damn problem, better damn solutions.

If it wasn’t extremely obvious, these companies are my last two jobs, Parse (company X, from 2012-2016) and Honeycomb (company Z, from 2016-present).

They have a LOT in common. Both are services for developers, both are platforms, both are running highly elastic microservices written in golang, both get lots of spiky traffic and store lots of user-defined data in a young, homebrewed columnar storage engine. They were even built by some of the same people (I built infra for both, and they share four more of the same developers).

At Parse, deploys were run by ops engineers because of how common it was for there to be some firefighting involved. We discouraged people from deploying on Fridays, we locked deploys around holidays and big launches. At Honeycomb, none of these things are true. In fact, we literally can’t remember a time when it was hard to debug a deploy-related change.

What’s the difference between Company X and Company Z?

So: what’s the difference? Why are the two companies so dramatically different in the riskiness of their deploys, and the amount of human toil it takes to keep them up?

I’ve thought about this a lot. It comes down to three main things.

Observability
Observability-driven development
Single merge per deploy

1. Observability.

I think that I’ve been reluctant to hammer this home as much as I ought to, because I’m exquisitely sensitive about sounding like an obnoxious vendor trying to sell you things. 😛 (Which has absolutely been detrimental to my argument.)

When I say observability, I mean in the precise technical definition as I laid out in this piece: with high cardinality, arbitrarily wide structured events, etc. Metrics and other generic telemetry will not give you the ability to do the necessary things, e.g. break down by build id in combination with all your other dimensions to see the world through the lens of your instrumentation. Here, for example, are all the deploys for a particular service last Friday:

Each shaded area is the duration of an individual deploy: you can see the counters for each build id, as the new versions replace the old ones.
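To make "break down by build id in combination with all your other dimensions" a bit more concrete, here is a minimal sketch in Python. The event fields (`build_id`, `endpoint`, `status`, `duration_ms`) and the sample values are assumptions invented for illustration; this is not Honeycomb's schema or API, just the shape of the idea.

```python
from collections import defaultdict

# Hypothetical wide, structured events: one dict per request.
events = [
    {"build_id": "1041", "endpoint": "/batch", "status": 200, "duration_ms": 41},
    {"build_id": "1041", "endpoint": "/batch", "status": 200, "duration_ms": 38},
    {"build_id": "1042", "endpoint": "/batch", "status": 500, "duration_ms": 912},
    {"build_id": "1042", "endpoint": "/query", "status": 200, "duration_ms": 55},
    {"build_id": "1042", "endpoint": "/batch", "status": 500, "duration_ms": 887},
]


def error_rate_by(events, *dimensions):
    """Group events by any combination of fields and return the error rate per group."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        key = tuple(event[d] for d in dimensions)
        totals[key] += 1
        if event["status"] >= 500:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


# First slice by build, then add a second dimension to narrow it down.
by_build = error_rate_by(events, "build_id")
by_build_and_endpoint = error_rate_by(events, "build_id", "endpoint")
```

In this made-up data, slicing by build shows the new build erroring while the old one is clean, and adding `endpoint` shows the errors are confined to `/batch`. Pre-aggregated metrics can't answer that second question unless you predicted it in advance.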

2. Observability-driven development.

This is cultural as well as technical. By this I mean instrumenting a couple steps ahead of yourself as you are developing and shipping code. I mean making a cultural practice of asking each other “how will you know if this is broken?” during code review. I mean always going and looking at your service through the lens of your instrumentation after every diff you ship. Like muscle memory.

3. Single merge per deploy.

The number one thing you can do to make your deploys intelligible, other than observability and instrumentation, is this: deploy one changeset at a time, as swiftly as possible after it is merged to master. NEVER glom multiple changesets into a single deploy — that’s how you get into a state where you aren’t sure which change is at fault, or who to escalate to, or if it’s an intersection of multiple changes, or if you should just start bisecting blindly to try and isolate the source of the problem. THIS is what turns deploys into long, painful marathons.
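Here is a small sketch of why batching hurts, under the simplifying assumption that a deploy ships everything merged since the last deployed changeset. The changeset ids and function names are hypothetical.

```python
def changesets_in_deploy(merged, last_deployed):
    """Everything merged after the last deployed changeset ships in the next deploy."""
    return merged[merged.index(last_deployed) + 1:]


def suspect_combinations(changesets):
    """Number of non-empty subsets of changesets that could be at fault together."""
    return 2 ** len(changesets) - 1


# Hypothetical changeset ids, oldest first.
after_one_merge = ["a1", "b2"]
after_a_weekend = ["a1", "b2", "c3", "d4"]

single = changesets_in_deploy(after_one_merge, "a1")   # deploy right after the merge
batched = changesets_in_deploy(after_a_weekend, "a1")  # deploy after merges pile up
```

With one changeset per deploy there is exactly one suspect. With three changesets batched together, and allowing for interactions between them, there are already seven combinations to rule out.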

headlamps, illuminating whatever’s in front of my face: this is the image in my mind when i think about instrumenting my code

And NEVER wait hours or days to deploy after the change is merged. As a developer, you know full well how this goes. After you merge to master one of two things will happen. Either:

you promptly pull up a window to watch your changes roll out, checking on your instrumentation to see if it’s doing what you intended it to or if anything looks weird, OR
you close the project and open a new one.

When you switch to a new project, your brain starts rapidly evicting all the rich context about what you had intended to do and overwriting it with all the new details about the new project.

Whereas if you shipped that changeset right after merging, then you can WATCH it roll out. And 80-90% of all problems can be, should be caught right here, before your users ever notice, and before alerts can fire off and page you. That is, if you have the ability to break down by build id, zoom in on any errors that happen to arise, see exactly which dimensions all the errors have in common and how they differ from the healthy requests, and see exactly what the context is for any erroring requests.

Healthy feedback loops == healthy systems.

That tight, short feedback loop of build/ship/observe is the beating heart of a healthy, observable distributed system that can be run and maintained by human beings, without it sucking your life force or ruining your sleep schedule or will to live.

Most engineers have never worked on a system like this. Most engineers have no idea what a yawning chasm exists between a healthy, tractable system and where they are now. Most engineers have no idea what a difference observability can make. Most engineers are far more familiar with spending 40-50% of their week fumbling around in the dark, trying to figure out where in the system the problem they are trying to fix lives, and what kind of context they need to reproduce it.

Most engineers are dealing with systems where they blindly shipped bugs with no observability, and reports about those bugs started to trickle in over the next hours, days, weeks, months, or years. Most engineers are dealing with systems that are obfuscated and obscure, systems which are tangled heaps of bugs and poorly understood behavior for years compounding upon years on end.

That’s why it doesn’t seem like such a big deal to you to break up that tight, short feedback loop. That’s why it doesn’t fill you with horror to think of merging on Friday morning and deploying on Monday. That’s why it doesn’t appall you to clump together all the changes that happen to get merged between Friday and Monday and push them out in a single deploy.

It just doesn’t seem that much worse than what you normally deal with. You think this raging trash fire is, unfortunately … normal.

How realistic is this, though, really?

Maybe you’re rolling your eyes at me now. “Sure, Charity, that’s nice for you, on your brand new shiny system. Ours has years of technical debt. It’s unrealistic to hold us to the same standard.”

Yeah, I know. It is much harder to dig yourself out of a hole than it is to not create a hole in the first place. No doubt about that.

Harder, yes. But not impossible.

I have done it.

Parse in 2013 was a trash fire. It woke us up every night, we spent a lot of time stabbing around in the dark after every deploy. But after we got acquired by Facebook, after we started shipping some data sets into Scuba, after (in retrospect, I can say) we had event-level observability for our systems, we were able to start paying down that debt and fixing our deploy systems.

We started hooking up that virtuous feedback loop, step by step.

We reworked our CI/CD system so that it built a new artifact after every single merge.
We put developers at the steering wheel so they could push their own changes out.
We got better at instrumentation, and we made a habit of going to look at it during or after each deploy.
We hooked up the pager so it would alert the person who merged the last diff, if an alert was generated within an hour after that service was deployed.
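The last step, routing the page to whoever merged the most recent diff, can be sketched roughly like this. The deploy record fields, the names, and the fallback to the regular on-call are assumptions for illustration; the only detail taken from the list above is the one-hour window.

```python
from datetime import datetime, timedelta

# The one-hour window described above.
ATTRIBUTION_WINDOW = timedelta(hours=1)


def route_alert(alert_time, service, deploys, on_call):
    """Page the author of the latest deploy of `service` if it finished within
    the attribution window before the alert; otherwise page the on-call."""
    recent = [
        d for d in deploys
        if d["service"] == service
        and timedelta(0) <= alert_time - d["finished_at"] <= ATTRIBUTION_WINDOW
    ]
    if not recent:
        return on_call
    return max(recent, key=lambda d: d["finished_at"])["merged_by"]


# Hypothetical deploy log.
deploys = [
    {"service": "api", "finished_at": datetime(2019, 10, 25, 14, 0), "merged_by": "alice"},
    {"service": "api", "finished_at": datetime(2019, 10, 25, 15, 30), "merged_by": "bob"},
]

soon_after = route_alert(datetime(2019, 10, 25, 15, 50), "api", deploys, "oncall")
much_later = route_alert(datetime(2019, 10, 25, 17, 0), "api", deploys, "oncall")
other_service = route_alert(datetime(2019, 10, 25, 15, 50), "web", deploys, "oncall")
```

An alert twenty minutes after bob's deploy goes to bob; an alert ninety minutes later, or on a service nobody just deployed, goes to the normal rotation.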

We started finding bugs quicker, faster, and paying down the tech debt we had amassed from shipping code without observability/visibility for many years.

Developers got in the habit of shipping their own changes, and watching them as they rolled out, and finding/fixing their bugs immediately.

It took some time. But after a year of this, our formerly flaky, obscure, mysterious, massively multi-tenant service that was going down every day and wreaking havoc on our sleep schedules was tamed. Deploys were swift and drama-free. We stopped blocking deploys on Fridays, holidays, or any other days, because we realized our systems were more stable when we always shipped consistently and quickly.

Allow me to repeat. Our systems were more stable when we always shipped right after the changes were merged. Our systems were less stable when we carved out times to pause deployments. This was not common wisdom at the time, so it surprised me; yet I found it to be true over and over and over again.

This is literally why I started Honeycomb.

When I was leaving Facebook, I suddenly realized that this meant going back to the Dark Ages in terms of tooling. I had become so accustomed to having the Parse+Scuba tooling and being able to iteratively explore and ask any question without having to predict it in advance. I couldn’t fathom giving it up.

The idea of going back to a world without observability, a world where one deployed and then stared anxiously at dashboards — it was unthinkable. It was like I was being asked to give up my five senses for production — like I was going to be blind, deaf, dumb, without taste or touch.

Look, I agree with nearly everything in the author’s piece. I could have written that piece myself five years ago.

But since then, I’ve learned that systems can be better. They MUST be better. Our systems are getting so rapidly more complex, they are outstripping our ability to understand and manage them using the past generation of tools. If we don’t change our ways, it will chew up another generation of engineering lives, sleep schedules, relationships.

Observability isn’t the whole story. But it’s certainly where it starts. If you can’t see where you’re going, you can’t go very far.

Get you some observability.

And then raise your standards for how systems should feel, and how much of your human life they should consume. Do better.

Because I couldn’t agree with that other post more: it really is all about people and their real lives.

Listen, if you can swing a four day work week, more power to you (most of us can’t). Any day you aren’t merging code to master, you have no need to deploy either. It’s not about Fridays; it’s about the swift, virtuous feedback loop.

And nobody should be shamed for what they need to do to survive, given the state of their systems today.

But things aren’t gonna get better unless you see clearly how you are contributing to your present pain. And congratulating ourselves for blocking Friday deploys is like congratulating ourselves for swatting ourselves in the face with the flyswatter. It’s a gross hack.

Maybe you had a good reason. Sure. But I’m telling you, if you truly do care about people and their work/life balance: we can do a lot better.

charity.

SLOs Are The API For Your Engineering Team

October 19, 2019 · August 31, 2023 · mipsytipsy · SLOs

(Originally posted on InfoQ in October, 2019)

If your job involves direct leadership of engineering teams, managing how and what they deliver, you’ve surely come across situations where the pressure to deliver features won out and led to poor service reliability. You’ve probably had your team’s workflow disrupted by interference from senior managers about minor individual issues. Or, you’ve seen or heard execs questioning your team’s planned work to reduce technical debt or improve your delivery processes.

These kinds of clashes are extremely common between engineering teams and management, as well as among different engineering teams. They are all various manifestations of a single issue: the need for a better abstraction layer for people and teams who are trying to interact or collaborate with your team. That abstraction layer is called Service Level Objectives.

You might be furrowing your brow right now, “But I thought SLOs were for users! And isn’t that a technical thing?”

Rather than define SLIs (Service Level Indicators), SLOs (Service Level Objectives), or SLAs (Service Level Agreements) at length here — there’s plenty of documentation out there about that — here’s a quick summary:

An SLI is the indicator for goodness.
The SLO is your objective for how often you can afford for it to fail.
And an SLA is an agreement with your users about it.

What I want to focus on is why SLOs are necessary for the humans who leverage them, and in particular, how they can benefit the relationships between your team and other individuals and teams.

The common problem across the examples above is that every one of them describes a messy boundary between roles and teams. For example, a VP’s job should not be to nitpick the order you are going to resolve tasks in, or to understand every spike on every dashboard. But this desire often comes from a well-meaning place; they actually care, and this kind of interaction may be the only signal available to them about how things are going or how a user is experiencing your system.

So you need to give them something better to care about.

Establish your team’s perimeter

You need to establish the perimeter of your team, secure it, build entry points and rules for coming and going, and hold people accountable for using them correctly. And then ignore or bounce every attempt to breach the perimeter and communicate through unauthorized channels, or to get someone to make an exception for them. SLOs can help you make this happen. When you go through the process of identifying what matters to your business by establishing and agreeing upon SLOs and their associated SLIs, you have a framework for managing what is demanded of your team, and how those demands are made.

For these boundaries to hold, all stakeholders must agree on your SLIs and SLOs, and you must make sure you are measuring and tracking these appropriately. This is no small task, but for the purposes of this article, assume you have done so and everyone has signed off on a number they believe in. For example, perhaps you have agreed to an SLO stating that for every rolling 90 days, for 99.9% of users of your website, your home page will load “quickly enough” based on the SLI your engineering team has identified for latency in this situation, which might be “ten seconds”.
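As a rough sketch of the arithmetic behind that example: the 99.9% target and ten-second threshold come from the paragraph above, while counting individual page loads (rather than distinct users), the function names, and the sample numbers are simplifying assumptions.

```python
SLO_TARGET = 0.999           # 99.9%, from the example above
LATENCY_THRESHOLD_S = 10.0   # the "ten seconds" SLI threshold


def budget_report(load_times_s):
    """Summarize SLI and error budget for one window of page-load times (seconds)."""
    total = len(load_times_s)
    if total == 0:
        raise ValueError("need at least one page load to compute an SLI")
    bad = sum(1 for t in load_times_s if t > LATENCY_THRESHOLD_S)
    allowed_bad = total * (1 - SLO_TARGET)  # the error budget, in events
    return {
        "sli": (total - bad) / total,
        "bad": bad,
        "allowed_bad": allowed_bad,
        "budget_remaining": allowed_bad - bad,
    }


# Hypothetical window: 10,000 page loads, 4 of them slower than ten seconds.
window = [1.2] * 9996 + [14.0] * 4
report = budget_report(window)
```

For this made-up window the budget allows about ten slow loads and four have been spent, so the team is inside its SLO with room to spare. That is the one number a stakeholder needs to look at.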

Share the ownership of Production Excellence

Beyond their value in ensuring consistent, predictable service delivery, SLOs are a powerful weapon to wield against micromanagers, meddlers, and feature-hungry PMs. That is why it’s so important to get everyone on board and signed off on your SLO. When they sign off on it, they own it too. They agree that your first responsibility is to hold the service to a certain bar of quality. If your service has deteriorated in reliability and availability, they also agree it is your top priority to restore it to good health.

Ensuring adequate service performance requires a set of skills that people and teams need to continuously develop over time, namely: measuring the quality of our users’ experience, understanding production health with observability, sharing expertise, keeping a blameless environment for incident resolution and post-mortems, and addressing structural problems that pose a risk to service performance. They require a focus on production excellence, and a (time) budget for the team to acquire the necessary skills. The good news is that this investment is now justified by the SLOs that management agreed to. The discussion should move away from which pieces of work are being prioritized to which service objectives are we trying to achieve and keep over time.

Let’s look at three possible scenarios of how this could play out in real life.

Scenario: Boss Freaks Out

The team keeps a dashboard on the wall of errors and latency. This is great most of the time, but when the boss’s boss happens to walk by and notices a spike in errors, he freaks out and starts DMing the engineering lead, or asking the nearest engineer what is wrong.

The team now has to take valuable time out of their day to explain what’s wrong, or explain that nothing is wrong and it just looks bad because it’s unexpected user behavior. It’s time consuming and gets in the way of actually fixing things. Senior management might not understand that fifty thousand things a day are broken, and the team cannot stop and fix or care about every single one of them.

The SLOs help us train managers to care about the important things, and set the unimportant things aside. We can remind them of the page that shows the team’s SLIs and SLOs, so they can see where the team is in their error budget.

Scenario: CEO Leapfrogs Priorities

The CEO messages the engineering lead a few times a week because one user has messaged the CEO on twitter to complain about an issue affecting their particular app. The CEO wants to know what’s wrong and when she can tell the user it is fixed.

Occasionally this can be helpful, when it helps us catch problems that our monitoring didn’t catch, but far too often it just means that a user’s trivial bug leapfrogs the more important work on our roadmap. Or one of our engineers will spend time shipping a one-off fix for that user, and then we have to fix it twice.

So how can you negotiate with the CEO for less of this type of disruption to planned work?

Check to make sure that this isn’t an example of a real problem lurking or not being captured by your SLOs. Let’s say your mobile app times out and serves an error in 5 seconds. So some segment of mobile traffic is not able to load your page with its 10 second SLI, yet those users are not being identified. If it is being tracked, assure your CEO that it’s within your error budget and will be checked when appropriate. If it is not, bring it up in your SLO periodic review so you can add a new SLI or otherwise account for it in your SLO moving forward.
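The mobile timeout case is a good example of how an SLI can quietly miss real pain. In this sketch, a latency-only SLI counts a request that fails in 5 seconds as "good" because it finished under the 10 second threshold. The request fields and sample data are assumptions for illustration.

```python
LATENCY_THRESHOLD_S = 10.0

# Hypothetical requests: mobile clients time out and serve an error at 5 seconds.
requests = [
    {"client": "web", "duration_s": 2.1, "ok": True},
    {"client": "web", "duration_s": 3.4, "ok": True},
    {"client": "mobile", "duration_s": 5.0, "ok": False},
    {"client": "mobile", "duration_s": 5.0, "ok": False},
    {"client": "web", "duration_s": 11.0, "ok": True},
]


def latency_only_sli(requests):
    """Fraction of requests that finished within the threshold, success or not."""
    good = sum(1 for r in requests if r["duration_s"] <= LATENCY_THRESHOLD_S)
    return good / len(requests)


def latency_and_success_sli(requests):
    """Fraction of requests that succeeded AND finished within the threshold."""
    good = sum(
        1 for r in requests
        if r["ok"] and r["duration_s"] <= LATENCY_THRESHOLD_S
    )
    return good / len(requests)


naive = latency_only_sli(requests)
corrected = latency_and_success_sli(requests)
```

The gap between the two numbers is exactly the segment of users the CEO keeps hearing from. Folding success into the SLI (or adding a separate availability SLI) is one way to account for it at the next SLO review.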

Scenario: Feature Frenzy

As an engineering manager you need to keep the on call volume reasonable and protect your team’s ability to sleep through the night. But you might have a hard time pushing back against execs and all the stakeholders who want features shipped and bugs fixed, to carve out enough contiguous development time to address underlying architectural problems, harden your deploy pipeline, and so on. This kind of work is never the most pressing thing at any given time, even though over the long term it may be THE most important thing.

How do you wrestle back enough time to deal with technical debt? And how can you keep stakeholders from micromanaging your roadmap?

As agreed, your first job is to meet your SLO. All other feature work or bug fixing is secondary to this. An SLO is the quality of availability you have committed to provide for your users. That means it is also what you have committed your team to delivering. This has implications for what you choose to build, and when.

Based on this agreement, those asking for your time have already acknowledged that their requests are lower priority until the work your team is doing to stabilize the deployment process has been completed, for example. Perhaps you need your team to work on reducing deployment time so that a bug fix can be deployed to production via a deployment pipeline in less than 10 minutes, otherwise the corresponding SLO for restoring service will blow out.

The Knob Goes Both Ways

Conversely, some engineering teams will go on tinkering and refactoring forever in a quest for perfection, when you really need them shipping new features. How can you tell whether it’s time to stop polishing and time to get back to shipping new stuff? When you are exceeding your SLO you can stand to add more chaos to the system, so turn the knob back up.
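Reduced to its simplest form, the knob is a comparison between what you measured and what you promised. This is a deliberately tiny sketch; the function name and the choice to treat "exactly at target" as a reason to invest in reliability are assumptions, not a rule from any SLO standard.

```python
def turn_the_knob(measured_sli, slo_target):
    """Suggest where the team's next increment of effort should go."""
    if measured_sli > slo_target:
        # Exceeding the SLO: there is budget left to spend on risk.
        return "ship features"
    # At or below target: restoring the service comes first.
    return "invest in reliability"


comfortable = turn_the_knob(0.9996, 0.999)
in_trouble = turn_the_knob(0.9985, 0.999)
```

Real teams usually look at how fast the budget is burning rather than a single snapshot, but the direction of the decision is the same.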

SLOs are the right level of abstraction for agreements between teams within an org for the same reasons they are useful between companies. You don’t care about the implementation details under the hood for your network provider; you just want to know that it will be available 99.95% of the time, and clear communication when it is down and back up. Train your management and other teams to interact with you at the same level of abstraction and trust. Google has a good policy doc for how to deal with SLO violations.

In this way, SLOs are even helpful within teams. They can help perfectionists tell when it’s okay to relax and let loose a bit, and they can guide the pathological corner cutters toward knowing when it’s time to rein it in and measure twice, cut once.

A source of comfort … and more capacity to focus

Once you get used to thinking this way, it’s actually a huge relief. Instead of having to deeply understand and evaluate the risk of every single situation in its own unique glory, we have a simple common language for evaluating risk in terms of error budgets. SLOs save everyone involved both time and energy, which you can redirect toward more important things, like keeping your customers happy.

The (Real) 11 Reasons I Don’t Hire You

October 18, 2019 · July 14, 2023 · mipsytipsy · culture, interviews, management, startups, tech culture · 28 Comments

(With 🙏 to Joe Beda, whose brilliant idea for a blog post this was. Thanks for letting me borrow it!)

Interviewing is hard and it sucks.

In theory, it really shouldn’t be. You’re a highly paid professional and your skills are in high demand. This ought to be a meeting between equals to mutually explore what a longer-term relationship might look like. Why take the outcome personally? There are at least as many reasons for you to decide not to join a company as for the company to decide not to hire you, right?

In reality, of course, all the situational cues and incentives line up to make you feel like the whole thing is a referendum on whether or not you personally are Good Enough (smart enough, senior enough, skilled enough, cool enough) to join their fancy club.

People stay at shitty jobs far, far longer than they ought to, just because interviews can be so genuinely crushing to your spirit and sense of self. Even when they aren’t the worst, it can leave a lasting sting when they decline to hire you.

But there is an important asymmetry here. By not hiring someone, I very rarely mean it as a rejection of that person. (Not unless they were, like, mean to the office manager, or directed all their technical questions to the male interviewers.) On the contrary, I generally hold the people we decline to hire — or have had to let go! — in extremely high opinion.

So if someone interviews at Honeycomb, I do not want them to walk away feeling stung, hurt, or bad about themselves. I would like them to walk away feeling good about themselves and our interactions, even if one or both of us are disappointed by the outcome. I want them to feel the same way about themselves as I feel about them, especially since there’s a high likelihood that I may want to work with them in the future.

So here are the real, honest-to-god most common reasons why I don’t hire someone.

1. Scarcity

If you’ve worked at a Google or Facebook before, you may have a certain mental model of how hiring works. You ask the candidate a bunch of questions, and if they do well enough, you hire them. This could not be more different from early stage startup hiring, which is defined in every way by scarcity.

I only have a few precious slots to fill this year, and every single one of them is tied to one or more key company initiatives or goals, without which we may fail as a company. Emily and I spend hours obsessively discussing the profile we are looking for, the smallest possible set of key strengths and skills this hire must have, inter-team and intra-team dynamics, and which elements are missing from the team as it stands or need to be bolstered. And at the end of the day, there are not nearly as many slots to fill as there are awesome people we’d like to hire. Not even close. Having to choose between several differently wonderful people can be *excruciating*.

2. Diversity.

No, not that kind. (Yes, we care about cultivating a diverse team and support that goal through our recruiting and hiring processes, but it’s not a factor in our hiring decisions.) I mean your level, stage in your career, educational background, professional background, trajectory, areas of focus and strengths. We are trying to build radical new tools for sociotechnical systems; tools that are friendly, intuitive, and accessible to every engineer (and engineering-adjacent profession) in the world.

How well do you think we’re going to do at our goal if the people building it are all ex-Facebook, ex-MIT senior engineers? If everyone has the exact same reference points and professional training, we will all have the same blind spots. Even if our team looks like a fucking Benetton ad.

3. We are assembling a team, not hiring individuals.

We spend at least as much time hashing out what the subtle needs of the team are right now as talking about the individual candidate. Maybe what we need is a senior candidate who loves mentoring with her whole heart, or a language polyglot who can help unify the look and feel of our integrations across ten different languages and platforms. Or maybe we have plenty of accomplished mentors, but the team is really lacking someone with expertise in query profiling and db tuning, and we expect this to be a big source of pain in the coming year. Maybe we realize we have nobody on the team who is interested in management, and we are definitely going to need someone to grow into or be hired on as a manager a year or two from now.

There is no value judgment or hierarchy attached to any of these skills or particulars. We simply need what we need, and you are who you are.

4. I am not confident that we can make you successful in this role at this time.

We rarely turn people down for purely technical reasons, because technical skills can be learned. But sometimes there is some combination of your skills, past experience, geographical location, time zone, experience with working remotely, etc. that just gives us pause. If we cast forward a year, do we think you are going to be joyfully humming along and enjoying yourself, working more-or-less independently and collaboratively? If we can’t convince ourselves this is true, for whatever reasons, we are unlikely to hire you. (But we would love to talk with you again someday.)

5. The team needs someone operating at a different level.

Don’t assume this always means “you aren’t senior enough”. We have had to turn down people at least as often for being too senior as not senior enough. An organization can only absorb so many principal and senior engineers; there just isn’t enough high-level strategic work to go around. I believe happy, healthy teams are composed of a range of levels: you need more junior folks asking naive questions that give senior folks the opportunity to explain themselves and catch their dumb mistakes. You need there to be at least one sweet child who is just so completely stoked to build their very first login page.

A team staffed with nothing but extremely senior developers will be a dysfunctional, bored and contentious team where no one is really growing up or being challenged as they should.

6. We don’t have the kind of work you need or want.

The first time we tried hiring junior developers, we ran into this problem hardcore. We simply didn’t have enough entry-level work for them to do. Everything was frustratingly complex and hard for them, so they weren’t able to operate independently, and we couldn’t spare an engineer to pair with them full time.

This also manifests in other ways. Like, lots of SREs and data engineers would LOVE to work at Honeycomb. But we don’t have enough ops engineering work or data problems to keep them busy full time. (Well, that’s not precisely true. They could probably keep busy. But it wouldn’t be aligned with our core needs as a business, which makes them premature optimizations we cannot afford.)

7. Communication skills.

We select highly for communication skills. The core of our technical interview involves improving and extending a piece of code, then bringing it in the next day to discuss it with your peers. We believe that if you can explain what you did and why, you can definitely do the work, and the reverse is not necessarily true. We also believe that communication skills are at the foundation of a team’s ability to learn from its mistakes and improve as a unit. We value high-performing teams; therefore, we select for those skills.

There are many excellent engineers who are not good communicators, or who do not value communication the way we do, and while we may respect you very much, it’s not a great fit for our team.

8. You don’t actually want to work at a startup.

“I really want to work at a startup. Also the things that are really important to me are: work/life balance, predictability, high salary, gold benefits, stability, working from 10 to 5 on the dot, knowing what I’ll be working on for the next month, not having things change unexpectedly, never being on call, never needing to think or care about work out of hours …”

To be clear, it is not a red flag if you care about work/life balance. We care about that too — who the hell doesn’t? But startups are inherently more chaotic and unpredictable, and roles are more fluid and dynamic, and I want to make sure your expectations are aligned with reality.

9. You just want to work for women.

I hate it when I’m interviewing someone and I ask why they’re interested in Honeycomb, and they enthusiastically say “Because it was founded by women!”, and I wait for the rest of it, but that’s all there is. That’s it? Nothing interests you about the problem, the competitive space, the people, the customers … nothing?? It’s fine if the leadership team is what first caught your eye. But it’s kind of insulting to just stop there. Just imagine if somebody asked you out on a date “because you’re a woman”. Low. Fucking. Bar.

10. I truly want you to be happy.

I have no interest in making a hard sell to people who are dubious about Honeycomb. I don’t want to hire people who can capably do the job, but whose hearts are really elsewhere doing other things, or who barely tolerate going to work every day. I want to join with people who see their labor as an extension of themselves, who see work as an important part of their life’s project. I only want you to work here if it’s what’s best for you.

11. I’m not perfect.

We have made the wrong decision before, and will do so again. >_<

In conclusion…

As a candidate, it is tempting to feel like you will get the job if you are awesome enough, therefore if you do not get the job it must be because you were insufficiently awesome. But that is not how hiring works — not for highly constrained startups, anyway.

If we brought you in for an interview, we already think you’re awesome. Period. Now we’re just trying to figure out if you narrowly intersect the skill sets we are lacking that we need to succeed this year.

If you could be a fly on the wall, listening to us talk about you, the phrase you would hear over and over is not “how good are they?”, but “what will they need to be successful? can we provide the support they need?” We know this is as much of a referendum on us as it is on you. And we are not perfect.

But we are hiring. ☺️