Love (and Alerting) in the Time of Cholera (and Observability)

I made a vow this year to post one blog post a month, then I didn’t post anything at all from May to September.  I have some catching up to do.  😑   I’ve also been meaning to transcribe some of the twitter rants that I end up linking back to into blog posts, so if there’s anything you especially want me to write about, tell me now while I’m in repentance mode.

This is one request I happened to make a note of because I can’t believe I haven’t already written it up!  I’ve been saying the same thing over and over in talks and on twitter for years, but apparently never a blog post.

The question is: what is the proper role of alerting in the modern era of distributed systems?  Has it changed?  What are the updated best practices for alerting?

It’s a great question.  I want to wax philosophical about some stuff, but first let me briefly outline how to modernize your alerting best practices:

  1. implement observability
  2. implement SLOs and/or end-to-end checks that traverse key code paths and correlate to user-impacting events
  3. create a secondary channel (tasks, ticketing system, whatever) for “things that on call should look at soon, but are not impacting users yet” which does not page anyone, but which on call is expected to look at (at least) first thing in the morning, last thing in the evening, and midday
  4. move as many paging alerts as possible to the secondary channel, by engineering your services to auto-remediate or run in degraded mode until they can be patched up
  5. wake people up only for SLOs and health checks that correlate to user-impacting events

Or, in an even shorter formulation: delete all your paging alerts, then page only on e2e alerts that mean users are in pain.  Rely on debugging tools for debugging, and reserve paging for when users are actually in pain.
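To make that concrete, here’s a minimal sketch (in Python) of the kind of end-to-end check steps 2 and 5 are talking about: exercise a code path that makes you money, the way a real user would, and page only when it fails repeatedly.  The checkout URL, the latency threshold, and the page() helper are all invented for illustration; in real life they’d be your own critical path and whatever paging tool you already use.

```python
import time

import requests  # third-party: pip install requests

# All of these are hypothetical stand-ins for your own critical path.
CHECKOUT_URL = "https://api.example.com/v1/checkout/health"
LATENCY_SLO_SECONDS = 2.0
CONSECUTIVE_FAILURES_TO_PAGE = 3


def end_to_end_check() -> bool:
    """Exercise the money path the way a real user would, and report
    whether the user-visible result was acceptable."""
    start = time.monotonic()
    try:
        resp = requests.post(CHECKOUT_URL, json={"synthetic": True}, timeout=10)
    except requests.RequestException:
        return False
    elapsed = time.monotonic() - start
    return resp.status_code == 200 and elapsed < LATENCY_SLO_SECONDS


def page(message: str) -> None:
    """Stub: wire this up to PagerDuty/Opsgenie/whatever you actually use."""
    print(f"PAGE: {message}")


if __name__ == "__main__":
    # Require several consecutive failures before waking anyone up,
    # so a single blip doesn't cost a human their sleep.
    failures = 0
    while True:
        if end_to_end_check():
            failures = 0
        else:
            failures += 1
            if failures >= CONSECUTIVE_FAILURES_TO_PAGE:
                page("checkout path is failing end to end: users are in pain")
                failures = 0
        time.sleep(60)
```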

To understand why I advocate deleting all your paging alerts, and when it’s safe to delete them, first we need to understand why we have accumulated so many crappy paging alerts over the years.

Monoliths, LAMP stacks, and death by pagebomb

Here, let’s crib a couple of slides from one of my talks on observability.  Here are the characteristics of older monolithic LAMP-stack style systems, and best practices for running them:

[slides: characteristics of older monolithic LAMP-stack systems, and best practices for running them]

The sad truth is that when all you have is time series aggregates and traditional monitoring dashboards, you aren’t really debugging with science so much as relying on your gut and a handful of dashboards, using intuition and scraps of data to try to reconstruct an impossibly complex system state.

This works ok, as long as you have a relatively limited set of failure scenarios that happen over and over again.  You can just pattern match from past failures to current data, and most of the time your intuition can bridge the gap correctly.  Every time there’s an outage, you post mortem the incident, figure out what happened, build a dashboard “to help us find the problem immediately next time”, create a detailed runbook for how to respond to it, and (often) configure a paging alert to detect that scenario.

Over time you build up a rich library of these responses.  So most of the time when you get paged you get a cluster of pages that actually serves to help you debug what’s happening.  For example, at Parse, if the error graph had a particular shape I immediately knew it was a redis outage.  Or, if I got paged about a high % of app servers all timing out in a short period of time, I could be almost certain the problem was due to mysql connections.  And so forth.

Things fall apart; the pagebomb cannot stand

However, this model falls apart fast with distributed systems.  There are just too many failures.  Failure is constant, continuous, eternal.  Failure stops being interesting.  It has to stop being interesting, or you will die.


Instead of a limited set of recurring error conditions, you have an infinitely long list of things that almost never happen … except that one time they do.  If you invest your time in runbooks and monitoring checks for each of them, that time is wasted if the edge case never happens again.

Frankly, any time you get paged about a distributed system, it should be a genuinely new failure that requires your full creative attention.  You shouldn’t just be checking your phone, going “oh THAT again”, and flipping through a runbook.

And thus you should actually have drastically fewer paging alerts than you used to.

A better way: observability and SLOs.

Instead of paging alerts for every specific failure scenario, the technically correct answer is to define your SLOs (service level objectives) and page only on those, i.e. when you are going to run out of budget ahead of schedule.  But most people aren’t yet operating at this level of sophistication.  (SLOs sound easy, but are unbelievably challenging to do well; many great teams have tried and failed.  This is why we have built an SLO feature into Honeycomb that does the heavy lifting for you.  Currently alpha testing with users.)
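To make “run out of budget ahead of schedule” concrete, here’s a back-of-the-envelope sketch of the burn-rate arithmetic, assuming a hypothetical 99.9% SLO over a 30-day window.  The numbers and the paging threshold are illustrative, not a recommendation:

```python
# Hypothetical SLO: 99.9% of requests succeed, measured over 30 days.
SLO_TARGET = 0.999
WINDOW_DAYS = 30


def burn_rate(bad_events: int, total_events: int) -> float:
    """How fast we're consuming error budget, relative to the rate that
    would exactly exhaust it by the end of the window.  1.0 means we land
    exactly on target; anything above 1.0 means we run out early."""
    if total_events == 0:
        return 0.0
    budget = 1.0 - SLO_TARGET  # 0.001, i.e. 0.1% of requests may fail
    return (bad_events / total_events) / budget


# Say the last hour saw 60 failures out of 10,000 requests:
rate = burn_rate(bad_events=60, total_events=10_000)
print(f"burn rate: {rate:.1f}x")  # 6.0x, so a 30-day budget is gone in ~5 days

# An illustrative fast-burn paging threshold (this particular constant
# comes from the Google SRE workbook): 14.4x means burning 2% of a
# 30-day budget in a single hour.
PAGE_THRESHOLD = 14.4
if rate > PAGE_THRESHOLD:
    print("page a human: we will run out of budget ahead of schedule")
else:
    print("no page; watch it, or file it in the secondary channel")
```

Note what’s absent: nothing in there cares which host or component is failing.  The page fires on the trajectory of user pain, full stop.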

If you haven’t yet caught the SLO religion, the alternate answer is that “you should only page on high level end-to-end alerts, the ones which traverse the code paths that make you money and correspond to user pain”.  Alert on the three golden signals: request rate, latency, and errors, and make sure to traverse every shard and/or storage type in your critical path.

That’s it.  Don’t alert on the state of individual storage instances, or replication, or anything that isn’t user-visible.

(To be clear: by “alert” I mean “paging humans at any time of day or night”.  You might reasonably choose to page people during normal work hours, but during sleepy hours most errors should be routed to a non-paging address.  Only wake people up for actual user-visible problems.)
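Here’s a minimal sketch of that routing rule, with the secondary channel from step 3 as the default destination for everything that isn’t user-visible.  The alert fields, the hours, and the channel names are all made up for illustration:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Alert:
    name: str
    user_impacting: bool  # does this correlate to user pain?
    slo_breach: bool      # is the error budget actually burning?


def route(alert: Alert, now: datetime) -> str:
    """Decide where an alert goes: the pager, or the secondary channel."""
    # Only user-impacting SLO breaches are ever allowed to wake someone up.
    if alert.user_impacting and alert.slo_breach:
        return "pager"
    # Paging during normal work hours only is a reasonable middle ground
    # for user-visible problems that aren't burning budget yet.
    if alert.user_impacting and 9 <= now.hour < 18:
        return "pager"
    # Everything else lands in the secondary channel, to be reviewed
    # morning, midday, and evening by whoever is on call.
    return "ticket-queue"


lag = Alert("replica lag high", user_impacting=False, slo_breach=False)
print(route(lag, datetime.now()))  # "ticket-queue", no matter the hour
```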

Here’s the thing.  The reason we had all those paging alerts was because we depended on them to understand our systems.

Once you make the shift to observability, once you have rich instrumentation and the ability to swiftly zoom in from high level “there might be a problem” to identifying specifically what the errors have in common, or the source of the problem — you no longer need to lean on that scattershot bunch of pagebombs to understand your systems.  You should be able to confidently ask any question of your systems, understand any system state — even if you have never encountered it before.

With observability, you debug by systematically following the trail of crumbs back to its source, whatever that is.  Those paging alerts were a crutch, and now you don’t need them anymore.
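As a toy illustration of that zooming-in motion, here’s a sketch that takes wide structured events, filters to the errors, and asks which attribute values the failures have in common.  The event fields are invented for the example; real observability tooling (Honeycomb et al) does this interactively, over billions of events:

```python
from collections import Counter

# Invented wide events: imagine one per request, with many more fields.
events = [
    {"status": 500, "region": "us-east", "build": "v412", "db_shard": "s3"},
    {"status": 200, "region": "us-east", "build": "v411", "db_shard": "s1"},
    {"status": 500, "region": "us-east", "build": "v412", "db_shard": "s3"},
    {"status": 200, "region": "eu-west", "build": "v412", "db_shard": "s2"},
    {"status": 500, "region": "us-west", "build": "v412", "db_shard": "s3"},
]

errors = [e for e in events if e["status"] >= 500]

# For each dimension, how concentrated are the failures?
for field in ("region", "build", "db_shard"):
    value, count = Counter(e[field] for e in errors).most_common(1)[0]
    print(f"{field}: {count}/{len(errors)} errors share {value!r}")

# The output points straight at db_shard 's3' (3 of 3) and build 'v412':
# the trail of crumbs, followed with queries instead of guesses.
```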

Everyone is on call && on call doesn’t suck.

I often talk about how modern systems require software ownership.  The person who is writing the software, who has the original intent in their head, needs to shepherd that code out into production and watch real users use it.  You can’t chop that up into multiple roles, dev and ops.  You just can’t.  Software engineers working on highly available systems need to be on call for their code.

But the flip side of this responsibility belongs to management.  If you’re asking everyone to be on call, it is your sworn duty to make sure that on call does not suck.  People shouldn’t have to plan their lives around being on call.  People shouldn’t have to expect to be woken up on a regular basis.  Every paging alert out of hours should be as serious as a heart attack, and this means allocating real engineering resources to keeping tech debt down and noise levels low.

And the way you get there is to invest in observability first, then delete all your paging alerts and start over from scratch.

It works.  It really does. 🌈


17 Reasons NOT To Be A Manager

Yesterday we had a super fun meetup here at Intercom in Dublin.  We split up into small discussion groups and talked about things related to managing teams and being a senior individual contributor (IC), and going back and forth throughout your career.

One interesting question that came up repeatedly was: “what are some reasons that someone might not want to be a manager?”

Fascinatingly, I heard it asked over the full range of tones from extremely positive (“what kind of nutter wouldn’t want to manage a team?!”) to extremely negative (“who would ever want to manage a team?!”).  So I said I would write a piece and list some reasons.

Point of order: I am going to focus on intrinsic reasons, not external ones.  There are lots of toxic orgs where you wouldn’t want to be a manager for many reasons — but that list is too long and overwhelming, and I would argue you probably don’t want to work there in ANY capacity.  Please assume the surroundings of a functional, healthy org (I know, I know — whopping assumption).

https://twitter.com/jetpack/status/1169675819716763649

1. You love what you do.

Never underestimate this one, and never take it for granted.  If you look forward to work and even miss it on vacation; if you occasionally leave work whistling with delight and/or triumph; if your brain has figured out how to wring out regular doses of dopamine and serotonin while delivering ever-increasing value; if you look back with pride at what you have learned and built and achieved, if you regularly tap into your creative happy place … hell, your life is already better than 99.99% of all the humans who have ever labored and lived.  Don’t underestimate the magnitude of your achievement, and don’t assume it will always be there waiting for you to just pick it right back up again.

https://twitter.com/jetpack/status/1169780841532276736

2. It is easy to get a new engineering job.  Really, really easy.

Getting your first gig as an engineer can be a challenge, but after that?  It is possibly easier for an experienced engineer to find a new job than anyone else on the planet.  There is so much demand for this skill set that we actually complain about how annoying it is to be constantly recruited!  Amazing.

It is typically harder to find a new job as a manager.  If you think interview processes for engineers are terrible (and they are, honey), they are even weirder and less predictable (and more prone to implicit bias) for managers.  So much of manager hiring is about intangibles like “culture fit” and “do I like you” — things you can’t practice or study or know if you’ve answered correctly.  And soooo much of your skill set is inevitably bound up in navigating the personalities and bureaucracies of particular teams and a particular company.  A manager’s effectiveness is grounded in trust and relationships, which makes it much less transferable than engineering skills.

3. There are fewer management jobs.

I am not claiming it is equally trivial for everyone to get a new job; it can be hard if you live in an out-of-the-way place, or have an unusual specialty, etc.  But in almost every case, it becomes harder if you’re a manager.  Besides — given that the ratio of engineers to line managers is roughly seven to one — there will be almost an order of magnitude fewer eng manager jobs than engineering jobs.

4. Manager jobs are the first to get cut.

Engineers (in theory) add value directly to the bottom line.  Management is, to be brutally frank, overhead.  Middle management is often the first to be cut during layoffs.

Remember how I said that creation is the engineering superpower?  That’s a nicer way of saying that managers don’t directly create any value.  They may indirectly contribute to increased value over time — the good ones do — but only by working through other people, as a force multiplier, mentor, etc.  When times get tough, you don’t cut the people who build the product, you cut the ones whose value-add is contingent or harder to measure.

Another way this plays out is when companies are getting acquired.  As a baseline for acquihires, the acquiring company will estimate a value of $1 million per engineer, then deduct $500k for every other role being acquired.  Ouch.

5. Managers can’t really job hop.

Whereas it’s completely normal for an engineer to hop jobs every 1-3 years, a manager who does this will not get points for learning a wide range of skills; they’ll be seen as “probably difficult to work with”.  I have no data to support this, but I suspect the job tenure of a successful manager is at least 2-3x as long as that of a successful IC.  It takes a year or two just to gain the trust of everyone on your team and the adjacent teams, and to learn the personalities involved in navigating the organization.  At a large company, it may take a few times that long.  I was a manager at Facebook for 2.5 years and I was still learning some critical new detail about managing teams there on a weekly basis.  Your value to the org really kicks in after a few years have gone by, once a significant part of the way things get done resides in your cranium.

https://twitter.com/jetpack/status/1169698084240031744

6. Engineers can be little shits.

You know the type.  Sneering about how managers don’t do any “real work”, looking down on them for being “less technical”.  Basically everyone who utters the question “… but how technical are they?” in that particular tone of voice is a shitbird.  Hilariously, we had a great conversation about whether a great manager needs to be technical or not — many people sheepishly admitted that the best managers they had ever had knew absolutely nothing about technology, and yet they gave managers coding interviews and expected them to be technical.  Why?  Mostly because the engineers wouldn’t respect them otherwise.

https://twitter.com/jetpack/status/1169685458340573184

7.  As a manager, you will need to have some hard conversations.  Really, really hard ones.

Do you shy away from confrontation?  Does it seriously stress you out to give people feedback they don’t want to hear?  Manager life may not be for you.  There hopefully won’t be too many of these moments, but when they do happen, they are likely to be of outsized importance.  Having a manager who avoids giving critical feedback can be really damaging, because it deprives you of the information you need to make course corrections before the problem becomes really big and hard.

8.  A manager’s toolset is smaller than you think.

As an engineer, if you really feel strongly about something, you just go off and do it yourself.  As a manager, you have to lead through influence and persuasion and inspiring other people to do things.  It can be quite frustrating.  “But can’t I just tell people what to do?” you might be thinking.  And the answer is no.  Any time you have to tell someone what to do using your formal authority, you have failed in some way and your actual influence and power will decrease.  Formal authority is a blunt, fragile instrument.

9. You will get none of the credit, and all of the blame.

When something goes well, it’s your job to push all the credit off onto the people who did the work.  But if you failed to ship, or hire, or whatever?  The responsibility is all on you, honey.

https://twitter.com/jetpack/status/1169828158566125569

10.  Use your position as an IC to bring balance to the Force.

I LOVE working in orgs where ICs have power and use their voices.  I love having senior ICs around who model that, who walk around confidently assuming that their voice is wanted and needed in the decision-making process.  If your org is not like that, do you know who is best positioned to shift the balance of power back?  Senior ICs, with some behind-the-scenes support from managers.  For this reason, I am always a little sad when a vocal, powerful IC who models this behavior transitions to management.  If ALL of the ICs who act this way become managers, it sends a very dismaying message to the ranks — that you only speak up if you’re in the process of converting to management.

11.  Management is just a collection of skills, and you should be able to do all the fun ones as an IC.

Do you love mentoring?  Interviewing, constructing hiring loops, defining the career ladder?  Do you love technical leadership and teaching other people, or running meetings and running projects?  Any reasonably healthy org should encourage all senior ICs to participate and have leadership roles in these areas.  Management can be unbundled into a lot of different skills and roles, and the only ones that are necessarily confined to management are the shitty ones, like performance reviews and firing people.  I LOVE it when an engineer expresses the desire to start learning more management skills, and will happily brainstorm with them on next steps — get an intern? run team meetings?  there are so many things to choose from!  When I say that all engineers should try management at some point in their career, what I really mean is these are skills that every senior engineer should develop.  Or as Jill says:

12. Joy is much harder to come by.

That dopamine drip in your brain from fixing problems and learning things goes away, and it’s … real tough.  This is why I say you need to commit to a two-year stint if you’re going to try management: that, plus it takes that long to start to get your feet under you, and it’s hard on your team if they’re switching managers all the time.  It usually takes a year or two to rewire your brain to look for the longer-timeline, less intense rewards you get from coaching other people to do great things.  For some of us, it never does kick in.  It’s genuinely hard to know whether you’ve done anything worth doing.

https://twitter.com/jetpack/status/1169826158751338497

13. It will take up emotional space at the expense of your personal life.

When I was an IC, I would work late and then go out and see friends or meet up at the pub almost every night.  It was great for my dating life and social life in general.  As a manager, I feel like curling up in a fetal position and rolling home around 4 pm.  I’m an introvert, and while my capacity has increased a LOT over the past several years, I am still sapped every single day by the emotional needs of my team.

14. Your time doesn’t belong to you.

It’s hard to describe just how much your life becomes not your own.

https://twitter.com/jetpack/status/1169804763933753345

15. Meetings.

16. If technical leadership is what your heart loves most, you should NOT be a manager.

If you are a strong tech lead and you convert to management, it is your job to begin slowly taking yourself out of the loop as tech lead and promoting others in your place.  Your technical skills will stop growing at the point that you switch careers, and will slowly decay after that.  Moreover, if you stay on as tech lead/manager, you will slowly suck all the oxygen from the room.  It is your job to train up your replacements, hand over to them, and gradually step out of the way, period.

17. It will always be there for you later.

In conclusion

Given all this, why should ANYONE ever be a manager?  Shrug.  I don’t think there’s any one good or bad answer.  I used to think a bad answer would be “to gain power and influence” or “to route around shitty communication systems”, but in retrospect those were my reasons, and I think things turned out fine.  It’s a complex calculation.  If you want to try it and the opportunity arises, try it!  Just commit to the full two-year experiment, and pour yourself into learning it like you’re learning a new career — since, you know, you are.

https://twitter.com/jetpack/status/1169641930713645057

But please do be honest with yourself.  One thing I hate is when someone wants to be a manager, and I ask why, and they rattle off a list of reasons they’ve heard that people SHOULD want to become managers (“to have a greater impact than I can with just myself, because I love helping other people learn and grow, etc”) but I am damn sure they are lying to themselves and/or me.

Introspection and self-knowledge are absolutely key to being a decent manager, and lord knows we need more of those.  So don’t kick off your grand experiment by lying to yourself, ok?
