Dan Golant asked a great question today: “Any advice/reading on how to establish a team’s critical path?”
I repeated back: “establish a critical path?” and he clarified:
Yea, like, you talk about buttoning up your “critical path”, making sure it’s well-monitored etc. I think that the right first step to really improving Observability is establishing what business processes *must* happen, what our “critical paths” are. I’m trying to figure out whether there are particularly good questions to ask that can help us document what these paths are for my team/group in Eng.
“Critical path” is one of those phrases that I think I probably use a lot. Possibly because the very first real job I ever had was when I took a break from college and worked at criticalpath.net (“we handle the world’s email”) — and by “work” I mean, “lived in SF for a year when I was 18 and went to a lot of raves and did a lot of drugs with people way cooler than me”. Then I went back to college, the dotcom boom crashed, and the CP CFO and CEO went to jail for cooking the books, making them the only tech execs I am aware of who actually did.
Where was I.
Right, critical path. What I said to Dan is this: “What makes you money?”
Like, if you could only deploy three end-to-end checks that would perform entire operations on your site and ensure they work at all times, what would they be? What would they do? “Submit a payment” is a super common one; another is new user signups.
The idea here is to draw up a list of the things that are absolutely worth waking someone up to fix immediately, night or day, rain or shine. That list should be as compact and well-defined as possible. This allows you to be explicit about the fact that anything else can wait til morning, or some other less-demanding service level agreement.
And typically the right place to start on this list is by asking yourselves: “what makes us money?” as a proxy for the real questions, which are: “what actions allow us to survive as a business? What do our customers care the absolute most about? What makes us us?” That’s your critical path.
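To make this concrete, here is a minimal sketch of what one of those end-to-end checks might look like. Everything in it is a hypothetical stand-in — the endpoint, the synthetic payment token, the latency budget — so substitute whatever actually makes your business money:

```python
import time
import requests

# Hypothetical endpoint and latency budget; replace with your own critical path.
CHECKOUT_URL = "https://api.example.com/v1/payments"
LATENCY_BUDGET_SECONDS = 2.0

def check_submit_payment() -> bool:
    """Exercise the entire payment path, end to end, like a real user would."""
    start = time.monotonic()
    try:
        resp = requests.post(
            CHECKOUT_URL,
            json={"amount_cents": 100, "source": "tok_synthetic_test"},
            timeout=LATENCY_BUDGET_SECONDS,
        )
    except requests.RequestException:
        return False  # timeouts and connection errors are user pain too
    elapsed = time.monotonic() - start
    # Pass only if the whole operation succeeded AND was fast enough
    # that a real user would not be in pain.
    return resp.status_code == 200 and elapsed <= LATENCY_BUDGET_SECONDS

if __name__ == "__main__":
    if not check_submit_payment():
        print("PAGE SOMEONE: the payment path is broken")
```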
Someone will usually seize this opportunity to argue that absolutely any deterioration in service is worth paging someone immediately to fix it, day or night. They are wrong, but it’s good to flush these assumptions out and have this argument, kindly, out in the open.
(Also, this is really a question about service level objectives. So if you’re asking yourself about the critical path, you should probably consider buying Alex Hidalgo’s book on SLOs, and you may want to look into the Honeycomb SLO product, the only one in the industry that actually implements SLOs as the Google SRE book defines them (thanks Liz!) and lets you jump straight from “what are our customers experiencing?” to “WHY are they experiencing it”, without bouncing awkwardly from aggregate metrics to logs and back and just … hoping … the spikes line up according to your visual approximations.)
I made a vow this year to post one blog post a month, then I didn’t post anything at all from May to September. I have some catching up to do. 😑 I’ve also been meaning to transcribe some of the twitter rants that I end up linking back to into blog posts, so if there’s anything you especially want me to write about, tell me now while I’m in repentance mode.
This is one request I happened to make a note of because I can’t believe I haven’t already written it up! I’ve been saying the same thing over and over in talks and on twitter for years, but apparently never a blog post.
The question is: what is the proper role of alerting in the modern era of distributed systems? Has it changed? What are the updated best practices for alerting?
@mipsytipsy I've seen your thoughts on dashboards vs searching but haven't seen many thoughts from you on Alerting. Let me know if I've missed a blog somewhere on that! 🙂
It’s a great question. I want to wax philosophical about some stuff, but first let me briefly outline the way to modernize your alerting best practices:
implement SLOs and/or end-to-end checks that traverse key code paths and correlate to user-impacting events
create a secondary channel (tasks, ticketing system, whatever) for “things that on call should look at soon, but are not impacting users yet” which does not page anyone, but which on call is expected to look at (at least) first thing in the morning, last thing in the evening, and midday
move as many paging alerts as possible to the secondary channel, by engineering your services to auto-remediate or run in degraded mode until they can be patched up
wake people up only for SLOs and health checks that correlate to user-impacting events
Or, in an even shorter formulation: delete all your paging alerts, then page only on e2e alerts that mean users are in pain. Rely on debugging tools for debugging, and on pages only to tell you that users are in pain.
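The routing rule in those steps boils down to a single conditional. A sketch, with hypothetical channel names and a deliberately crude notion of “user-impacting”:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    user_impacting: bool  # failing e2e check or SLO burn? then True

def route(alert: Alert) -> str:
    """Page a human only for user pain; everything else goes to the
    secondary channel that on call reviews morning, midday, and evening."""
    return "pager" if alert.user_impacting else "triage-queue"

# A replica falling behind is not user pain (yet): no page.
assert route(Alert("replication_lag_high", user_impacting=False)) == "triage-queue"
# A failing checkout e2e check IS user pain: wake someone up.
assert route(Alert("checkout_e2e_failing", user_impacting=True)) == "pager"
```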
To understand why I advocate deleting all your paging alerts, and when it’s safe to delete them, first we need to understand why we have accumulated so many crappy paging alerts over the years.
Monoliths, LAMP stacks, and death by pagebomb
Here, let me crib a couple of slides from one of my talks on observability, covering the characteristics of older monolithic LAMP-stack style systems and the best practices for running them.
The sad truth is that when all you have is time series aggregates and traditional monitoring dashboards, you aren’t really debugging with science so much as relying on your gut and a handful of dashboards, using intuition and scraps of data to try to reconstruct an impossibly complex system state.
This works ok, as long as you have a relatively limited set of failure scenarios that happen over and over again. You can just pattern match from past failures to current data, and most of the time your intuition can bridge the gap correctly. Every time there’s an outage, you post mortem the incident, figure out what happened, build a dashboard “to help us find the problem immediately next time”, create a detailed runbook for how to respond to it, and (often) configure a paging alert to detect that scenario.
Over time you build up a rich library of these responses. So most of the time when you get paged you get a cluster of pages that actually serves to help you debug what’s happening. For example, at Parse, if the error graph had a particular shape I immediately knew it was a redis outage. Or, if I got paged about a high % of app servers all timing out in a short period of time, I could be almost certain the problem was due to mysql connections. And so forth.
Things fall apart; the pagebomb cannot stand
However, this model falls apart fast with distributed systems. There are just too many failures. Failure is constant, continuous, eternal. Failure stops being interesting. It has to stop being interesting, or you will die.
Instead of a limited set of recurring error conditions, you have an infinitely long list of things that almost never happen … except that one time they do. Any time you invest in runbooks and monitoring checks for those edge cases is wasted if they never happen again.
Frankly, any time you get paged about a distributed system, it should be a genuinely new failure that requires your full creative attention. You shouldn’t just be checking your phone, going “oh THAT again”, and flipping through a runbook.
Oh damn this talk looks baller. 😍 "Failure is important, but it is no longer interesting" — @this_hits_home… Netflix once again shining the light on where the rest of us need to get to over the next 3-5 years. 🙌🏅🎬 https://t.co/OY40Y0BTSa
And thus you should actually have drastically fewer paging alerts than you used to.
A better way: observability and SLOs.
Instead of paging alerts for every specific failure scenario, the technically correct answer is to define your SLOs (service level objectives) and page only on those, i.e. when you are going to run out of budget ahead of schedule. But most people aren’t yet operating at this level of sophistication. (SLOs sound easy, but are unbelievably challenging to do well; many great teams have tried and failed. This is why we have built an SLO feature into Honeycomb that does the heavy lifting for you. Currently alpha testing with users.)
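For intuition, “running out of budget ahead of schedule” is simple arithmetic: compare your recent error rate to the rate your SLO allows. A rough sketch — the numbers and paging threshold here are illustrative, not a description of any particular product’s implementation:

```python
def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """How fast the error budget is being consumed, relative to the
    allowed pace. 1.0 = the budget lasts exactly the SLO window;
    much more than that = you will run out ahead of schedule."""
    if total == 0:
        return 0.0
    error_rate = failed / total
    budget = 1.0 - slo_target  # a 99.9% target leaves a 0.1% budget
    return error_rate / budget

# 99.9% availability target, 0.5% of recent requests failing:
rate = burn_rate(failed=50, total=10_000, slo_target=0.999)
print(f"burn rate: {rate:.1f}x")  # 5.0x: budget gone in 1/5 of the window
if rate > 2.0:  # illustrative paging threshold
    print("page: on track to exhaust the error budget early")
```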
If you haven’t yet caught the SLO religion, the alternate answer is that “you should only page on high level end-to-end alerts, the ones which traverse the code paths that make you money and correspond to user pain”. Alert on the golden signals (request rate, latency, and errors), and make sure to traverse every shard and/or storage type in your critical path.
That’s it. Don’t alert on the state of individual storage instances, or replication, or anything that isn’t user-visible.
(To be clear: by “alert” I mean “paging humans at any time of day or night”. You might reasonably choose to page people during normal work hours, but during sleepy hours most errors should be routed to a non-paging address. Only wake people up for actual user-visible problems.)
Here’s the thing. The reason we had all those paging alerts was because we depended on them to understand our systems.
Once you make the shift to observability, once you have rich instrumentation and the ability to swiftly zoom in from high level “there might be a problem” to identifying specifically what the errors have in common, or the source of the problem — you no longer need to lean on that scattershot bunch of pagebombs to understand your systems. You should be able to confidently ask any question of your systems, understand any system state — even if you have never encountered it before.
With observability, you debug by systematically following the trail of crumbs back to their source, whatever that is. Those paging alerts were a crutch, and now you don’t need them anymore.
Everyone is on call && on call doesn’t suck.
I often talk about how modern systems require software ownership. The person who is writing the software, who has the original intent in their head, needs to shepherd that code out into production and watch real users use it. You can’t chop that up into multiple roles, dev and ops. You just can’t. Software engineers working on highly available systems need to be on call for their code.
But the flip side of this responsibility belongs to management. If you’re asking everyone to be on call, it is your sworn duty to make sure that on call does not suck. People shouldn’t have to plan their lives around being on call. People shouldn’t have to expect to be woken up on a regular basis. Every paging alert out of hours should be as serious as a heart attack, and this means allocating real engineering resources to keeping tech debt down and noise levels low.
And the way you get there is to first invest in observability, then delete all your paging alerts and start over from scratch.
Power has a way of flowing towards people managers over time, no matter how many times you repeat “management is not a promotion, it’s a career change.”
It’s natural, like water flowing downhill. Managers are privy to performance reviews and other personal information that they need to do their jobs, and they tend to be more practiced communicators. Managers facilitate a lot of decision-making and routing of people and data and things, and it’s very easy to slip into making all the decisions rather than empowering people to make them. Sometimes you want to just hand out assignments and order everyone to do as they’re told. (er, just me??)
But if you let all the power drift over to the engineering managers, pretty soon it doesn’t look so great to be an engineer. Now you have people becoming managers for all the wrong reasons, or everyone saying they want to be a manager, or engineers just tuning out and turning in their homework (or quitting). We all want autonomy and impact, we all crave a seat at the table. You need to work harder to save those seats for non-managers.
So, in the spirit of the enumerated rights and responsibilities of our musty Constitution, here are some of the commitments we make to our engineers at Honeycomb — and some of the expectations we have for management and engineering roles. Some of them mirror each other, and others are very different.
(Incidentally, I find it helpful to practice visualizing the org chart hierarchies upside down — placing managers below their teams as support structure rather than perched atop.)
Engineer’s Bill of Rights
You should be free to go heads down and focus, and trust that your manager will tap you when you are needed (or would want to be included).
We will invest in you as a leader, just like we invest in managers. Everybody will have opportunities to develop their leadership and interpersonal skills.
Technical decisions must remain the province of engineers, not managers.
You deserve to know how well you are performing, and to hear it early and often if you aren’t meeting expectations.
On call should not substantially impact your life, sleep, or health (other than carrying your devices around). If it does, we will fix it.
Your code reviews should be turned around in 24 hours or less, under ordinary circumstances.
You should have a career path that challenges you and contributes to your personal life goals, with the coaching and support you need to get there.
You should substantially choose your own work, in consultation with your manager and based on our business goals. This is not a democracy, but you will have a voice in our planning process.
You should be able to do your work whether in or out of the office. When you’re working remotely, your team will loop you in and have your back.
Engineer’s Bill of Responsibilities
Make forward progress on your projects every week. Be transparent.
Make forward progress on your career every quarter. Push your limits.
Build a relationship of trust and mutual vulnerability with your manager and team, and invest in those relationships.
Know where you stand: how well are you performing, how quickly are you growing?
Develop your technical judgment and leadership skills. Own and be accountable for engineering outcomes. Ask for help when you need it, give help when asked.
Give feedback early and often, receive feedback gracefully. Practice both saying no and hearing no. Let people retract and try again if it doesn’t come out quite right.
Own your time and actively manage your calendar. Spend your attention tokens mindfully.
Engineering Manager’s Bill of Responsibilities
Recruit and hire and train your team. Foster a sense of solidarity and “teaminess” as well as real emotional safety.
Care for every engineer on your team. Support them in their career trajectory, personal goals, work/life balance, and inter- and intra-team dynamics.
Give feedback early and often. Receive feedback gracefully. Always say the hard things, but say them with love.
Move us relentlessly forward, watching out for overengineering and work that doesn’t contribute to our goals. Ensure redundancy/coverage of critical areas.
Own the quarterly planning process for your team, be accountable for the goals you set. Allocate resources by communicating priorities and recruiting eng leads. Add focus or urgency where needed.
Own your time and attention. Be accessible. Actively manage your calendar. Try not to make your emotions everyone else’s problems (but do lean on your own manager and your peers for support).
Make your own personal growth and self-care a priority. Model the values and traits we want our engineers to pattern themselves after.
I’d love to hear from anyone else who has a list like this.
Last week was the West Coast Velocity conference. I had a terrific time — I think it’s the best Velocity I’ve been to yet. I also slipped in quite late, the evening before last, to catch Gareth’s session on DevOps vs SRE.
And it was worth it! Holy crap, this was such a fun barnburner of a talk, with Gareth feverishly arguing both for and against the talk’s central premise: “Google Infrastructure for Everyone Else (GIFEE)”, and whether SRE is a) the highest, noblest goal that we should all aspire towards, or b) mostly irrelevant to anyone outside the Google confines.
Which Gareth won? Check out the slides and judge for yourself. 🙃
At some point in his talk, though, Gareth tossed out something like “Charity probably already has a blog post on this drafted up somewhere.” And I suddenly remembered: “Fuck! I DO!” It’s been sitting in my Drafts for months, god dammit.
So this is actually a thing I dashed off back in April, after CraftConf. Somebody asked me for my opinion on the internet — always a dangerous proposition — and I went off on a bit of a rant about the differences and similarities between DevOps and SRE, as philosophies and practices.
Time passed and I forgot about it, and then decided it was too stale. I mean who really wants to read a rehash of someone’s tweetstorm from two months ago?
Well Gareth, apparently.
SRE vs DevOps: TWO PHILOSOPHIES ENTER, BOTH ARE PHENOMENALLY SUCCESSFUL AND MUTUALLY DUBIOUS OF ONE ANOTHER
The Google SRE book also has some really fucking obnoxious blurbs, things like how “ONLY GOOGLE COULD HAVE DONE THIS”, and a whiff of snobbery throughout the book as though they actually believe this (which is far worse if true).
You can’t really blame the poor blurb’ers, but you can certainly look askance at a massive systems engineering org when it seems as though they’ve never heard of DevOps, or considered how it relates to SRE practices, and may even be completely unaware of what the rest of the industry has been up to for the past 10-plus years. It’s just a little weird.
So here, for the record, is what I said about it.
1) a lot of the philosophical volleying between devops / SRE comes down to a failure to recognize the overwhelming power of context.
Google is a great company with lots of terrific engineers, but you can only say they are THE BEST at what they do if you’re defining what they do tautologically, i.e. “they are the best at making Google run.” Etsyans are THE BEST at running Etsy, Chefs are THE BEST at building Chef, because … that’s what they do with their lives.
Context is everything here. People who are THE BEST at Googling often flail and flame out in early startups, and vice versa. People who are THE BEST at early-stage startup engineering are rarely as happy or impactful at large, lumbering, more bureaucratic companies like Google. People who can operate equally well and be equally happy at startups and behemoths are fairly rare.
And large companies tend to get snobby and forget this. They stop hiring for unique strengths and start hiring for lack of weaknesses or “Excellence in Whiteboard Coding Techniques,” and congratulate themselves a lot about being The Best. This becomes harmful when it translates into less innovation, abysmal diversity numbers, and a slow but inexorable drift into dinosaurdom.
2) operations engineering is a specialized skill set *at large scale* or *on hard ops problems*. many -- most? companies don't have those.
Everybody thinks their problems are hard, but to a seasoned engineer, most startup problems are not technically all that hard. They’re tedious, and they are infinite, but anyone can figure this shit out. The hard stuff is the rest of it: the feverish pace, the need to reevaluate and reprioritize and reorient constantly, the total responsibility, the terror and uncertainty of trying to find product/market fit and perform ten jobs at once and personally deliver on your promises to your customers.
At a large company, most of the hardest problems are bureaucratic. You have to come to terms with being a very tiny cog in a very large wheel where the org has a huge vested interest in literally making you as replicable and replaceable as possible. The pace is excruciatingly slow if you’re used to a startup. The autonomy is … well, did I mention the politics? If you want autonomy, you have to master the politics.
3) the outcomes associated with operations (reliability, scalability, operability) are the responsibility of *everyone* from support to CEO.
Everyone. Operational excellence is everyone’s job. Dude, if you have a candidate come in and they’re a jerk to your office manager or your cleaning person, don’t fucking hire that person because having jerks on your team is an operational risk (not to mention, you know, like moral issues and stuff).
But the more engineering-focused your role is, the more direct your impact will be on operational outcomes.
4) therefore, the more literate you are with operational skills, the more effective and powerful you can be -- esp as a software engineer.
As a software engineer, developing strong ops chops makes you powerful. It makes you better at debugging and instrumentation, building resiliency and observability into your own systems and interdependent systems, and building systems that other people can come along and understand and maintain long after you’re gone.
As an operations engineer, those skills are already your bread and butter. You can increase your power in other ways, like by leveling up at software engineering skills like test coverage and automation, or DBA stuff like query optimization and storage engine internals, or by helping the other teams around you level up on their skills (communication and persuasion are chronically underrecognized as core operations engineering skills).
5) specialization is not a bad thing. specialization is how we scale and do capitalism! the problem is when this becomes compartmentalizing.
This doesn’t mean that everyone can or should be able to do everything. (I can’t even SAY the words “full stack engineer” without rolling my eyes.) Generalists are awesome! But past a certain inflection point, specialization is the only way an org can scale.
It’s the only way you make room for those engineering archetypes who only want to dive deep, or who really really love refactoring, or who will save the world then disappear for weeks. Those engineers can be incredibly valuable as part of a team … but they are most valuable in a large org where you have enough generalists to keep the oars rowing along in the meantime.
6) so: Google SRE has an incredibly powerful set of best practices, that enable them to run the largest site in the world incredibly well.
So, back to Google. They’ve done, ahem, rather well for themselves. Made shitbuckets of money, pushed the boundaries of tech, service hardly ever goes down. They have operational demands that most of us never have seen and never will, and their engineers are definitely to be applauded for doing a lot of hard technical and cultural labor to get there.
Mostly, though, the book comes off as a little tone deaf in places. I’m not personally pissed off by the Google SRE book, actually, just a little bemused at how legitimately unaware they seem to be about … anything else that the industry has been doing over the past 10 years, in terms of cultural transformation, turning sysadmins into better engineers, sharing on-call rotations, developing processes around empathy and cross-functionality, engineering best practices, etc.
If you try and just apply Google SRE principles to your own org according to their prescriptive model, you’re gonna be in for a really, really bad time.
However, it happens that Jen Davis and Katherine Daniels just published a book called Effective DevOps, which covers a lot of the same ground with a much more varied and inclusive approach. And one of the things they return to over and over again is the power of context, and how one-size-fits-all solutions simply don’t exist, just like unisex OSFA t-shirts are a dirty fucking lie.
Google insularity is … a thing. On the one hand it’s great that they’re opening up a bit! On the other hand it’s a little bit like when somebody barges onto a mailing list and starts spouting without skimming any of the archives. And don’t even get me started on what happens when you hire longterm ex-Googlers back into the real world.
So, so many of us have had this experience of hiring ex-Googlers who automatically assume that the way Google does a thing is CORRECT, not just contextually appropriate. Not just right for Google, but right for everyone, always. Which is just obviously untrue. But the reassimilation process can be quite long and exhausting when the Kool-Aid is so strong.
8) DevOps as a philosophy is much more sensitive to context than SRE philosophy, because it grew from a broader collaborative base.
Because yeah, this is a conversation and a transformation that the industry has been having for a long time now. Compared with the SRE manifesto, the DevOps philosophy is much more crowd-sourced, more flexible, and more adaptable to organizations at all stages of development, with all different requirements and key business differentiators, because it has benefited all along from loud, mouthy contributors who aren’t all working in the same bubble.
And it’s like Google isn’t even aware this was happening, which is weird.
9) that's it, basically all i'm saying is "all blanket statements are false" including probably this one 🙂 #devops #sre
Orrrrrr, maybe I’m just a wee bit annoyed that I’ve been drawn into this position of having to defend “DevOps”, after many excellent years spent being grumpy about the word and the 10000010101 ways it is used and abused.
(Tell me again about your “DevOps Engineering Team”, I dare you.)
(^^ thanks to @kellan and others who particularly influenced/clarified my thinking around #8, the crowdsourcing of devops)
P.S. I highly encourage you to go read the epic hours-long rant by @matthiasr that kicked off the whole thing. Some of it I definitely endorse and some of it I don’t, but I think we could go drink whiskey and yell about this for a week or two, easy breezy <3
So, a few hot takes on that … 1) SRE, as practiced by Google, is really just Ops with a lot of management support