DevOps vs SRE: delayed coverage of the dumbest war

Last week was the West Coast Velocity conference.  I had a terrific time — I think it’s the
best Velocity I’ve been to yet.  I also slipped in quite late, the evening before last, to catch Gareth’s session on DevOps vs SRE.

had to catch it, because Gareth Rushgrove (of DevOps Weekly glory) was taunting @lusis and me about it on the Internet.

rainbowdropletAnd it was worth it!   Holy crap, this was such a fun barnburner of a talk, with Gareth schizophrenically arguing both for and against the key premise of the talk, which was about “Google Infrastructure for Everyone Else (GIFEE)” and whether SRE is a) the highest, noblest goal that we should all aspire towards, or b) mostly irrelevant to anyone outside the Google confines.

Which Gareth won?  Check out the slides and judge for yourself.  🙃

 

At some point in his talk, though, Gareth tossed out something like “Charity probably already has a blog post on this drafted up somewhere.”  And I suddenly remembered “Fuck!  I DO!”  it’s been sitting in my Drafts for months god dammit.

So this is actually a thing I dashed off back in April, after CraftConf.  Somebody asked me for my opinion on the internet — always a dangerous proposition — and I went off on a bit of a rant about the differences and similarities between DevOps and SRE, as philosophies and practices.
colorful-rainbow

Time passed and I forgot about it, and then decided it was too stale.  I mean who really wants to read a rehash of someone’s tweetstorm from two months ago?

Well Gareth, apparently.

Anyway: enjoy.


SRE vs DevOps: TWO PHILOSOPHIES ENTER, BOTH ARE PHENOMENALLY SUCCESSFUL AND MUTUALLY DUBIOUS OF ONE ANOTHER


 

So in case you haven’t noticed, Google recently published a book about Site Reliability Engineering: How Google Runs Production Systems.  It contains some really terrific wisdom on how to scale both systems and orgs.  It contains chapters written by dear friends of mine.  It’s a great book, and you should buy it and read it!

Rainbow-Umbrella-Z-5_5It also has some really fucking obnoxious blurbs.  Things like about how “ONLY GOOGLE COULD HAVE DONE THIS”, and an whiff of snobbery throughout the book as though they actually believe this (which is far worse if true).

You can’t really blame the poor blurb’ers, but you can certainly look askance at a massive systems engineering org when it seems as though they’ve never heard of DevOps, or considered how it relates to SRE practices, and may even be completely unaware of what the rest of the industry has been up to for the past 10-plus years.  It’s just a little weird.

So here, for the record, is what I said about it.

 

Google is a great company with lots of terrific engineers, but you can only say they are THE

sre
The Google SRE Bible

BEST at what they do if you’re defining what they do tautologically, i.e. “they are the best at making Google run.”  Etsyans are THE BEST at running Etsy, Chefs are THE BEST at building Chef, because … that’s what they do with their lives.

Context is everything here.  People who are THE BEST at Googling often flail and flame out in early startups, and vice versa.  People who are THE BEST at early-stage startup engineering are rarely as happy or impactful at large, lumbering, more bureaucratic companies like Google.  People who can operate equally well and be equally happy at startups and behemoths are fairly rare.

And large companies tend to get snobby and forget this.  They stop hiring for uniquerainbow-swirl strengths and start hiring for lack of weaknesses or “Excellence in Whiteboard Coding Techniques,” and congratulate themselves alot about being The Best.  This becomes harmful when it translates into to less innovation, abysmal diversity numbers, and a slow but inexorable drift into dinosaurdom.

Everybody thinks their problems are hard, but to a seasoned engineer, most startup problems are not technically all that hard.  They’re tedious, and they are infinite, but anyone can figure this shit out.  The hard stuff is the rest of it: feverish pace, the need to reevaluate and reprioritize and reorient constantly, the total responsibility, the terror and uncertainty of trying to find product/market fit and perform ten jobs at once and personally deliver to your promises to your customers.

rainbow-cloud-dropletAt a large company, most of the hardest problems are bureaucratic.  You have to come to terms with being a very tiny cog in a very large wheel where the org has a huge vested interest in literally making you as replicable and replaceable as possible.  The pace is excruciatingly slow if you’re used to a startup.  The autonony is … well, did I mention the politics?  If you want autonomy, you have to master the politics.

 

Everyone.  Operational excellence is everyone’s job.  Dude, if you have a candidate come in and they’re a jerk to your office manager or your cleaning person, don’t fucking hire that person because having jerks on  your team is an operational risk (not to mention, you know, like moral issues and stuff).

But the more engineering-focused your role is, the more direct your impact will be on operational outcomes.

As a software engineer, developing strong ops chops makes you powerful.  It makes you better at debugging and instrumentation, building resiliency and observability into your own systems and interdependent systems, and building systems that other people can come along and understand and maintain long after you’re gone.rainbow-shade

As an operations engineer, those skills are already your bread and butter.  You can increase your power in other ways, like by leveling up at software engineering skills like test coverage and automation, or DBA stuff like query optimization and storage engine internals, or by helping the other teams around you level up on their skills (communication and persuasion are chronically underrecognized as core operations engineering skills).

This doesn’t mean that everyone can or should be able to do everything.  (I can’t even SAYrainbow-dot-ball the words “full stack engineer” without rolling my eyes.)  Generalists are awesome!  But past a certain inflection point, specialization is the only way an org can scale.

It’s the only way you make room for those engineering archetypes who only want to dive deep, or who really really love refactoring, or who will save the world then disappear for weeks.  Those engineers can be incredibly valuable as part of a team … but they are most valuable in a large org where you have enough generalists to keep the oars rowing along in the meantime.

So, back to Google.  They’ve done, ahem, rather  well for themselves.  Made shitbuckets of money, pushed the boundaries of tech, service hardly ever goes down.  They have operational demands that most of us never have seen and never will, and their engineers are definitely to be applauded for doing a lot of hard technical and cultural labor to get there.

So why did this SRE book ruffle a few feathers?

Mostly because it comes off a little tone deaf in places.  I’m not personally pissed off by
the google SRE book, actually, just a little bemused at how legitimately unaware they seem to be about … anything else that the industry has been doing over the past 10 years, in terms of cultural transformation, turning sysadmins into better engineers, sharing on-call rotations, developing processes around empathy and cross-functionality, engineering best practices, etc.

devops
DevOps for the rest of us

If you try and just apply Google SRE principles to your own org according to their prescriptive model, you’re gonna be in for a really, really bad time.

However, it happens that Jen Davis and Katherine Daniels just published a book called Effective DevOps, which covers a lot of the same ground with a much more varied and inclusive approach.  And one of the things they return to over and over again is the power of context, and how one-size-fits-all solutions simply don’t exist, just like unisex OSFA t-shirts are a dirty fucking lie.

Google insularity is … a thing.  On the one hand it’s great that they’re opening up a bit!rainbow-umbrella-clipart-1 On the other hand it’s a little bit like when somebody barges onto a mailing list and starts spouting without skimming any of the archives.  And don’t even get me started on what happens when you hire long, longterm ex-Googlers back into to the real world.

So, so many of us have had this experience of hiring ex-Googlers who automatically assume that the way Google does a thing is CORRECT, not just contextually appropriate.   Not just right for Google, but right for everyone, always.  Which is just obviously untrue.  But the reassimilation process can be quite long and exhausting when the Kool-Aid is so strong.

Because yeah, this is a conversation and a transformation that the industry has been having for a long time now.  Compared with the SRE manifesto, the DevOps philosophy is much more crowd-sourced, more flexible, and adaptable to organizations of all stages of developments, with all different requirements and key business differentiators, because it’s benefited from loud, mouthy contributors who aren’t all working in the same bubble all along.

And it’s like Google isn’t even aware this was happening, which is weird.

Orrrrrr, maybe I’m just a wee bit annoyed that I’ve been drawn into this position rainbow-dot-ballof having to defend “DevOps”, after many excellent years spent being grumpy about the word and the 10000010101 ways it is used and abused.

(Tell me again about your “DevOps Engineering Team”, I dare you.)

P.S. I highly encourage you to go read the epic hours-long rant by @matthiasr that kicked off the whole thing.  some of which I definitely endorse and some of which not, but I think we could go drink whiskey and yell about this for a week or two easy breezy  ❤

Anyway what the fuck do I know, I’ve never worked in the Google lair, so maybe I am just under-equipped to grasp the true glory, majesty and superiority of their achievements over us all.

Or maybe they should go read Katherine and Jen’s book and interact with the “UnGoogled” once in a while.  ☺️

colorful-rainbow

 

DevOps vs SRE: delayed coverage of the dumbest war

Operational Best Practices #serverless

This post is part two of my recap of last week’s terrific Serverless conference.  If you feel like getting bitchy with me about what serverless means or #NoOps or whatever, please refer back to the prequel post, where I talked about operations engineering in the modern world.

*Then* you can get bitchy with me.  (xoxoxxooxo)

The title of my talk was:

Screen Shot 2016-05-30 at 8.43.39 PM

The theme of my talk was basically: what should software engineers know and care about when it comes to operations in a world where we are outsourcing more and more core functionality?

If you care about running a quality service or product, or providing your customers with a reasonable level of service, you have to care about operational concerns like design, resiliency, instrumentation and debuggability.  No matter how many abstractions there are between you and the bare metal.

If you chose a provider, you do not get to just point your finger at them in the post mortem and say it’s their fault.  You chose them, it’s on you.  It’s tacky to blame the software or the service, and besides your customers don’t give a shit whose “fault” it is.

So given an infinite number of things to care about, where do you start?

What is your mission, and what are your differentiators?

The first question must always be: what is your mission?  Your mission is not writing software.  Your mission is delivering whatever it is your customers are paying you for, and you use software to get there.  (Code is kind of a liability so you should write as little of it as necessary.  hey!! sounds like a good argument for #serverless!)

Second: what are your core differentiators?  What are the things that you are doing that are unique, and difficult to replicate, or the things where you have to actually be world class experts in those things?

Those are the things that you will have the hardest time outsourcing, or that you should think about very carefully before outsourcing.

Screen Shot 2016-05-30 at 7.57.06 PM

Facts

You can outsource labor, but you can’t outsource caring.  And nobody but you is in the position to think about your core differentiators and your product in a holistic way.

If you’re a typical early startup, you’re probably using somewhere between 5 and 20 SaaS products to get rid of some of the crap work and offload it to dedicated teams who can do it better than you can, much more cheaply, so you are freed up to work on your core value proposition.

GOOD.

But you still have to think about things like reliability, your security model, your persistent storage models, your query performance, how all these lovely services talk to each other, how you’re going to debug them, how you’re going to repro when things go wrong, etc.  You still own these things, even if you don’t run them.

For example, take AWS Lambda.  It’s a pretty great service on many dimensions.  It’s an early version of the future.  It is also INCREDIBLY irritating and challenging to debug in a practically infinite number of insanity-inducing ways.

** Important side note — I’m talking about actual production systems.  Parse, Heroku, Lambda, etc are GREAT for prototyping and can take you a long, long way.  Early stage startups SHOULD optimize for agility and rapid developer iteration, not reliability.  Thx to @joeemison for reminding me that i left that out of the recap.

Screen Shot 2016-05-30 at 8.03.01 PM

Focus on the critical path

Your users don’t care if your internal jenkins builds are broken.  They don’t care about a whole lot of things that you have to care about … eventually.  They do care a lot if your product isn’t actually functional.  Which means you have to think through the behavioral and failure characteristics of the providers you’re relying on in any user visible fashion.

Ask lots of questions if you can.  (AWS often won’t tell you much, but smaller providers will.)  Find out as much as you can about their cotenancy model (shared hardware or isolation?), their typical performance variance (run your own tests, don’t trust their claims), and the underlying storage systems.

Think about how you can bake in resiliency from the user’s perspective, that doesn’t rely on provider guarantees.  If you’re on mobile, can you provide a reasonable offline experience?  Like Parse did a lot of magic here in the APIs, where it would back off and retry saves if there were any errors.

Can you fail over to another provider if one is down?  Is it even worth it at your company’s stage of maturity and engineering resources to invest in this?

How willing are you to be locked into a vendor or provider, and what is the story if you find yourself forced to switch?  Or if that service goes away, as so many, many, many of them have done and will do.  (RIP, parse.com.)

Screen Shot 2016-05-30 at 8.11.10 PM

Tradeoffs

Listen, outsourcing is awesome.  I do it as much as I can.  I’m literally helping build a service that provides outsourced metrics, I believe in this version of the future!  It’s basically the latest iteration of capitalism in a nutshell: increased complexity –> increased specialization –> you pay other people to do the job better than you –> everybody wins.

But there are tradeoffs, so let’s be real.

The service, if it is smart, will put strong constraints on how you are able to use it, so they are more likely to deliver on their reliability goals.  When users have flexibility and options it creates chaos and unreliability.  If the platform has to choose between your happiness vs thousands of other customers’ happiness, they will choose the many over the one every time — as they should.

Limits may mysteriously change or be invented as they are discovered, esp with fledgling services.  You may be desperate for a particular feature, but you can’t build it.  (This is why I went for Kafka over Kinesis.)

You need to think way more carefully and more deeply about visibility and introspection up front than you would if you were running your own services, because you have no ability to log in and use strace or gdb or tail a logfile or run any system profiling commands when things go dark.

In the best case, you’re giving up some control and quality in exchange for experts doing the work better than you could for cheaper (e.g. i’m never running a fucking physical data center again, jesus.  EC24lyfe).  In a common worse case, it’s less reliable than what you would build AND it’s also opaque AND you can’t tell if it’s down for you or for everyone because frankly it’s just massively harder to build a service that works for thousands/millions of use cases than for any one of them individually.

Screen Shot 2016-05-30 at 8.21.36 PM.png

Stateful services

Ohhhh and let’s just briefly talk about state.

The serverless utopia mostly ignores the problems of stateful services.  If pressed they will usually say DynamoDB, or Firebase, or RDS or Aurora or something.

This is a big, huge, deep, wide lake of crap to wade in to so all I’m going to say is that there is no such thing as having the luxury of not having to understand how your storage systems work.  Queries will get slow, and you’ll need to be able to figure out why and fix them.  You’ll hit scaling cliffs where suddenly a perfectly-usable app just starts timing everything out because of that extra second of latency coming from …

¯\_(ツ)_/¯

The hardware underlying your instance will degrade (there’s a server somewhere under all those abstractions, don’t forget).  The provider will have mysterious failures.  They will be better than you, probably, but less inclined to give you satisfactory progress updates because there are hundreds or thousands or millions of you all clamoring.

The more you understand about your storage system (and the more you stay in the lane of how it was intended to be used), the happier you’ll be.

Screen Shot 2016-05-30 at 8.28.29 PM

In conclusion

These trends are both inevitable and, for the most part, very good news for everyone.

Operations engineering is becoming a more fascinating and specialized skill set.  The best engineers are flocking to solve category problems — instead of building the same system at company after company, they are building SaaS solutions to solve it for the internet at large.  Just look at the massive explosion in operational software offerings over the past 5-6 years.

This means that the era of the in-house dedicated ops team, which serves as an absorbent buffer for all the pain of software development, is mostly on its way out the door.  (And good riddance.)

People are waking up to the fact that software quality improves when feedback loops are tighter for software engineers, which means being on call and owning services end to end.  The center of gravity is shifting towards engineering teams owning the services they built.

This is awesome!  You get to rent engineers from Google, AWS, Pagerduty, Pingdom, Heroku, etc for much cheaper than if you hired them in-house — if you could even get them, which you probably can’t because talent is scarce.

Screen Shot 2016-05-30 at 8.39.42 PM

But the flip side of this is that application engineers need to get better at thinking in traditionally operations-oriented ways about reliability, architecture, instrumentation, visibility, security, and storage.  Figure out what your core differentiators are, and own the shit out of those.

Nobody but you can care about your mission as much as you can.  Own it, do it.  Have fun.

 

Operational Best Practices #serverless