WTF is operations? #serverless

I just got back from the very first ever @serverlessconf in NYC.  I have a soft spot for well-curated single-track conferences, and the organizers did an incredible job.  Major kudos to @iamstan and team for pulling together such a high-caliber mix of attendees as well as presenters.

I’m really honored that they asked me to speak.  And I had a lot of fun delivering my talk!  But in all honesty, I turned it down a few times — and then agreed, and then backed out, and then agreed again at the very last moment.  I just had this feeling like the attendees weren’t going to want to hear what I was gonna say, or like we weren’t gonna be speaking the same language.

Rainbow_dash_12_by_xpesifeindx-d5giyirWhich … turned out to be mmmmostly untrue.  To the organizers’ credit, when I expressed this concern to them, they vigorously argued that they wanted me to talk *because* they wanted a heavy dose of real talk in the mix along with all the airy fairy tales of magic and success.


So #serverless is the new cloud or whatever

Hi, I’m grouchy and I work with operations and data and backend stuff.  I spent 3.5 years helping Parse grow from a handful of apps to over a million.  Literally building serverless before it was cool TYVM.

So when I see kids saying “the future is serverless!” and “#NoOps!” I’m like okay, that’s cute.  I’ve lived the other side of this fairytale.  I’ve seen what happens when application developers think they don’t have to care about the skills associated with operations engineering.  When they forget that no matter how pretty the abstractions are, you’re still dealing with dusty old concepts like “persistent state” and “queries” and “unavailability” and so forth, or when they literally just think they can throw money at a service to make it go faster because that’s totally how services work.

I’m going to split this up into two posts.  I’ll write up a recap of my talk in a sec, but first let’s get some things straight.  Like words.  Like operations.

What is operations?

Let’s talk about what “operations” actually means, in the year 2016, assuming a reasonably high-functioning engineering environment.

Screen Shot 2016-05-30 at 4.29.09 PM

At a macro level, operational excellence is not a role, it’s an emergent property.  It is how you get shit done.

Operations is the sum of all of the skills, knowledge and values that your company has built up around the practice of shipping and maintaining quality systems and software.  It’s your implicit values as well as your explicit values, habits, tribal knowledge, reward systems.  Everybody from tech support to product people to CEO participates in your operational outcomes, even though some roles are obviously more specialized than others.

Saying you have an ops team who is solely responsible for reliability is about as silly as saying that “HR defines and owns our company culture!”  No.  Managers and HR professionals may have particular skills and responsibilities, but culture is an emergent property and everyone contributes (and it only takes a couple bad actors to spoil the bushel).

Thinking about operational quality in terms of “a thing some other team is responsible for” is just generally not associated with great outcomes.  It leads to software engineers who are less proficient or connected to their outcomes, ops teams who get burned out, and an overall lower quality of software and services that get shipped to customers.

These are the specialized skill sets that I associate with really good operations engineers.  Do these look optional to you?

Screen Shot 2016-05-30 at 4.41.21 PM

It depends on your mission, but usually these are not particularly optional.  If you have customers, you need to care about these things.  Whether you have a dedicated ops team or not.  And you need to care about the tax it imposes on your humans too, especially when it comes to the cognitive overhead of complex systems.

So this is my definition of operations.  It doesn’t have to be your definition.  But I think it is a valuable framework for helping us reason about shipping quality software and healthy teams.  Especially given the often invisible nature of operations labor when it’s done really well.  It’s so much easier to notice and reward shipping shiny features than “something didn’t break”.

The inglorious past

Don’t get me wrong — I understand why “operations” has fallen out of favor in a lot of crowds.  I get why Google came up with “SRE” to draw a line between what they needed and what the average “sysadmin” was doing 10+ years ago.

Ops culture has a number of well-known and well-documented pathologies: hero/martyr complexes, risk aversion, burnout, etc.  I understand why this is offputting and we need to fix it.

Also, historically speaking, ops has attracted a greater proportion of nontraditional oddballs who just love debugging and building things — fewer Stanford CS PhDs, more tinkerers and liberal arts majors and college dropouts (hi).  And so they got paid less, and had less engineering discipline, and burned themselves out doing too much ad hoc labor.Rainbow_Dash_3.png

But — this is no longer our overwhelming reality, and it is certainly not the reality we are hurtling towards.  Thanks to the SRE movement, and the parallel and even more powerful & diverse open source DevOps movement, operations engineers are … engineers.  Who specialize in infrastructure.  And there’s more value than ever in empathy and fluid skill sets, in engineers who are capable of moving between disciplines and translating between specialties.  This is where the “full-stack developer” buzzword comes from.  It’s annoying, but reflects a real craving for generalist skill sets.

The BOFH stereotype is dead.  Some of the most creative cultural and technical changes in the technical landscape are being driven by the teams most identified with operations and developer tooling.  The best software engineers I know are the ones who consistently value the impact and lifecycle of the code they ship, and value deployment and instrumentation and observability.  In other words, they rock at ops stuff.

The Glorious Future

And so I think it’s time to bring back “operations” as a term of pride.  As a thing that is valued, and rewarded.  As a thing that every single person in an org understands as being critical to success.  Every organization has unique operational needs, and figuring out what they are and delivering on them takes a lot of creativity and ingenuity on both the cultural and technical front.

“Operations” comes with baggage, no doubt.  But I just don’t think that distance and denial are aRainbow_Dash_in_flightn effective approach for making something better, let alone trash talking and devaluing the skill sets that you need to deliver quality services.

You don’t make operational outcomes magically better by renaming the team “DevOps” or “SRE” or anything else.  You make it better by naming it and claiming it for what it is, and helping everyone understand how their role relates to your operational objectives.

And now that I have written this blog post I can stop arguing with people who want to talk about “DevOps Engineers” and whether “#NoOps” is a thing and maybe I can even stop trolling them back about the nascent “#NoDevs” movement.  (Haha just kidding, that one is too much fun.)

Part 2: Operations in a #Serverless World


WTF is operations? #serverless

39 thoughts on “WTF is operations? #serverless

  1. I like a definition of software architecture that says it’s the area of software design that caters for non functional requirements which, not surprisingly, include all of your core competencies.

  2. Great post. Thank you.
    Agree with you on the 7 core competencies of operational engineering. Strong operational engineers are able to reason about the *entire system* at perhaps the most inconvenient time. To me that’s the difference between good operational heads and weak operational heads. The definition of entire can vary depending on the scope /responsibility of that person but they have to be able to reason about it as whole – as a thing. So understanding scalability ( and dimensions that are relevant to the app ) is important, metrics are important etc.
    Will reserve judgment on the impact of SRE and open source devops movements (in by themselves) but your point about more diverse and fluid people making a big impact in this space is absolutely bang-on.
    The reason being that you can’t fake being a good ops person. You actually can’t – not for more than 10minutes anyway. You can absolutely fake being an average ‘developer’ for a period of time, you can fake being an ‘architect’ for lot longer than that( sometimes decades ). Because the job itself is incompatible with fakery it automatically becomes less prone to systemic biases. Almost all of my senior engineers ( by definition need to be good operationally as well) come from diverse backgrounds and all (30+) except 1 do not have any formal CS schooling. And I’ve been doing this for a long time so it’s not a small sample size.
    Interesting that you mention empathy as a factor – I didn’t think about it that way but upon reflection that makes a lot of sense. Empathy with the end user, with the development team and empathy with management ( ideally very close to the end user in that respect) all are more important in a runtime setting than a more traditional ‘compile’ time one.

    1. yes!! completely! in a well functioning engineering team, the pain of the team is tied as closely as possible with the pain of the customers, which means your incentives are automatically aligned.

  3. […] WTF is operations? #serverless In this second part of a two part series (featured here last week), Charity Majors delves into what operations means as we move toward a “serverless” infrastructure. If you chose a provider, you do not get to just point your finger at them in the post mortem and say it’s their fault.  You chose them, it’s on you. […]

  4. Ma Qingli says:

    I am thinking if it’s a good idea if we ask a developer to do the operation work for one month per year. Will their output become better?

    1. This is a great idea! But I think that shorter blocks of time more frequently will have more impact. A lot will change in a year. Doing a week every couple months, or a couple days a month, or doing on call work / shadowing with a buddy on a month or two rotation will keep their skills fresh, and keep them from having to take long breaks away from their project. It will also have less negative impact on your ops team in terms of training costs (and newbie people breaking things :))

    1. Well, they were different talks to *very* different audiences on only somewhat overlapping themes, so that seems rather expected. 🙂 Do you have any … further comments?

  5. GP says:

    @mipsytipsy thanks for the post. I wonder why you don’t think security is a core competency of an operations engineering team? We already know that developers don’t really care about it but if even operations doesn’t, then who does? 🙂

    1. oh man, it totally is! i guess i tend to wrap ‘secure’ in to the umbrella with reliable and maintainable, but in lots of orgs it is totally worthy of calling out on its own. thanks for rthe note!

  6. So much could be said in this topic. The bottom line for me is that “serverless” is only possible because some other organization deals with the servers and abstracts that away. Operations still exists, you just don’t see it running around in your office.
    Abstractions created by cloud services providers are generating a trend of abstraction in job functions. DevOps is a philosophy and culture that still requires operations activities, and serverless is the natural progression from cloud computing.
    I am just wondering what the next step of abstraction will be, who will champion it (AWS, google, or MS (no way)) and what new type of job description it will create.

Leave a Reply to Why Serverless Still Needs Critical Alerting | OnPageCancel reply