Operational Best Practices #serverless

May 31, 2016 | June 1, 2016 | mipsytipsy | aws, operations, serverless

This post is part two of my recap of last week’s terrific Serverless conference. If you feel like getting bitchy with me about what serverless means or #NoOps or whatever, please refer back to the prequel post, where I talked about operations engineering in the modern world.

*Then* you can get bitchy with me. (xoxoxxooxo)

The title of my talk was:

the tooth fairy is a useful abstraction for the pain of life and rewards of having people who care about you ^_^

— Charity Majors (@mipsytipsy) May 23, 2016

The theme of my talk was basically: what should software engineers know and care about when it comes to operations in a world where we are outsourcing more and more core functionality?

If you care about running a quality service or product, or providing your customers with a reasonable level of service, you have to care about operational concerns like design, resiliency, instrumentation and debuggability. No matter how many abstractions there are between you and the bare metal.

If you chose a provider, you do not get to just point your finger at them in the post mortem and say it’s their fault. You chose them, it’s on you. It’s tacky to blame the software or the service, and besides, your customers don’t give a shit whose “fault” it is.

So given an infinite number of things to care about, where do you start?

What is your mission, and what are your differentiators?

The first question must always be: what is your mission? Your mission is not writing software. Your mission is delivering whatever it is your customers are paying you for, and you use software to get there. (Code is kind of a liability so you should write as little of it as necessary. hey!! sounds like a good argument for #serverless!)

Second: what are your core differentiators? What are the things you do that are unique and difficult to replicate, or where you actually have to be world-class experts?

Those are the things that you will have the hardest time outsourcing, or that you should think about very carefully before outsourcing.

Facts

You can outsource labor, but you can’t outsource caring. And nobody but you is in the position to think about your core differentiators and your product in a holistic way.

If you’re a typical early startup, you’re probably using somewhere between 5 and 20 SaaS products to get rid of some of the crap work and offload it to dedicated teams who can do it better than you can, much more cheaply, so you are freed up to work on your core value proposition.

GOOD.

But you still have to think about things like reliability, your security model, your persistent storage models, your query performance, how all these lovely services talk to each other, how you’re going to debug them, how you’re going to repro when things go wrong, etc. You still own these things, even if you don’t run them.

For example, take AWS Lambda. It’s a pretty great service on many dimensions. It’s an early version of the future. It is also INCREDIBLY irritating and challenging to debug in a practically infinite number of insanity-inducing ways.

** Important side note — I’m talking about actual production systems. Parse, Heroku, Lambda, etc are GREAT for prototyping and can take you a long, long way. Early stage startups SHOULD optimize for agility and rapid developer iteration, not reliability. Thx to @joeemison for reminding me that i left that out of the recap.

Focus on the critical path

Your users don’t care if your internal jenkins builds are broken. They don’t care about a whole lot of things that you have to care about … eventually. They do care a lot if your product isn’t actually functional. Which means you have to think through the behavioral and failure characteristics of the providers you’re relying on in any user visible fashion.

Ask lots of questions if you can. (AWS often won’t tell you much, but smaller providers will.) Find out as much as you can about their cotenancy model (shared hardware or isolation?), their typical performance variance (run your own tests, don’t trust their claims), and the underlying storage systems.
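
"Run your own tests" can start very small. Here is a minimal sketch in Python, assuming you can wrap one representative provider request in a function. `measure_latency` and `fake_provider_call` are hypothetical names made up for illustration, and the sleep just stands in for a real network call.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency(call: Callable[[], object], samples: int = 100) -> Dict[str, float]:
    """Time `call` repeatedly and summarize the spread in milliseconds."""
    if samples < 2:
        raise ValueError("need at least 2 samples to compute percentiles")
    timings_ms: List[float] = []
    for _ in range(samples):
        start = time.perf_counter()
        call()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    # n=100 yields 99 cut points: index 49 is p50, 94 is p95, 98 is p99.
    cuts = statistics.quantiles(timings_ms, n=100, method="inclusive")
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "max": max(timings_ms),
    }


def fake_provider_call() -> None:
    # Hypothetical stand-in: replace with a real request to your provider.
    time.sleep(0.001)


if __name__ == "__main__":
    print(measure_latency(fake_provider_call, samples=50))
```

The number to watch is the gap between p50 and p99, not the average. Run it at different times of day and from wherever your users actually are, since one burst of samples from your laptop tells you very little.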

Think about how you can bake in resiliency from the user’s perspective in ways that don’t rely on provider guarantees. If you’re on mobile, can you provide a reasonable offline experience? Parse, for example, did a lot of magic here in its APIs, backing off and retrying saves if there were any errors.
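
For a rough idea of what "back off and retry" looks like on the client side, here is a minimal sketch in Python. This is not Parse's actual implementation; `save_with_backoff` and its parameters are hypothetical, and it uses plain exponential backoff with random jitter as one common way to do it.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def save_with_backoff(
    save: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `save`, retrying failures with capped exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return save()
        except Exception:
            # Assumption: every exception is worth retrying. Real code should
            # only retry errors it knows to be transient (timeouts, throttling).
            if attempt == max_attempts - 1:
                raise
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Random jitter keeps a fleet of clients from retrying in lockstep.
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

Only retry operations that are safe to repeat. If a save can land twice, you need an idempotency key or a dedupe step on the other side, and no amount of client-side cleverness will fix that for you.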

Can you fail over to another provider if one is down? Is it even worth it at your company’s stage of maturity and engineering resources to invest in this?
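
If you decide it is worth it, the simplest version of failover is just an ordered list of providers that you try in turn. A minimal sketch in Python, where `call_with_failover` and the provider names are hypothetical, and each provider is assumed to sit behind a function with the same signature:

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def call_with_failover(
    providers: Sequence[Tuple[str, Callable[[], T]]],
) -> Tuple[str, T]:
    """Try each (name, call) pair in order; return the first success."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call()
        except Exception as exc:
            # Keep the failure so the final error says what went wrong where.
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

The code is the easy part. The expensive part is keeping two providers interchangeable (same data, same semantics, same limits), which is exactly why the stage-of-maturity question matters.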

How willing are you to be locked into a vendor or provider, and what is the story if you find yourself forced to switch? Or if that service goes away, as so many, many, many of them have done and will do. (RIP, parse.com.)

Tradeoffs

Listen, outsourcing is awesome. I do it as much as I can. I’m literally helping build a service that provides outsourced metrics; I believe in this version of the future! It’s basically the latest iteration of capitalism in a nutshell: increased complexity –> increased specialization –> you pay other people to do the job better than you –> everybody wins.

But there are tradeoffs, so let’s be real.

The service, if it is smart, will put strong constraints on how you are able to use it, so that it is more likely to deliver on its reliability goals. When users have flexibility and options, it creates chaos and unreliability. If the platform has to choose between your happiness and thousands of other customers’ happiness, it will choose the many over the one every time, as it should.

Limits may mysteriously change or be invented as they are discovered, especially with fledgling services. You may be desperate for a particular feature, but you can’t build it yourself. (This is why I went for Kafka over Kinesis.)

You need to think way more carefully and more deeply about visibility and introspection up front than you would if you were running your own services, because you have no ability to log in and use strace or gdb or tail a logfile or run any system profiling commands when things go dark.
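
One way to think about visibility up front: wrap every call you make to a provider so it emits a structured event with timing and outcome, because that event may be all you have when things go dark. A minimal sketch in Python. `instrumented` is a hypothetical helper, the field names are made up, and `emit` defaults to `print` only because where the events should go is up to you.

```python
import json
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterator


@contextmanager
def instrumented(
    service: str,
    operation: str,
    emit: Callable[[str], None] = print,
) -> Iterator[Dict[str, object]]:
    """Emit one JSON event per call with duration, outcome, and any extra fields."""
    event: Dict[str, object] = {"service": service, "operation": operation}
    start = time.perf_counter()
    try:
        # The caller can attach context (request IDs, payload sizes) to `event`.
        yield event
        event["ok"] = True
    except Exception as exc:
        event["ok"] = False
        event["error"] = repr(exc)
        raise
    finally:
        event["duration_ms"] = round((time.perf_counter() - start) * 1000.0, 3)
        # default=str keeps an unserializable field from eating the whole event.
        emit(json.dumps(event, sort_keys=True, default=str))
```

Usage looks like `with instrumented("lambda", "invoke") as event: ...`, setting something like `event["request_id"]` inside the block. The point is that the failure path gets recorded with the same detail as the success path.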

In the best case, you’re giving up some control and quality in exchange for experts doing the work better than you could for cheaper (e.g. i’m never running a fucking physical data center again, jesus. EC24lyfe). In a common worse case, it’s less reliable than what you would build AND it’s also opaque AND you can’t tell if it’s down for you or for everyone because frankly it’s just massively harder to build a service that works for thousands/millions of use cases than for any one of them individually.

Stateful services

Ohhhh and let’s just briefly talk about state.

The serverless utopia mostly ignores the problems of stateful services. If pressed, its proponents will usually say DynamoDB, or Firebase, or RDS or Aurora or something.

Real question how does state get persisted with #serverless?

I understand scale out of stateless servers, but who stores the state?

— Caitie McCaffrey (@caitie) May 26, 2016

This is a big, huge, deep, wide lake of crap to wade into, so all I’m going to say is that you never get the luxury of not understanding how your storage systems work. Queries will get slow, and you’ll need to be able to figure out why and fix them. You’ll hit scaling cliffs where suddenly a perfectly usable app just starts timing everything out because of that extra second of latency coming from …

¯\_(ツ)_/¯

The hardware underlying your instance will degrade (there’s a server somewhere under all those abstractions, don’t forget). The provider will have mysterious failures. They will be better than you, probably, but less inclined to give you satisfactory progress updates because there are hundreds or thousands or millions of you all clamoring.

The more you understand about your storage system (and the more you stay in the lane of how it was intended to be used), the happier you’ll be.

In conclusion

These trends are both inevitable and, for the most part, very good news for everyone.

Operations engineering is becoming a more fascinating and specialized skill set. The best engineers are flocking to solve category problems — instead of building the same system at company after company, they are building SaaS solutions to solve it for the internet at large. Just look at the massive explosion in operational software offerings over the past 5-6 years.

This means that the era of the in-house dedicated ops team, which serves as an absorbent buffer for all the pain of software development, is mostly on its way out the door. (And good riddance.)

People are waking up to the fact that software quality improves when feedback loops are tighter for software engineers, which means being on call and owning services end to end. The center of gravity is shifting towards engineering teams owning the services they built.

This is awesome! You get to rent engineers from Google, AWS, PagerDuty, Pingdom, Heroku, etc. for much cheaper than if you hired them in-house (if you could even get them, which you probably can’t, because talent is scarce).

But the flip side of this is that application engineers need to get better at thinking in traditionally operations-oriented ways about reliability, architecture, instrumentation, visibility, security, and storage. Figure out what your core differentiators are, and own the shit out of those.

Nobody can care about your mission as much as you can. Own it, do it. Have fun.

28 thoughts on “Operational Best Practices #serverless”

WTF is operations? #serverless | 神刀安全网 says:

[…] Part 2: Operations in a #Serverless World […]

June 1, 2016 at 3:20 pm
Jeremy Zawodny says:

“The best engineers are flocking to solve category problems” maybe. Or it could be that the engineers who like to solve category problems are flocking to solve them. 🙂

Overall, great post. I enjoyed it. That bold line just jumped out at me as being more than a bit subjective.

June 1, 2016 at 6:27 pm
1. mipsytipsy says:
  
  Oh absolutely! Bold, unsubstantiated, based completely on my impressions and personal experience. It feels like a lot of people are just getting sick of solving the same problems over and over at single companies. But I have absolutely zero data to back this up. ^_^
  
  June 1, 2016 at 7:52 pm
Martin De Wulf (@madewulf) says:

I really enjoyed this article, probably because I agree with a lot of it 🙂

June 2, 2016 at 1:11 pm
DevOpsGuys (@DevOpsGuys) says:

Totally agree with the views expressed. We wrote something similar back in 2015 – https://blog.devopsguys.com/2015/06/30/what-does-the-future-of-it-operations-look-like-in-a-devops-world/ – about how the role of operations is changing and how we need to evolve different techniques to deal with the rate of change and abstracted hosting.

June 3, 2016 at 8:28 am
SRE Weekly Issue #26 – SRE WEEKLY says:

[…] Operational Best Practices #serverless – charity.wtf Here’s Charity Majors being awesome as always. There’s a reason this article is first this week. In this part one of two articles, Charity recaps her recent talk at serverlessconf in which she argues that you can never get away from operations, no matter how “serverless” you go. […] no matter how pretty the abstractions are, you’re still dealing with dusty old concepts like “persistent state” and “queries” and “unavailability” and so forth […] […]

June 6, 2016 at 2:42 am
nuclearpengy says:

Great post, I think I would have enjoyed your presentation. This should be required reading for startup founders before writing their first line of code. 🙂

June 8, 2016 at 10:33 pm
Siju Oommen George says:

charity.wtf, you don’t throw a pile of demeaning words on somebody and cowardly go and hide behind twitters block option. From your actions your snobish racism is very clear.No more communication please if you can afford it. Thanks. Not irritated at you either. First time a person behaves irrationally to a question asked and subsequently goes to call him stupid and mock him and hide behind twitter’s block option 🙂 🙂 🙂 cheers!

June 19, 2016 at 5:33 am
1. mipsytipsy says:
  
  i have no idea who this is or what this is about, but cheers!
  
  January 5, 2019 at 6:54 am
067: Fried chicken, Docker Swarm, tech journalism, or, “but that sweet @MattRay interpolation, tho.” – Software Defined Talk | Cote.io says:

[…] Charity Major’s write-up from her talk at the #Serverless conference […]

July 1, 2016 at 4:49 pm
The Serverless Cloud, part 1 | Daggie.be says:

[…] NoOps (No Operations) is the concept that an IT environment can become so automated and abstracted from the underlying infrastructure that there is no need for a dedicated team to manage software in-house. NoOps isn’t a new concept as this article from 2011 proves. When Serverless started gaining popularity, some people claimed there was no longer a need for Operations. Since we already established that Serverless doesn’t mean no servers, it’s obvious it also doesn’t mean No Operations. It might mean that Operations gets outsourced to a team with specialised skills, but we are still going to need: monitoring, security, remote debugging, … I am curious to see the impact on current DevOps teams though. A very interesting article on the NoOps topic, can be found over here. […]

July 5, 2016 at 9:05 pm
The Serverless Cloud, part 1 - JAX London says:

[…] NoOps (No Operations) is the concept that an IT environment can become so automated and abstracted from the underlying infrastructure that there is no need for a dedicated team to manage software in-house. NoOps isn’t a new concept as this article from 2011 proves. When Serverless started gaining popularity, some people claimed there was no longer a need for Operations. Since we already established that Serverless doesn’t mean no servers, it’s obvious it also doesn’t mean No Operations. It might mean that Operations gets outsourced to a team with specialised skills, but we are still going to need: monitoring, security, remote debugging, … I am curious to see the impact on current DevOps teams though. A very interesting article on the NoOps topic, can be found over here. […]

August 5, 2016 at 1:03 pm
Die Serverless Cloud: AWS Lambda und das Serverless Framework says:

[…] Since Serverless, as mentioned, in no way implies that we no longer need any servers at all, the administration tasks obviously do not go away entirely either. What Serverless might bring with it is an outsourcing of administration to a team with specialized skills. But even then we still need things like monitoring, security, remote debugging, etc. In any case, I am curious about the consequences that serverless programming will have for current DevOps teams. An interesting article about NoOps can be found here. […]

November 2, 2016 at 10:10 am
Serverless-Computing ermöglicht revolutionäre Geschäftsmodelle says:

[…] remote debugging, etc. For anyone who wants to know more about the topic, this article by Charity Majors is […]

November 17, 2016 at 9:36 am
Moving To Serverless Cloud Apps - says:

[…] Majors gave a fantastic talk at the first Serverless.conf on this very issue. Looking at the security angle, we know from the […]

February 14, 2017 at 2:31 pm
Moving To Serverless Cloud Apps - InfoSecHotSpot says:

[…] Majors gave a fantastic talk at the first Serverless.conf on this very issue. Looking at the security angle, we know from the […]

February 14, 2017 at 3:38 pm
Moving To Serverless Cloud Apps – Terry & CoCo says:

[…] Majors gave a fantastic talk at the first Serverless.conf on this very issue. Looking at the security angle, we know from the […]

February 17, 2017 at 6:02 am
Operational Best Practices #serverless says:

[…] This post is part two of my recap of last week’s terrific Serverless conference. If you feel like getting bitchy with me about what serverless means or #NoOps or whatever, please refer back … – Read full story at Hacker News […]

February 25, 2017 at 3:28 pm
معماری های Serverless – Refactor.Ir says:

[…] check it out. Until then, you can find her posts here and here […]

April 6, 2018 at 2:17 pm
The High Cost and Low Benefit of Unused Index Advice says:

[…] can’t outsource caring about your database’s performance to a […]

April 27, 2018 at 6:17 am
What is Serverless Architecture « andytanoko says:

[…] Charity Majors gave a great talk on this subject at the first Serverlessconf. (You can also read her two write-ups on it: WTF is operations? and Operational Best Practices.) […]

June 18, 2019 at 4:30 pm
Jayden Lemberg says:

This is an outstanding post that’s filled with so many useful nuggets. Thank you for being so detailed on serverless.

January 20, 2021 at 10:20 am
Thoughts about serveless – James On Programming says:

[…] is not to say that serverless means being able to ignore ops completely, as Charity Majors has explained. Observability is vital, and you will still encounter issues where the abstractions of serverless […]

August 22, 2023 at 5:29 am