Shipping Software Should Not Be Scary

On twitter this week, @srhtcn noted that “Many incidents happen during or right after release” and asked for advice on ways to fix this.

And he’s right! Rolling out new software is the proximate cause for the overwhelming majority of incidents, at companies of all sizes. Upgrading software is both a necessity and a minor insanity, considering how often it breaks things.

I’m not going to recap the history of continuous integration and delivery. Suffice it to say that we now know that smaller and more frequent changes are much safer than larger and less frequent changes.

But it’s still risky. And most issues are still caused by humans and our pesky need for “improvements”. So what can be done?

It’s not ok for software releases to be scary and hazardous

First of all: If releasing is risky for you, you need to fix that. Make this a priority. Track your failures, practice post mortems, evaluate your on call practices and culture. Know if you’re getting better or worse. This is a project that will take weeks if not months until you can be confident in the results.

You have to fix it though, because these things are self-reinforcing. If shipping changes is scary and fraught, people will do it less and it will get even MORE scary and treacherous.

Likewise, if you turn it into a non-cortisol-inducing event and set expectations, engineers will ship their code more often, in smaller diffs, and therefore break the world less.

Fixing deploys isn’t about eliminating errors; it’s about making your pipeline resilient to errors. It’s fundamentally about detecting common failures and recovering from them, without requiring human intervention.
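To make "detect and recover without a human" concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `release` and `is_healthy` stand in for whatever your own deploy tooling and health checks look like, and a real pipeline would wait and retry rather than check once.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DeployResult:
    version: str       # version left running when the function returns
    rolled_back: bool  # True if the new version failed and was replaced


def deploy_with_rollback(
    new_version: str,
    last_good_version: str,
    release: Callable[[str], None],     # hypothetical: pushes a version live
    is_healthy: Callable[[str], bool],  # hypothetical: automated health check
) -> DeployResult:
    """Release new_version; if it fails, restore last_good_version."""
    try:
        release(new_version)
        healthy = is_healthy(new_version)
    except Exception:
        # A crash during release or checking counts as a failed deploy.
        healthy = False
    if healthy:
        return DeployResult(version=new_version, rolled_back=False)
    # Recover automatically instead of waiting for a human to notice.
    release(last_good_version)
    return DeployResult(version=last_good_version, rolled_back=True)
```

The point of the shape is that the rollback path is ordinary code that runs on every failed deploy, so it gets exercised constantly instead of only during an emergency.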

Value your tools more

As a short-term patch, you should run deploys in the mornings or whenever everyone is around and fresh. Then take a hard look at your deploy pipeline.

In too many organizations, deploy code is a technical backwater, an accumulation of crufty scripts and glue code, forked gems and interns’ earnest attempts to hack up Capistrano. It usually gives off a strong whiff of “sloppily evolved from many 2 am patches with no code review”.

This is insane. Deploy software is the most important software you have. Treat it that way: recruit an owner, allocate real time for development and testing, bake in metrics and track them over time.

If it doesn’t have an owner, it will never improve. And you will need to invest in frequent improvements even after you’re over this first hump.

Signal high organizational value by putting one of your best engineers on it.
Recruit help from the design side of the house as well. The “right” thing to do must be the fastest, easiest thing to do, with friendly prompts and good docs. No “shortcuts” for people to reach for at the worst possible time. You need user research and design here.
Track how often deploys fail and why. Managers should pay close attention to this metric, just like the one for people getting interrupted or woken up, and allocate time to fixing this early whenever it sags. Before it gets bad.
Allocate real time for development, testing, and training — don’t expect the work to get shoved into people’s “spare time” or post mortem cleanup time. Make sure other managers understand the impact of this work and are on board. Make this one of your KPIs.
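Tracking how often deploys fail, and why, does not need much machinery to get started. The following is a rough sketch, assuming you can call one function at the end of every deploy; the `DeployLog` class and the reason strings are made up for illustration, and in practice you would likely send these numbers to whatever metrics system you already run.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DeployLog:
    """Hypothetical in-memory record of deploy outcomes."""
    total: int = 0
    failure_reasons: Counter = field(default_factory=Counter)

    def record(self, failure_reason: Optional[str] = None) -> None:
        """Record one deploy; pass a short reason string if it failed."""
        self.total += 1
        if failure_reason is not None:
            self.failure_reasons[failure_reason] += 1

    def failure_rate(self) -> float:
        """Fraction of recorded deploys that failed (0.0 if none recorded)."""
        if self.total == 0:
            return 0.0
        return sum(self.failure_reasons.values()) / self.total

    def top_reasons(self, n: int = 3) -> List[Tuple[str, int]]:
        """Most common failure reasons, most frequent first."""
        return self.failure_reasons.most_common(n)
```

Even a crude count like this gives a manager a trend line to watch and a ranked list of what to fix first.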

In other words, make deploy tools a first class citizen of your technical toolset. Make the work prestigious and valued — even aspirational. If you do performance reviews, recognize the impact there.

(Btw, “how we hardened our deploys” is total Velocity-bait (&& other practitioner conferences) as well as being great for recruiting and general visibility in blog post form. People love these stories; there definitely aren’t enough of them.)

Turn software engineers into software owners

The canonical CI/CD advice starts with “ship early, ship often, ship smaller change sets”. That’s great advice: you should definitely do those things. But they are covered plenty elsewhere. What’s software ownership?

Software ownership is the natural end state of DevOps. Software engineers, operations engineers, platform engineers, mobile engineers: everyone who writes code should own the full lifecycle of their software.

Software owners are people who:

Write code
Can deploy and roll back their own code
Are able to debug their own issues in prod (via instrumentation, not ssh)

If you’re lacking any one of those three ingredients, you don’t have ownership.

Why ownership? Because software ownership makes for better engineers, better software, and a better experience for customers. It shortens feedback loops and means the person debugging is usually the person with the most context on what has recently changed.

Some engineers might balk at this, but you’ll be doing them a favor. We are all distributed systems engineers now, and distributed systems require a much higher level of operational literacy. May as well start today.

Fail fast, fix fast

This is about shifting your mindset from one of brittleness and a tight grip, to one of flexibility where failures are no big deal because they happen all the time, don’t impact users, and give everyone lots of practice at detecting and recovering from them.

Here are a few of the practices that go with this mindset.

The engineer who writes the code and merges the PR should also run the deploy
Everyone who writes code must be trained in how to deploy, roll back & revert to last known good state (before escalating if necessary). They should also know the basics of instrumentation, feature flagging and debugging in prod.
After deploying you MUST go verify: are your changes behaving as expected? Does anything else look ... unexpected? You have the most context on what to expect; just two minutes spent verifying that things look reasonable will catch the overwhelming majority of errors before users even notice.
Practice observability-driven development. Instrument each change so you can verify it is working. (Hell, instrument in advance in order to determine the impact of your proposed change and see if it’s even worth doing.)
(If you expect your engineers to do this kind of side-by-side comparison, you need solid observability for your instrumentation: something with high cardinality support (like honeycomb) that lets you drill down to the individual event level. If your software engineers are flying blind, that limits the amount of ownership you can reasonably expect.)
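The "go verify" step above can be as simple as comparing the old build against the new one. Here is a minimal sketch, assuming each request is recorded as an event carrying a `build_id` and an HTTP `status`; both field names, and the 1% tolerance, are assumptions for illustration, not any particular vendor's schema.

```python
from typing import Any, Dict, Iterable, Mapping


def error_rate_by_build(events: Iterable[Mapping[str, Any]]) -> Dict[str, float]:
    """Group request events by build_id and return each build's 5xx rate."""
    totals: Dict[str, int] = {}
    errors: Dict[str, int] = {}
    for event in events:
        build = str(event["build_id"])
        totals[build] = totals.get(build, 0) + 1
        if int(event["status"]) >= 500:
            errors[build] = errors.get(build, 0) + 1
    return {build: errors.get(build, 0) / totals[build] for build in totals}


def looks_regressed(
    rates: Mapping[str, float],
    old_build: str,
    new_build: str,
    tolerance: float = 0.01,  # assumed acceptable increase in error rate
) -> bool:
    """True if the new build's error rate exceeds the old one's by more than tolerance."""
    return rates[new_build] > rates[old_build] + tolerance
```

This only works if every event is tagged with the build that served it, which is exactly the kind of instrumentation you add alongside the change itself.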

Make operability a high-value skill set. Never promote someone to “senior engineer” if they can’t deploy and debug their own code.

Software engineers don’t have to become operational experts. They do need to know the bare basics of instrumentation, deploy/revert, and debugging.

Everyone who puts software in production needs to understand and feel responsible for the full lifecycle of their code, not just how it works in their IDE.

Baking: it’s not just for cookies

Shipping something to production is a process of incrementally gaining confidence, not a switch you can flip.

You can’t trust code until it’s been in prod a while, until you’ve seen it perform under a wide range of load and concurrency scenarios, in lots of partial failure modes. Only over time can you develop confidence in it not being terrible.

Nothing is production except production. Don’t rely on never failing; expect failure, embrace failure. Practice failure! Build guard rails around your production systems to help you find and fix problems quickly.

The changes you need to make your pipeline more resilient are roughly the same changes you need to safely test in production. These are a few of your guard rails.

Use feature flags to switch new code paths on and off
Build canaries for your deploy process, so you can promote releases gracefully and automatically to larger subsets of your traffic as you gain confidence in them
Create cohorts. Deploy to internal users first, then any free tier, etc in order of ascending importance. Don’t jump from 10% to 25% to 50% and then 100% — some changes are related to saturating backend resources, and the 50%-100% jump will kill you.
Have robots check the health of your software as it rolls out to decide whether to promote the canary. Over time the robot checks will mature and eventually catch a ton of problems and regressions for you.
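The cohort and robot-check guard rails above fit together naturally. The sketch below is one possible shape, not a definitive implementation: the cohort names, `deploy_to`, and `healthy` are all hypothetical placeholders for your own tooling.

```python
from typing import Callable, List, Sequence

# Hypothetical cohorts, in order of ascending importance.
DEFAULT_COHORTS = ("internal", "free_tier", "paid")


def staged_rollout(
    version: str,
    cohorts: Sequence[str],
    deploy_to: Callable[[str, str], None],  # hypothetical: (version, cohort)
    healthy: Callable[[str, str], bool],    # hypothetical: the "robot" check
) -> List[str]:
    """Deploy version cohort by cohort; stop at the first failed health check.

    Returns the cohorts that were promoted successfully.
    """
    promoted: List[str] = []
    for cohort in cohorts:
        deploy_to(version, cohort)
        if not healthy(version, cohort):
            break  # stop promoting; the caller decides how to roll back
        promoted.append(cohort)
    return promoted
```

A real health check would let the release bake for a while under real traffic before answering, and the list of things it checks would grow as you learn from each incident.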

The quality of code is not knowable before it hits production. You may be able to spot some problems, but you can never guarantee a lack of them. It takes time to bake a new release and gain incremental confidence in new code.
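Feature flags, the first guard rail in the list above, are one way to gain that confidence incrementally. Here is a minimal sketch of a percentage-based flag; the function name and the hashing scheme are assumptions for illustration, and a real flag system would also need somewhere to store and change `rollout_percent` at runtime.

```python
import hashlib


def flag_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Return True if this user falls inside the flag's rollout percentage.

    Hashing the flag name plus user id gives each user a stable bucket in
    0..99, so a user who gets the new code path keeps it as the percentage
    grows, and different flags bucket users independently.
    """
    key = f"{flag_name}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return bucket < rollout_percent
```

Setting `rollout_percent` to 0 switches the new path off for everyone without a deploy, which is what makes a flag useful as an emergency brake.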

In summary.

Get someone to own the deploy software
Value the work
Create a culture of software ownership
LOOK at what you’ve done after you do it
Be suspicious of new versions until they prove themselves

Two blog posts in one weekend! That’s definitely never happened before. Thanks to Baron for asking me to draft this up following the weekend’s twitter thread: https://twitter.com/mipsytipsy/status/1030340072741064704.


charity.wtf

charity wtf's about technology, databases, startups, engineering management, and whiskey.
