Shipping Software Should Not Be Scary

On twitter this week, @srhtcn noted that “Many incidents happen during or right after release” and asked for advice on ways to fix this.

And he’s right!  Rolling out new software is the proximate cause for the overwhelming majority of incidents, at companies of all sizes.  Upgrading software is both a necessity and a minor insanity, considering how often it breaks things.

Image result for deploy production memeI’m not going to recap the history of continuous integration and delivery, suffice it to say that we now know that smaller and more frequent changes are much safer than larger and less frequent changes.

But it’s still risky.  And most issues are still caused by humans and our pesky need for “improvements”.  So what can be done?

It’s not ok for software releases to be scary and hazardous

First of all: If releasing is risky for you, you need to fix that.  Make this a priority.  Track your failures, practice post mortems, evaluate your on call practices andImage result for test in production meme culture.  Know if you’re getting better or worse.  This is a project that will take weeks if not months until you can be confident in the results.

You have to fix it though, because these things are self-reinforcing.  If shipping changes is scary and fraught, people will do it less and it will get even MORE scary and treacherous.

Likewise, if you turn it into a non-cortisol inducing event and set expectations, engineers will ship their code more often in smaller diffs and therefore break the world less.

Fixing deploys isn’t about eliminating errors, it’s about making your pipeline resilient to errors.  It’s fundamentally about detecting common failures and recovering from them, without requiring human intervention.

Value your tools more

As an short term patch, you should run deploys in the mornings or whenever everyone is around and fresh.  Then take a hard look at your deploy pipeline.

In too many organizations, deploy code is a technical backwater, an accumulation of crufty scripts and glue code, forked gems and interns’ earnest attempts to hack up Capistrano.  It usually gives off a strong whiff of “sloppily evolved from many 2 am patches with no code review”.Image result for test in production meme

This is insane.  Deploy software is the most important software you have.  Treat it that way: recruit an owner, allocate real time for development and testing, bake in metrics and track them over time.

If it doesn’t have an owner, it will never improve.  And you will need to invest in frequent improvements even after you’re over this first hump.

  • Signal high organizational value by putting one of your best engineers on it.
  • Recruit help from the design side of the house as well.  The “right” thing to do must be the fastest, easiest thing to do, with friendly prompts and good docs.  No “shortcuts” for people to reach for at the worst possible time.  You need user research and design here.  Image result for deploy production meme
  • Track how often deploys fail and why.  Managers should pay close attention to this metric, just like the one for people getting interrupted or woken up, and allocate time to fixing this early whenever it sags.  Before it gets bad.
  • Allocate real time for development, testing, and training — don’t expect the work to get shoved into people’s “spare time” or post mortem cleanup time.  Make sure other managers understand the impact of this work and are on board.  Make this one of your KPIs.Image result for deploy production meme

In other words, make deploy tools a first class citizen of your technical toolset.  Make the work prestigious and valued — even aspirational.  If you do performance reviews, recognize the impact there.

(Btw, “how we hardened our deploys” is total Velocity-bait (&& other practitioner conferences) as well as being great for recruiting and general visibility in blog post form.  People love these stories; there definitely aren’t enough of them.)

Turn software engineers into software owners

The canonical CI/CD advice starts with “ship early, ship often, ship smaller change sets”.  That’s great advice: you should definitely do those things.  But they are covered plenty elsewhere.  What’s software ownership?

Software ownership is the natural end state of DevOps.  Software engineers, operations engineers, platform engineers, mobile engineers — everyone who writes code should be own the full lifecycle of their software.

Software owners are people who:

  1. Write codeImage result for deploy production meme
  2. Can deploy and roll back their own code
  3. Are able to debug their own issues in prod (via instrumentation, not ssh)

If you’re lacking any one of those three ingredients, you don’t have ownership.

Why ownership?  Because software ownership makes for better engineers, better software, and a better experience for customers.  It shortens feedback loops and means the person debugging is usually the person with the most context on what has recently changed.

Some engineers might balk at this, but you’ll be doing them a favor.  We are all distributed systems engineers now, and distributed systems require a much higher level of operational literacy.  May as well start today.

Fail fast, fix fast

This is about shifting your mindset from one of brittleness and a tight grip, to one of flexibility where failures are no big deal because they happen all the time, don’t impact users, and give everyone lots of practice at detecting and recovering from them.

Here are a few of the best practices you should adopt with this practice.

Make operability a high-value skill set.  Never promote someone to “senior engineer” if they can’t deploy and debug their Image result for test in production memeown code.

Software engineers don’t have to become operational experts.  They do need to know the bare basics of instrumentation, deploy/revert, and debugging.

Everyone who puts software in production needs to understand and feel responsible for the full lifecycle of their code, not just how it works in their IDE.

Baking: it’s not just for cookies

Shipping something to production is a process of incrementally gaining confidence, not a switch you can flip.

You can’t trust code until it’s been in prod a while, Image result for deploy production memeuntil you’ve seen it perform under a wide range of load and concurrency scenarios, in lots of partial failure modes.  Only over time can you develop confidence in it not being terrible.

Nothing is production except production.  Don’t rely on never failing; expect failure, embrace failure.  Practice failure!  Build guard rails around your production systems to help you find and fix problems quickly.

The changes you need to make your pipeline more resilient are roughly the same changes you need to safely test in production.  These are a few of your guard rails.

  • Use feature flags to switch new code paths on and offImage result for test in production meme
  • Build canaries for your deploy process, so you can promote releases gracefully and automatically to larger subsets of your traffic as you gain confidence in them
  • Create cohorts.  Deploy to internal users first, then any free tier, etc in order of ascending importance.  Don’t jump from 10% to 25% to 50% and then 100% — some changes are related to saturating backend resources, and the 50%-100% jump will kill you.
  • Have robots check the health of your software as it rolls out to decide whether to promote the canary.  Over time the robot checks will mature and eventually catch a ton of problems and regressions for you.

The quality of code is not knowable before it hits production.  You may able to spot some problems, but you can never guarantee a lack of then.  It takes time to bake a new release and gain incremental confidence in new code.

In summary.

  1. Get someone to own the deploy software
  2. Value the work
  3. Create a culture of software ownership
  4. LOOK at what you’ve done after you do it
  5. Be suspicious of new versions until they prove themselves

Image result for deploy production meme

Two blog posts in one weekend!  That’s definitely never happened before.  Thanks to Baron for asking me to draft this up following the weekend’s twitter thread: https://twitter.com/mipsytipsy/status/1030340072741064704.

 

 

Shipping Software Should Not Be Scary

On Engineers and Influence

(Based on yesterday’s tweetstorm and the ensuing conversation, https://twitter.com/mipsytipsy/status/1029608573217587201)

Let’s talk about influence. As an engineer, how do you get influence? What does influence look like, what is it rooted in, how do you wield it or lose it? How is it different from the power and influence you might have as a manager?[0]

This often comes up in the context of ICs who desperately want to become managers in order to have more access to information and influence over decisions. This is a bad signal, though it’s sadly very common.

When that happens, you need to do some soul-searching. Does your org make space for senior ICs to lead and own decisions? Do you have an IC track that runs parallel to the manager track at least as high as director level? Are they compensated equally? Do youImage result for engineer software meme individual contributorhave a career ladder? Are your decision-making processes mysterious to anyone who isn’t a manager? Don’t assume what’s obvious to you is obvious to others; you have to ask around.

If so, it’s probably their own personal baggage speaking. Maybe they don’t believe you. Maybe they’ve only worked in orgs where managers had all the power. Maybe they’ve even worked in lots of places that said the exact same things as you are saying about how ICs can have great impact, but it was all a lie and now they’re burned. Maybe they aren’t used to feeling powerful for all kinds of reasons.

Regardless, people who want to be managers in order to perpetuate a bad power structure are the last people you want to be managers.[1]

But what does engineering influence look like?  How do your powers manifest?

I am going to avoid discussing the overlapping and interconnected issues of gender, race and class, let’s just acknowledge that it’s much more structurally difficult for some to wield power than for others, ok?

The power to create

Doing is the engineering superpower. We create things with just a laptop and our brain! It’s incredible! We don’t have to constantly convince and cajole and coerce others into building on our behalf, we can just build.

This may seem basic, but it matters. Creation is the ur-power from which all our forms of power flow. Nothing gets built unless we agree to build it (which makes this an ethical issue, too).

Facebook had a poster that said “CODE WINS ARGUMENTS”. Problematic in many ways, absolutely. But how many times have you seen a technical dispute resolved by who wasImage result for code wins arguments facebook willing to do the work? Or “resolved” one way.. then reversed by doing? Doing ends debates. Doing proves theories. Doing is powerful. (And “doing” doesn’t only mean “write code”.)

Furthermore, building software is a creative activity, and doing it at scale is an intensely communal one. As a creative act, we are better builders when we are motivated and inspired and passionate about our work (as compared to say, chopping wood). And as a collaborative act, we do better work when we have high trust and social cohesion.

Engineering ability and judgment, autonomy and sense of purpose, social trust and cooperative behaviors: this is the raw stuff of great engineering. Everybody has a mode or two that they feel most comfortable and authoritative operating from: we can group these roughly into archetypes.

(Examples drawn from some of the stupendously awesome senior engineers I’ve gotten to work with over the years, as well as the ways I loved to fling my weight around as an engineer.)

Archetypes of influence

  • “Doing the work that is desperately hard and desperately needed — and often desperately dull.” SOC2 compliance, backups and restores, terrifying refactors, any auth integration ever: if it’s moving the business forward, they don’t give a shit how dull the work is. If you are this engineer, you have a deep well of respect and gratitude.
  • Debugger of last resort.” Often the engineer who has been there the longest or originally built the system. If you are helpful and cheerful with your history and context, this is a huge asset. (People tend to wildly overestimate this person’s indispensability, actually; please don’t encourage this.)Image result for engineer software meme manager
  • The “expert” archetype is closely related. If you are the deep subject matter expert in some technology component, you have a shit ton of influence over anything that uses or touches that component. (You should stay up on impending changes to retain your edge.)
  • There are people who deliver a bafflingly powerful firehose of sustained output, sometimes making headway on multiple fronts at once. Some work long hours, others just have an unerring instinct for how to maximize impact (this sometimes maps to junior/senior manifestations). Nobody wants to piss off those people. Their consent is critical for … everything. Their participation will often turbo charge a project or pull a foundering effort over the finish line.

Not all influence is rooted in raw technical strength or output.  Just a few of the wide variety of creative/collaborative/interpersonal strengths:

  • Some engineers are infinitely curious, and have a way of consistently sniffing a few steps ahead of the pack. They might seem to be playing around with something pointless, and you want to scold them; then they save your ass from total catastrophe. You learn to value their playing around.
  • Some engineers solve problems socially, by making friends and trading tips and fixes and favors in the industry. Don’t underestimate social debugging, it’s often the quickest path to the right answer.Image result for influence meme
  • Some are dazzlingly lazy and blow your mind with their elegant shortcuts and corners correctly cut.
  • Some are recruiting magnets, and it’s worth paying their salary just for all the people who want to work with them again.
  • Some are skilled at driving consensus among stakeholders.
  • Some are killer explainers and educators and storytellers.
  • Some are the senior engineer everyone silently wants to grow up to be.
  • Some can tell such an inspiring story of tomorrow that everyone will run off to make it so.
  • Some teach by turning code reviews into a pedagogical art form.
  • Some make everyone around them somehow more productive and effective. Some create relentless forward momentum. Some are good at saying no.

And there are a few special wells of power that bear calling out as such.

  • Engineers who have been managers are worth their weight in gold.  They can translate business goals for junior engineers in their native language with impeccable credibility (something managers never really have, esp in junior engineers’ eyes.). They make strong tech leads, they can carve up projects into components that challenge but do not overwhelm each contributor while hitting deadlines.
  • Some engineers are a royal pain in the ass because they questionImage result for engineer software meme individual contributor and challenge every system and hierarchy. But these are sharp, powerful rocks that can polish great teams. Though they do require a strong manager, to channel t
    heir energy towards productive dialogue and improvement and keep them from pissing off the whole team.
  • And let’s not forget engineers who are on call. If you have a healthy on call culture,your ownership over production creates a deep, deep well of power and moral authority — to make demands, drive change, to prioritize. On call should not be a shit salad served up to those who can’t refuse, it should be a badge of honor and seriousness shouldered by every engineer who ships code. (And it should not be miserable or regularly life-impacting.)

… I could go on all day. Engineering is such a powerful role and skill set. It’s definitely worth unpacking where your own influence comes from, and understanding how others perceive your strengths.

Most forms of power boil down to “influence, wielded”.

But just banging out code is not enough. You may have credibility, but having it is not the same as using it. To transform influence into power you have to use it.  And the way you use it is by communicating.

What’s locked up in your head has no impact on the rest of us.  You have to get it out.

You can do this in lots of ways: by writing, in 1x1s, conversations with small groups, openly recruiting allies, convincing someone with explicit authority, broadcasting inImage result for engineer software meme individual contributorpublic, etc.

Because engineering is a creative activity, authoritarian power is actually quite brittle and damaging. The only sustainable forms of power are so-called “soft powers” like influencing and inspiring, which is why good managers use their soft power freely and hard power sparingly/with great reluctance. If your leadership invokes authority on the regular, that’s an antipattern.[2]

If you don’t speak up, you don’t have the right to sit and fume over your lack of influence. And speaking up does mean being vulnerable — and sometimes wrong — in front of other people.

This is not a zero-sum game.

Most of you have far more latent power than you realize or are used to wielding, because you don’t feel powerful or don’t recognize what you do in those terms.

Managers may have hard power and authority, but the real meaty decisions about technical delivery and excellence are more properly made by the engineers closest to them. These belong properly to the doers, in large part because they are the ones who have to support the consequences of these decisions.Image result for engineer software meme individual contributor

Power tends to flow towards managers because they are privy to more information. That makes it important to hire managers who are aware of this and lean against it to push power back to others.

In the same way that submissives have ultimate power in healthy BDSM relationships, engineers actually have the ultimate power in healthy teams. You have the ultimate veto: you can refuse to create.  Demand is high for your skills.  You can usually afford to look for better conditions. Many of you probably should.

And when technical and managerial priorities collide, who wins? Ideally you work together to find the best solution for the business and the people. The teams that feel 🔥on fire🔥 always have tight alignment between the two.

Pick your battles.

One final thought. You can have a lot of say in what gets built and how it gets built, if you cultivate your influence and spend it wisely. But you can’t have a say in everything. It doesn’t work that way.

Think of it like @mcfunley’s famous “innovation tokens”, but for attention and fucks given.
Image result for engineer software meme
The more you use your influence for good outcomes, the more you build up over time, yes … but it’s a precision tool, not background noise. Imagine someone trying to give you a massage by laying down on your whole back instead of pushing their elbow or hand into knots and trigger points. A too-broad target will diffuse your force and limit your potential impact.

Spend your attention tokens wisely.

And once you have influence, don’t forget to use it on behalf of others. Pay attention to those who aren’t being heard, and amplify their voices. Give your time, lend your patronage and credibility, and most of all teach the skills that have made you powerful to others who need them.

charity

P.S. I owe a huge debt to all the awesome senior engineers i’ve gotten to work with.  Mad love to you all.  ❤
Image result for influence meme

  • [0] I successfully answered one (1) of these questions before running out of steam.  Later. 
  • [1] Sheepish confession: this is why I became a manager.
  • [2] It’s also a bad sign if they won’t grant any explicit authority to the people they hold responsible for outcomes. I’m talking about relatively healthy orgs here, not pathological ones where people (often women) are told they don’t need promotions or explicit authority, they should just use their “soft power” — esp when the hard forms of power aligned against with them. That’s setting you up for failure.
  • [3] Some people seem caught off guard by my use of “power” to signal anything other than explicit granted powers by the org. This doesn’t make any sense to me. I find it too depressing and disempowering to think of power as merely granted authority. It doesn’t map to how I experience the world, either. Individual clout is a thing that waxes and wanes and only exists in relation to others’. I’ve seen plenty of weak managers pushed around by strong personalities (which is terrible too).
On Engineers and Influence

The Engineer/Manager Pendulum

Lately I’ve been doing some career counseling for people off Twitter (long story). The central drama for many people goes something like this:

“I’m a senior engineer, but I’m thinking about being a manager. I really like engineering, but I feel like I’m just solving the same problems over and over and it seems like the real problems are people problems. I have to be a manager to get promoted. I hope it isn’t terrible, once I make the switch. I hear it’s terrible.”

I’ve been meaning to write this post for a while. There’s a lot but let’s start with: Fuck the whole idea that only managers get career progression. And fuckkkk the idea you have to choose a “lane” and grow old there.  I completely reject this kind of slotting.

“Your advice is bad and you should feel bad”:

The best frontline eng managers in the world are the ones that are never more than 2-3 years removed from hands-on work, full time down in the trenches. The best individual contributors are the ones who have done time in management.

unicorns-are-jerksAnd the best technical leaders in the world are often the ones who do both. Back and forth.  Like a pendulum.

I’ve done this a few times myself now; start out as an early or first infra engineering hire, build the stack, then build the team, then manage the team, then … leave and start it all over again. I get antsy, I get restless. I start to feel like I know what I’m doing (… a telltale sign something’s wrong).

It’s a good cycle for people who like early stage companies, or have ADD. But I don’t see people talking about it as a career path. So I’m here to advocate for it, as an intentional and awesome way of life.

(h/t to @sarahmei who was tweetstorming this up at the EXACT SAME TIME as i was writing this.  Yes Virginia, internet feminists ARE linked by a mystical hive brain.)

On being a manager (of technical projects)

Promoting managers from within means you get those razor sharp skills from the people who just built the thing. That gives them credibility, while they struggle with their newly achieved incompetence in a different role.

helpfulcat

That’s one of the only ways you can achieve the temporary glory of a hybrid manager+tech lead. This is an unstable combination, because your engineering skills and context-sharpness are decaying the longer you do it.

You can only really improve at one of these things at a time: engineering or management.  And if you’re a manager, your job is to get better at management.  Don’t try to cling to your former glory.

Management is highly interruptive, and great engineering — where you’re learning things — requires blocking out interruptions. You can’t do these two opposite things at once.  As a manager, it is your job to be available for your team, to be interrupted. It is your job to choose to hand off the challenging assignments, so that your engineers can get better at engineering.

On being a tech lead (of people):

Conversely: the best tech leads in the world are always the ones who done time in management. This is not because they’re always the best programmers or debuggers; it’s because they know how to get shit done, which means they know how to communicate and manage other people.

A tech lead is a manager … but their first priority is achieving the task at hand, not grooming and minding the humans who work on it.

They still need the full manager toolset.  They’ll need to know how to rally people and teams and motivate them, or how to triage and restart a stalled project that everybody dreads. They still need to connect the dots between business objectives and technical objectives, and break down big objectives rainbow-swooshinto components. They need to be able to size up a junior engineer’s ability and craft a meaningful assignment, one that pushes their boundaries without crushing them … then do the same for another twenty contributors. This is management work, from the slightly shifted perspective of “Get Thing X Done” not “care for these people”.

So these tech leads usually spend more time in meetings than building things, and they will bitch about it but do it anyway, because writing code is not the best use of their time.  Tech is the easy part, herding humans is the harder part.

Senior engineers who have both these toolsets are the kind of tech leads you can build an org around, or a company around. They get shit done. And they are rare.

Almost all of them have spent considerable time in management.

The Pendulum

We don’t talk about this nearly enough: the immense breadth and strength that accrues to engineers who make a practice of going back and forth.

Being an IC is like reverse-engineering how a company works with very little Rainbow_dash_12_by_xpesifeindx-d5giyirinformation. A lot of things seem ridiculous, or pointless or inefficient from the perspective of a leaf node. .

Being a manager teaches you how the business works.  It also teaches you how people work. You will learn to have uncomfortable conversations. You will learn how to still get good work out of people who are irritated, or resentful, or who hate your guts.  You will learn how to resolve conflicts, dear god will you ever learn to resolve conflicts.  (Actually you’ll learn to YEARN for conflicts because straightforward conflict is usually better than all the other options.) You’ll go home exhausted every day and unable to articulate anything you actually did.  But you did stuff.rainbow-clouds

You’ll miss the dopamine hit of fixing something or solving something.  You’ll miss it desperately.

One last thing about management. There’s a myth that makes it really hard for people to stop managing, even when it makes them and everyone around them miserable.  And that’s the idea that management is a promotion.

Management is NOT a promotion.

Seriously, fuck that so hard. It is SUCH an insidious myth, and it leads to so many people managing even though they hate managing and have no business managing, and also starves the senior eng pool of the great mentors and elder wizards we need.

Management is not a promotion, management is a change of profession. And you will be bad at it for a long time after you start doing it.  If you don’t think you’re bad at it, you aren’t doing your job.

Managing because it feeds your ego is a terrific way to be sure that your engineers get to report to someone miserable and resentful, someone who should really be writing code

charity-gofmt
my feelings on having to only manager OR engineer for the rest of my life

or finding something else that brings them joy.

 

There’s nothing worse than reporting to someone forced into managing.  Please don’t be one of the reasons people burn out hard on tech.

It isn’t a promotion, so you don’t have any status to give up. Do it as long as it makes you happy, and the people around you happy. Then stop. Go back to building things. Wait til you get that itch again.

Then do it all over again. ❤

The Engineer/Manager Pendulum

The Accidental DBA

This morning there was yet another comment thread on hacker news about Yet Another outage involving MongoDB and data loss, this time by some company called “CleverTap”.

Recap

To summarize: the CleverTap engineering team noticed that the WiredTiger storage engine was faster than MMAPv1 for MongoDB.  They decided to … “upgrade the following weekend” (that sentence alone made my eyes bulge).

According to the blog post, they upgraded from 2.6 to 3.0, while simultaneously changing storage engines from MMAPv1 to WiredTiger, while leaving zero secondaries snapshot nodes with data on MMAPv1.  All over the course of 3 days.

(They are also running sharded mongo, with a mere 300 ops/sec on each primary, which RAISES A LOT OF QUESTIONS but I already feel like I’m beating up on these kids so I won’t pursue that.)

Questions …

(But seriously what the *hell* can you be doing to have such a low request rate, that you
rainbow-umbrella-hineed to shard at an infinitesimal volume?  Why did you specify it in req/min instead of req/sec?  What is the breakdown of reads/writes?  What is the lock percentage?  What is the avg object size??  Are these like multi-MB documents????  Why did you pause all incoming traffic and process it after the upgrade?  If the primary can’t take the extra load, why not rs.syncFrom() a secondary?   If that doesn’t work, don’t you have other, bigger problems??)

Most bafflingly of all: why wait only a few minutes after electing a new WiredTiger primary for the first time ever, and then immediately DELETE your only known-good copies of the data on MMAPv1 and re-sync over them with WiredTiger?

Accidental DBAs

Okay.  So here’s the thing: you are clearly a team of accidental DBAs.  You are operations and software engineers who have found yourselves in charge of the data.

It’s cool.  I am too!  It’s a really neat and fun place to be in.  DBAs and network admins are kind of the last remaining priesthoods in our industry. umbrella-rainbow_cm-f

There’s a lot of powerful and fun stuff to be done for generalists who pick up specialty knowledge in one of those areas, or specialists (like my neteng friend Leslie) who start bringing their skills back to the generalist side and merging the two.

(Oh Right, We Wrote A Book About This!!!)

My friend Laine and I are writing a book for people on the data side, called “Database Reliability Engineering“, which is aimed at generalist engineers who want to learn how to deal with data responsibly and effectively.

(Actually that’s a good point, I am supposed to be pitching this book! — which is really screen-shot-2016-10-01-at-7-00-15-pmmostly Laine with a smidgen of me but it’s going to be super awesome.  Consider this your sales pitch.)

So first, as an accidental DBA, you should obviously buy this book  :).  Second: stateful services require a different mindset[*].  It’s cool that you are running your own databases!  But reading post mortems like this where the conclusion is “MongoDB sucks” makes me fucking grind my teeth.

Stop treating your databases like stateless services.

There are lots of ways that MongoDB (and every other database on the planet) really sucks.  Mongo set themselves up for special rage by overpromising too much early on, and seeming tone deaf to criticism from real database engineers.

But *I* can criticize Mongo all day long.  You children on hacker news who have never run it don’t get to. 😛  If you don’t know what the fuck you’re talking about, if you’re cargo culting other people’s years-old complaints, just shut up already.

Managing stateful services like databases means that you need to be more paranoid than you did with stateless services.  With stateless services the best practices are to to roll early, roll fast, roll often, roll back.  When you’re dealing with state, you need to be careful.

With stateful services you can’t play it fast and loose like that.  You’re going to have data loss, corruption, unpredictable results, catastrophic failures that you can’t simply roll back from.  Data loss can be ruinous to your company.  (This can also be true for stateless services that sit close to your data and mutate it a lot.)

But that’s what makes it fun.  🙂

Be paranoid.

When we were moving from MMAPv1 to RocksDB at Parse, we ran hybrid replica sets for 6-9 months.  We were paranoid.  It was justified!  We spent half a year capturing production workloads and replaying them, electing Rocks primaries and rolling back, and even then keeping snapshlightningots and secondaries of both storage engines for *months*.

This isn’t because MongoDB sucks.  It’s the nature of the game, it’s the difference between stateful and stateless services.

Do you know that there was a total query engine rewrite in 2.6?  We spent months flushing out tons of crazy bugs.  Do you know about the index intersection changes?  We helped chase down bugs in those too.  (You’re welcome.)

You can’t just go “dudes it’s faster” and jump off a cliff.  This shit is basic.  Test real production workloads. Have a rollback plan.  (Not for *10 days* … try a month or two.)

Lessons

If CleverTap had run their plan past anyone experienced with data, they would have called out all of those completely predictable failures, and advised them to change it:

  1. Make one change at a time.  Do a major version upgrade separately from the storage engine upgrade.
  2. Delay between each change.  Two weeks is absolutely minimal, any thing less is careless.  Let them bake.
  3. Storage engine changes are scary.  It takes years to gain confidence in a new way of laying bits down on disk.  (Whenever people bitch and moan about mongo, I remind Rainbow-Umbrella-Z-5_5them that I’ve still lost WAY more data to MyISAM, InnoDB, and MySQL overall than Mongo.
  4. You can run lots and lots of replicas (up to 7 votes per replica set, even more nodes) per each replica set in Mongo.  This is a killer feature.  Why didn’t you use it?
  5. Keep backups around for months in the new storage engine *and* the old storage engine, just in case.  Have two hidden snapshot nodes.  The only cost is in dollars, which is fucking cheap compared to data or engineering time.

If you are a new accidental DBA, you have to make a point of learning things.  Go to conferences.  Read books.  Buy bottles of whiskey for your data friends and pick their brains.  Remember that they know things you do not.  Don’t blame the vendors when you fucked up.

Network engineering is the same way, but mistakes tend to be a lot less … permanent.  You drop some packets..  like grains of sand. ^_^

Remember that you’re in charge of keeping people’s data safe and secure.  You have much to learn.  Learn it.

And get off my fucking lawn.  ❤

Some slides from a couple of relevant talks I’ve given on the subject:

 

[*] P.S.:  “Stop treating your stateful services like stateless services” … this is a fact, but it’s not the aspiration.  DB folks should all be leaning in to the model of learning to treat our stateful services like stateless services, with the same casual disregard for individual nodes.  This is hard, and it’s going to take some time, but it’s clearly where the world is heading and it’s definitely a good thing.  🙂  The learning goes both ways!

 

rainbow-cloud-droplet

The Accidental DBA

DevOps vs SRE: delayed coverage of the dumbest war

Last week was the West Coast Velocity conference.  I had a terrific time — I think it’s the
best Velocity I’ve been to yet.  I also slipped in quite late, the evening before last, to catch Gareth’s session on DevOps vs SRE.

had to catch it, because Gareth Rushgrove (of DevOps Weekly glory) was taunting @lusis and me about it on the Internet.

rainbowdropletAnd it was worth it!   Holy crap, this was such a fun barnburner of a talk, with Gareth schizophrenically arguing both for and against the key premise of the talk, which was about “Google Infrastructure for Everyone Else (GIFEE)” and whether SRE is a) the highest, noblest goal that we should all aspire towards, or b) mostly irrelevant to anyone outside the Google confines.

Which Gareth won?  Check out the slides and judge for yourself.  🙃

 

At some point in his talk, though, Gareth tossed out something like “Charity probably already has a blog post on this drafted up somewhere.”  And I suddenly remembered “Fuck!  I DO!”  it’s been sitting in my Drafts for months god dammit.

So this is actually a thing I dashed off back in April, after CraftConf.  Somebody asked me for my opinion on the internet — always a dangerous proposition — and I went off on a bit of a rant about the differences and similarities between DevOps and SRE, as philosophies and practices.
colorful-rainbow

Time passed and I forgot about it, and then decided it was too stale.  I mean who really wants to read a rehash of someone’s tweetstorm from two months ago?

Well Gareth, apparently.

Anyway: enjoy.


SRE vs DevOps: TWO PHILOSOPHIES ENTER, BOTH ARE PHENOMENALLY SUCCESSFUL AND MUTUALLY DUBIOUS OF ONE ANOTHER


 

So in case you haven’t noticed, Google recently published a book about Site Reliability Engineering: How Google Runs Production Systems.  It contains some really terrific wisdom on how to scale both systems and orgs.  It contains chapters written by dear friends of mine.  It’s a great book, and you should buy it and read it!

Rainbow-Umbrella-Z-5_5It also has some really fucking obnoxious blurbs.  Things like about how “ONLY GOOGLE COULD HAVE DONE THIS”, and an whiff of snobbery throughout the book as though they actually believe this (which is far worse if true).

You can’t really blame the poor blurb’ers, but you can certainly look askance at a massive systems engineering org when it seems as though they’ve never heard of DevOps, or considered how it relates to SRE practices, and may even be completely unaware of what the rest of the industry has been up to for the past 10-plus years.  It’s just a little weird.

So here, for the record, is what I said about it.

 

Google is a great company with lots of terrific engineers, but you can only say they are THE

sre
The Google SRE Bible

BEST at what they do if you’re defining what they do tautologically, i.e. “they are the best at making Google run.”  Etsyans are THE BEST at running Etsy, Chefs are THE BEST at building Chef, because … that’s what they do with their lives.

Context is everything here.  People who are THE BEST at Googling often flail and flame out in early startups, and vice versa.  People who are THE BEST at early-stage startup engineering are rarely as happy or impactful at large, lumbering, more bureaucratic companies like Google.  People who can operate equally well and be equally happy at startups and behemoths are fairly rare.

And large companies tend to get snobby and forget this.  They stop hiring for uniquerainbow-swirl strengths and start hiring for lack of weaknesses or “Excellence in Whiteboard Coding Techniques,” and congratulate themselves alot about being The Best.  This becomes harmful when it translates into to less innovation, abysmal diversity numbers, and a slow but inexorable drift into dinosaurdom.

Everybody thinks their problems are hard, but to a seasoned engineer, most startup problems are not technically all that hard.  They’re tedious, and they are infinite, but anyone can figure this shit out.  The hard stuff is the rest of it: feverish pace, the need to reevaluate and reprioritize and reorient constantly, the total responsibility, the terror and uncertainty of trying to find product/market fit and perform ten jobs at once and personally deliver to your promises to your customers.

rainbow-cloud-dropletAt a large company, most of the hardest problems are bureaucratic.  You have to come to terms with being a very tiny cog in a very large wheel where the org has a huge vested interest in literally making you as replicable and replaceable as possible.  The pace is excruciatingly slow if you’re used to a startup.  The autonony is … well, did I mention the politics?  If you want autonomy, you have to master the politics.

 

Everyone.  Operational excellence is everyone’s job.  Dude, if you have a candidate come in and they’re a jerk to your office manager or your cleaning person, don’t fucking hire that person because having jerks on  your team is an operational risk (not to mention, you know, like moral issues and stuff).

But the more engineering-focused your role is, the more direct your impact will be on operational outcomes.

As a software engineer, developing strong ops chops makes you powerful.  It makes you better at debugging and instrumentation, building resiliency and observability into your own systems and interdependent systems, and building systems that other people can come along and understand and maintain long after you’re gone.rainbow-shade

As an operations engineer, those skills are already your bread and butter.  You can increase your power in other ways, like by leveling up at software engineering skills like test coverage and automation, or DBA stuff like query optimization and storage engine internals, or by helping the other teams around you level up on their skills (communication and persuasion are chronically underrecognized as core operations engineering skills).

This doesn’t mean that everyone can or should be able to do everything.  (I can’t even SAYrainbow-dot-ball the words “full stack engineer” without rolling my eyes.)  Generalists are awesome!  But past a certain inflection point, specialization is the only way an org can scale.

It’s the only way you make room for those engineering archetypes who only want to dive deep, or who really really love refactoring, or who will save the world then disappear for weeks.  Those engineers can be incredibly valuable as part of a team … but they are most valuable in a large org where you have enough generalists to keep the oars rowing along in the meantime.

So, back to Google.  They’ve done, ahem, rather  well for themselves.  Made shitbuckets of money, pushed the boundaries of tech, service hardly ever goes down.  They have operational demands that most of us never have seen and never will, and their engineers are definitely to be applauded for doing a lot of hard technical and cultural labor to get there.

So why did this SRE book ruffle a few feathers?

Mostly because it comes off a little tone deaf in places.  I’m not personally pissed off by
the google SRE book, actually, just a little bemused at how legitimately unaware they seem to be about … anything else that the industry has been doing over the past 10 years, in terms of cultural transformation, turning sysadmins into better engineers, sharing on-call rotations, developing processes around empathy and cross-functionality, engineering best practices, etc.

devops
DevOps for the rest of us

If you try and just apply Google SRE principles to your own org according to their prescriptive model, you’re gonna be in for a really, really bad time.

However, it happens that Jen Davis and Katherine Daniels just published a book called Effective DevOps, which covers a lot of the same ground with a much more varied and inclusive approach.  And one of the things they return to over and over again is the power of context, and how one-size-fits-all solutions simply don’t exist, just like unisex OSFA t-shirts are a dirty fucking lie.

Google insularity is … a thing.  On the one hand it’s great that they’re opening up a bit!rainbow-umbrella-clipart-1 On the other hand it’s a little bit like when somebody barges onto a mailing list and starts spouting without skimming any of the archives.  And don’t even get me started on what happens when you hire long, longterm ex-Googlers back into to the real world.

So, so many of us have had this experience of hiring ex-Googlers who automatically assume that the way Google does a thing is CORRECT, not just contextually appropriate.   Not just right for Google, but right for everyone, always.  Which is just obviously untrue.  But the reassimilation process can be quite long and exhausting when the Kool-Aid is so strong.

Because yeah, this is a conversation and a transformation that the industry has been having for a long time now.  Compared with the SRE manifesto, the DevOps philosophy is much more crowd-sourced, more flexible, and adaptable to organizations of all stages of developments, with all different requirements and key business differentiators, because it’s benefited from loud, mouthy contributors who aren’t all working in the same bubble all along.

And it’s like Google isn’t even aware this was happening, which is weird.

Orrrrrr, maybe I’m just a wee bit annoyed that I’ve been drawn into this position rainbow-dot-ballof having to defend “DevOps”, after many excellent years spent being grumpy about the word and the 10000010101 ways it is used and abused.

(Tell me again about your “DevOps Engineering Team”, I dare you.)

P.S. I highly encourage you to go read the epic hours-long rant by @matthiasr that kicked off the whole thing.  some of which I definitely endorse and some of which not, but I think we could go drink whiskey and yell about this for a week or two easy breezy  ❤

Anyway what the fuck do I know, I’ve never worked in the Google lair, so maybe I am just under-equipped to grasp the true glory, majesty and superiority of their achievements over us all.

Or maybe they should go read Katherine and Jen’s book and interact with the “UnGoogled” once in a while.  ☺️

colorful-rainbow

 

DevOps vs SRE: delayed coverage of the dumbest war

WTF is operations? #serverless

I just got back from the very first ever @serverlessconf in NYC.  I have a soft spot for well-curated single-track conferences, and the organizers did an incredible job.  Major kudos to @iamstan and team for pulling together such a high-caliber mix of attendees as well as presenters.

I’m really honored that they asked me to speak.  And I had a lot of fun delivering my talk!  But in all honesty, I turned it down a few times — and then agreed, and then backed out, and then agreed again at the very last moment.  I just had this feeling like the attendees weren’t going to want to hear what I was gonna say, or like we weren’t gonna be speaking the same language.

Rainbow_dash_12_by_xpesifeindx-d5giyirWhich … turned out to be mmmmostly untrue.  To the organizers’ credit, when I expressed this concern to them, they vigorously argued that they wanted me to talk *because* they wanted a heavy dose of real talk in the mix along with all the airy fairy tales of magic and success.

 

So #serverless is the new cloud or whatever

Hi, I’m grouchy and I work with operations and data and backend stuff.  I spent 3.5 years helping Parse grow from a handful of apps to over a million.  Literally building serverless before it was cool TYVM.

So when I see kids saying “the future is serverless!” and “#NoOps!” I’m like okay, that’s cute.  I’ve lived the other side of this fairytale.  I’ve seen what happens when application developers think they don’t have to care about the skills associated with operations engineering.  When they forget that no matter how pretty the abstractions are, you’re still dealing with dusty old concepts like “persistent state” and “queries” and “unavailability” and so forth, or when they literally just think they can throw money at a service to make it go faster because that’s totally how services work.

I’m going to split this up into two posts.  I’ll write up a recap of my talk in a sec, but first let’s get some things straight.  Like words.  Like operations.

What is operations?

Let’s talk about what “operations” actually means, in the year 2016, assuming a reasonably high-functioning engineering environment.

Screen Shot 2016-05-30 at 4.29.09 PM

At a macro level, operational excellence is not a role, it’s an emergent property.  It is how you get shit done.

Operations is the sum of all of the skills, knowledge and values that your company has built up around the practice of shipping and maintaining quality systems and software.  It’s your implicit values as well as your explicit values, habits, tribal knowledge, reward systems.  Everybody from tech support to product people to CEO participates in your operational outcomes, even though some roles are obviously more specialized than others.

Saying you have an ops team who is solely responsible for reliability is about as silly as saying that “HR defines and owns our company culture!”  No.  Managers and HR professionals may have particular skills and responsibilities, but culture is an emergent property and everyone contributes (and it only takes a couple bad actors to spoil the bushel).

Thinking about operational quality in terms of “a thing some other team is responsible for” is just generally not associated with great outcomes.  It leads to software engineers who are less proficient or connected to their outcomes, ops teams who get burned out, and an overall lower quality of software and services that get shipped to customers.

These are the specialized skill sets that I associate with really good operations engineers.  Do these look optional to you?

Screen Shot 2016-05-30 at 4.41.21 PM

It depends on your mission, but usually these are not particularly optional.  If you have customers, you need to care about these things.  Whether you have a dedicated ops team or not.  And you need to care about the tax it imposes on your humans too, especially when it comes to the cognitive overhead of complex systems.

So this is my definition of operations.  It doesn’t have to be your definition.  But I think it is a valuable framework for helping us reason about shipping quality software and healthy teams.  Especially given the often invisible nature of operations labor when it’s done really well.  It’s so much easier to notice and reward shipping shiny features than “something didn’t break”.

The inglorious past

Don’t get me wrong — I understand why “operations” has fallen out of favor in a lot of crowds.  I get why Google came up with “SRE” to draw a line between what they needed and what the average “sysadmin” was doing 10+ years ago.

Ops culture has a number of well-known and well-documented pathologies: hero/martyr complexes, risk aversion, burnout, etc.  I understand why this is offputting and we need to fix it.

Also, historically speaking, ops has attracted a greater proportion of nontraditional oddballs who just love debugging and building things — fewer Stanford CS PhDs, more tinkerers and liberal arts majors and college dropouts (hi).  And so they got paid less, and had less engineering discipline, and burned themselves out doing too much ad hoc labor.Rainbow_Dash_3.png

But — this is no longer our overwhelming reality, and it is certainly not the reality we are hurtling towards.  Thanks to the SRE movement, and the parallel and even more powerful & diverse open source DevOps movement, operations engineers are … engineers.  Who specialize in infrastructure.  And there’s more value than ever in empathy and fluid skill sets, in engineers who are capable of moving between disciplines and translating between specialties.  This is where the “full-stack developer” buzzword comes from.  It’s annoying, but reflects a real craving for generalist skill sets.

The BOFH stereotype is dead.  Some of the most creative cultural and technical changes in the technical landscape are being driven by the teams most identified with operations and developer tooling.  The best software engineers I know are the ones who consistently value the impact and lifecycle of the code they ship, and value deployment and instrumentation and observability.  In other words, they rock at ops stuff.

The Glorious Future

And so I think it’s time to bring back “operations” as a term of pride.  As a thing that is valued, and rewarded.  As a thing that every single person in an org understands as being critical to success.  Every organization has unique operational needs, and figuring out what they are and delivering on them takes a lot of creativity and ingenuity on both the cultural and technical front.

“Operations” comes with baggage, no doubt.  But I just don’t think that distance and denial are aRainbow_Dash_in_flightn effective approach for making something better, let alone trash talking and devaluing the skill sets that you need to deliver quality services.

You don’t make operational outcomes magically better by renaming the team “DevOps” or “SRE” or anything else.  You make it better by naming it and claiming it for what it is, and helping everyone understand how their role relates to your operational objectives.

And now that I have written this blog post I can stop arguing with people who want to talk about “DevOps Engineers” and whether “#NoOps” is a thing and maybe I can even stop trolling them back about the nascent “#NoDevs” movement.  (Haha just kidding, that one is too much fun.)

Part 2: Operations in a #Serverless World

 

WTF is operations? #serverless