Questionable Advice: “After Being A Manager, Can I Be Happy As A Cog?”

One of my stretch goals for 2019 was to start writing an advice column.  I get a lot of questions about everything under the sun: observability, databases, career advice, management problems, what the best stack is for a startup, how to hire and interview, etc.  And while I enjoy this, having a high opinion of my own opinions and all, it doesn’t scale as well as writing essays.  I do have a (rather all-consuming) day job.

So I’d like to share some of the (edited and lightly anonymized) questions I get asked and some of the answers I have given.  With permission, of course.  And so, with great appreciation to my anonymous correspondent for letting me publish this, here is one.

Hi Charity,

I’ve been in tech for 25 years.  I don’t have a degree, but I worked my way up from menial jobs to engineering, and since then I have worked on some of the biggest sites in the world.  I have been offered a management role many times, but every time I refused.  Until about two years ago, when I said “fuck it, I’m almost 40; why not try.”

I took the job with boundless enthusiasm and motivation, because the team was honestly a mess.  We were building everything on-prem, and ops was constantly bullying developers over their supposed incompetence.  I had gone to conferences, listened to podcasts, and read enough blog posts that my head was full of “DevOps/CloudNative/ServiceOriented/You-build-it-you-run-it/ServantLeaders” idealism.  I knew I couldn’t make it any worse, and thought maybe, just maybe I could even make it better.

Soon after I took the job, though, there were company-wide layoffs.   It was not done well, and morale was low and sour.  People started leaving  for happier pastures.  But I stayed.  It was an interesting challenge, and I threw my heart and soul into it.

For two years I have stayed and ground it out: recruiting (oh, that is so hard), hiring, and then starting a migration to a cloud provider, and with the help of more and more people on the new team, slowly shifting the mindset of the whole engineering group to embrace devops best practices.  Now service teams own their code in production, are on call for it, and migrate themselves to the cloud with my team supporting them and building tools for them.  It is almost unrecognizable compared to where we were when I began managing.

A beautiful story isn’t it?  I hope you’re still reading.  🙂

Now I have to say that with my schedule full of 1:1s, budgeting, hiring, firing, publishing mission statements and OKRs, shaping the teams, wielding influence, I realized that I enjoyed none of the above.  I read your 17 reasons not to be a manager, and I check so many boxes.  It is a pain in the ass to constantly listen to people’s egos, talk to them, and keep everybody aligned (which obviously never happens).  And of course I am being crushed between top-down on-the-spot business decisions and bottom-up frustration over poorly executed engineering work under deadlines.  I am also destroyed by the mistrust and power games I am witnessing (or am involved in, sometimes), while I long for collaboration and trust.  And of course when things go well my team gets all the praise, and when things go wrong I take all the blame.  I honestly don’t know how one can survive without the energy provided by praise and a sense of achievement.

All of the above makes me miss being an IC (Individual Contributor), where I could work for 8 hours straight without talking to anyone, build stuff, say what I wanted when I wanted, switch jobs if I wasn’t happy, and basically be a little shit like the ones you mention in your article.

Now you may say it’s obvious: I should find a new IC job in a healthier company.  You even wrote about it.  Going back to IC after two years of management is actually a good move.

But when I think about doing it, I get stuck.  I don’t know if I would be able to do it again, or if I could still enjoy it.  I’ve seen too many things, I’ve tasted what it’s like to be (sometimes) in control, and I did have a big impact on the company’s direction over time.  I like that.  If I went back to being an IC, I would feel small and meaningless, like just another cog in the machine.  And of course, being 40-ish, I would be competing with all those 20-something smartasses who were born with kubernetes.

Thank you for reading.  Could you give me your thoughts on this?  In any case, it was good to get it off my chest.

Cheers,

Cog?

Dear Cog?,

Holy shitballs!  What an amazing story!  That is an incredible achievement in just two years, let alone as a rookie manager.  You deserve huge props for having the vision, the courage, and the tenacity to drive such a massive change through.

Of COURSE you’re feeling bored and restless.  You didn’t set out on a glorious quest for a life of updating mission statements and OKRs, balancing budgets, tending to people’s egos and fluffing their feelings, tweaking job descriptions, endless 1:1s and meetings meetings meetings, and the rest of the corporate middle manager’s portfolio.  You wanted something much bigger.  You wanted to change the world.  And you did!

But now you’ve done it.  What’s next?

First of all, YOUR COMPANY SUCKS.  You don’t once mention your leadership — where are they in all this?  If you had a good manager, they would be encouraging you and eagerly lining up a new and bigger role to keep you challenged and engaged at work.  They are not, so they don’t deserve you.  Fuck em.  Please leave.

Another thing I am hearing from you is, you harbor no secret desire to climb the managerial ranks at this time.  You don’t love the daily rhythms of management (believe it or not, some do); you crave novelty and mastery and advancement.  It sounds like you are willing to endure being a manager, so long as that is useful or required in order to tackle bigger and harder problems.  Nothing wrong with that!  But when the music stops, it’s time to move on.  Nobody should be saddled with a manager whose heart isn’t in the work.

You’re at the two year mark.  This is a pivotal moment, because it’s the beginning of the end of the time when you can easily slip back into technical work.  It will get harder and harder over the next 2-3 years, and at some point you will no longer have the option.

Picking up another technical role is the most strategic option, the one that maximizes your future opportunities as a technical leader.  But you do not seem excited by this option; instead you feel many complex and uncomfortable things.  It feels like going backwards.  It feels like losing ground.  It feels like ceding status and power.

“Management isn’t a promotion, it’s a career change.”

But if management is not a promotion, then going back to an engineering role should not feel like a demotion!  What the fuck?!

It’s one thing to say that.  Whether it’s true or not is another question entirely, a question of policy and org dynamics.  The fact is that in most places, most of the power does go to the managers, and management IS a promotion.  Power flows naturally away from engineers and towards managers unless the org actively and vigorously pushes back on this tendency by explicitly allocating certain powers and responsibilities to other roles.

I’m betting your org doesn’t do this.  So yeah, going back to being an IC WILL be a step down in terms of your power and influence and ability to set the agenda.  That’s going to feel crappy, no question. We humans hate that.

Three points.

      1. You cannot go back to doing exactly what you did before, for the very simple reason that you are not the same person.  You are going to be attuned to power dynamics and ways of influencing that you never were before — and remember, leadership is primarily exercised through influence, not explicit authority.  Senior ICs who have been managers are supremely powerful beings, who tend to wield outsize influence.  Smart managers will lean on them extensively for everything from shadow management and mentorship to advice, strategy, etc.  (Dumb managers don’t.  So find a smart manager who isn’t threatened by your experience.)
      2. You’re a short-timer here, remember?  Your company sucks.  You’re just renewing your technical skills and pulling a paycheck while finding a company that will treat you better, that is more aligned with your values.
      3. Lastly (and most importantly), I have a question.  Why did you need to become a manager in order to drive sweeping technical change over the past two years?  WHY couldn’t you have done it as a senior IC?  Shouldn’t technical people be responsible for technical decisions, and people managers responsible for people decisions?
        Could this be your next challenge, or part of it?  Could you go back to being an engineer, equipped with your shiny new powers of influence and mystical aura of recent management experience, and use it to organize the other senior ICs to assert their rightful ownership over technical decisions?  Could you use your newfound clout with leadership and upper management to convince them that this will help them recruit and retain better talent, and is a better way to run a technical org — for everyone?

     

I believe this is a better way, but I have only ever seen these changes happen when agitated for and demanded by the senior ICs.  If the senior ICs don’t assert their leadership, managers are unlikely to give it to them.  If managers try, but senior ICs don’t inhabit their power, eventually the managers just shrug and go back to making all the decisions.  That is why ultimately this is a change that must be driven and owned — at a minimum co-owned — by the senior individual contributors.

I hope you can push back against that fear of being small and meaningless as an individual contributor.  The fact that it very often is this way, especially in strongly hierarchical organizations, does not mean that it has to be this way; and in healthy organizations it is not this way.  Command-and-control systems are not conducive to creative flourishing.  We have to fight the baggage of the authoritarian structures we inherited in order to make better ones.

Organizations are created afresh each and every day — not created for us, but by us.  Help create the organization you want to work at, where senior people are respected equally and have domains of ownership whether they manage people or technology.  If your current gig won’t value that labor, find one that will.

They exist.  And they want to hire you.

Lots of companies are DYING to hire this kind of senior IC: someone who is still hands-on yet feels responsibility for the team as a whole, who knows the business side, who knows how to mentor and craft a culture, and who can herd cats when necessary.

There are companies that know how to use ICs at the strategic level, even executive level.  There are bosses who will see you not as a threat, but as a *huge asset* they can entrust with monumental work.

As a senior contributor who moves fluidly between roles, you are especially well-equipped to help shape a sociotechnical organization.  Could you make it your mission to model the kind of relationship you want to see between management and ICs, whichever side you happen to be on?  We need more people figuring out how to build organizations where management is not a promotion, just a change of career, and where going back and forth carries no baggage about promotions and demotions.  Help us.

And when you figure it out, please don’t keep it to yourself.  Expand your influence and share your findings by writing up your experiences in blog posts, in articles, in talks.  Tell stories.  Show people how much better it is this way.  Be so magnificently effective and mysteriously influential as a senior IC that all the baby engineers you work with want to grow up to be just like you.

Hope this helps.


charity

P.S. — Oh, and stop fretting about “competing” with the 20-something kubernetes-heads, you dork.  You have been learning shit your whole career and you’ll learn this shit too.  The tech is the easy part.  The tech will always be the easy part.  🙂


Deploys: It’s Not Actually About Fridays

I just read this piece, which is basically a very long subtweet about my Friday deploy threads.  Go on and read it: I’ll wait.

Here’s the thing.  After getting over some of the personal gibes (smug optimism?  literally no one has ever accused me of being an optimist, kind sir), you may be expecting me to issue a vigorous rebuttal.  But I shan’t.  Because we are actually in violent agreement, almost entirely.

I have repeatedly stressed the following points:

  1. I want to make engineers’ lives better, by giving them more uninterrupted weekends and nights of sleep.  This is the goal that underpins everything I do.
  2. Anyone who ships code should develop and exercise good engineering judgment about when to deploy, every day of the week
  3. Every team has to make their own determination about which policies and norms are right given their circumstances and risk tolerance
  4. A policy of “no Friday deploys” may be reasonable for now but should be seen as a smell, a sign that your deploys are risky.  It is also likely to make things WORSE for you, not better, by causing you to adopt other risky practices (e.g. elongating the interval between merge and deploy, batching changes up in a single deploy)

This has been the most frustrating thing about this conversation: that a) I am not in fact the absolutist y’all are arguing against, and b) MY number one priority is engineers and their work/life balance.  Which makes this particularly aggravating:

Lastly there is some strange argument that choosing not to deploy on Friday “Shouldn’t be a source of glee and pride”. That one I haven’t figured out yet, because I have always had a lot of glee and pride in being extremely (overly?) protective of the work/life balance of the engineers who either work for me, or with me.  I don’t expect that to change.

Hold up.  Did you catch that clever little logic switcheroo?  You defined “not deploying on Friday” as being a priori synonymous with “protecting the work/life balance of engineers”.  This is how I know you haven’t actually grasped my point, and are arguing against a straw man.  My entire point is that the behaviors and practices associated with blocking Friday deploys are in fact hurting your engineers.

I, too, take a lot of glee and pride in being extremely, massively, yes even OVERLY protective of the work/life balance of the engineers who either work for me, or with me.

AND THAT IS WHY WE DEPLOY ON FRIDAYS.

Because it is BETTER for them.  Because it is part of a deploy ecosystem which results in them being woken up less and having fewer weekends interrupted overall than if I had blocked deploys on Fridays.

It’s not about Fridays.  It’s about having a healthy ecosystem and feedback loop where you trust your deploys, where deploys aren’t a big deal, and they never cause engineers to have to work outside working hours.  And part of how you get there is by not artificially blocking off a big bunch of the week and not deploying during that time, because that breaks up your virtuous feedback loop and causes your deploys to be much more likely to fail in terrible ways.

The other thing that annoys me is when people say, primly, “you can’t guarantee any deploy is safe, but you can guarantee people have plans for the weekend.”

Know what else you can guarantee?  That people would like to sleep through the fucking night, even on weeknights.

When I hear people say this all I hear is that they don’t care enough to invest the time to actually fix their shit so it won’t wake people up or interrupt their off time, seven days a week.  Enough with the virtue signaling already.

You cannot have it both ways, where you block off a bunch of undeployable time AND you have robust, resilient, swift deploys.  Somehow I keep not getting this core point across to a substantial number of very intelligent people.  So let me try a different way.

Let’s try telling a story.

A tale of two startups

Here are two case studies.

Company X

Company X is a three-year-old startup.  It is a large, fast-growing multi-tenant platform on a large distributed system with spiky traffic, lots of user-submitted data, and a very green database.  Company X deploys the API about once per day, and does a global deploy of all services every Tuesday.  Deploys often involve some firefighting and a rollback or two, and Tuesdays often involve deploying and reverting all day (sigh).

Pager volume at Company X isn’t the worst, but usually involves getting woken up a couple times a week, and there are deploy-related alerts after maybe a third of deploys, which then need to be triaged to figure out whose diff was the cause.

Company Z

Company Z is a three-year-old startup.  It is a large, fast-growing multi-tenant platform on a large distributed system with spiky traffic, lots of user-submitted data, and a very green house-built distributed storage engine.  Company Z automatically triggers a deploy within 30 minutes of a merge to master, for all services impacted by that merge.  Developers at company Z practice observability-driven deployment, where they instrument all changes, ask “how will I know if this change doesn’t work?” during code review, and have a muscle memory habit of checking to see if their changes are working as intended or not after they merge to master.

Deploys rarely result in the pager going off at Company Z; most problems are caught visually by the engineer and reverted or fixed before any paging alert can fire.  Pager volume consists of roughly one alert per week outside of working hours, and no one is woken up more than a couple times per year.

Same damn problem, better damn solutions.

If it wasn’t extremely obvious, these companies are my last two jobs, Parse (company X, from 2012-2016) and Honeycomb (company Z, from 2016-present).

They have a LOT in common.  Both are services for developers, both are platforms, both are running highly elastic microservices written in golang, both get lots of spiky traffic and store lots of user-defined data in a young, homebrewed columnar storage engine.  They were even built by some of the same people (I built infra for both, and they share four more of the same developers).

At Parse, deploys were run by ops engineers because of how common it was for there to be some firefighting involved.  We discouraged people from deploying on Fridays, we locked deploys around holidays and big launches.  At Honeycomb, none of these things are true.  In fact, we literally can’t remember a time when it was hard to debug a deploy-related change.


What’s the difference between Company X and Company Z?

So: what’s the difference?  Why are the two companies so dramatically different in the riskiness of their deploys, and the amount of human toil it takes to keep them up?

I’ve thought about this a lot.  It comes down to three main things.

  1. Observability
  2. Observability-driven development
  3. Single merge per deploy

1. Observability. 

I think that I’ve been reluctant to hammer this home as much as I ought to, because I’m exquisitely sensitive about sounding like an obnoxious vendor trying to sell you things.  😛  (Which has absolutely been detrimental to my argument.)

When I say observability, I mean in the precise technical definition as I laid out in this piece: with high cardinality, arbitrarily wide structured events, etc.   Metrics and other generic telemetry will not give you the ability to do the necessary things, e.g. break down by build id in combination with all your other dimensions to see the world through the lens of your instrumentation.  Here, for example, are all the deploys for a particular service last Friday:

[Screenshot: deploys to one service over the course of a Friday, broken down by build id]

Each shaded area is the duration of an individual deploy: you can see the counters for each build id, as the new versions replace the old ones.
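To make that concrete, here is a minimal sketch of the kind of instrumentation I mean, in Go (both companies above are golang shops).  This is not the Honeycomb SDK or anyone’s real code, just wide structured events emitted as JSON lines, with build_id stamped on every one so you can break down by deploy; all the field names are illustrative assumptions.

package main

import (
    "encoding/json"
    "net/http"
    "os"
    "time"
)

// buildID gets stamped into the binary at build time, e.g.:
//   go build -ldflags "-X main.buildID=$(git rev-parse --short HEAD)"
var buildID = "dev"

// instrument wraps a handler and emits one wide, structured event per
// request: build id, route, duration, plus whatever high-cardinality
// fields you care about (user id, shard, query shape, and so on).
func instrument(route string, next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        next(w, r)
        event := map[string]interface{}{
            "timestamp":   start.Format(time.RFC3339Nano),
            "build_id":    buildID, // the field that lets you slice errors by deploy
            "route":       route,
            "method":      r.Method,
            "user_id":     r.Header.Get("X-User-ID"), // illustrative high-cardinality field
            "duration_ms": time.Since(start).Milliseconds(),
        }
        json.NewEncoder(os.Stdout).Encode(event) // ship these to your event store
    }
}

func main() {
    http.HandleFunc("/export", instrument("/export", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("ok"))
    }))
    http.ListenAndServe(":8080", nil)
}

Once every event carries build_id plus the interesting dimensions, “which deploy broke this, and for whom” becomes a query instead of an archaeology expedition.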

2. Observability-driven development.

This is cultural as well as technical.  By this I mean instrumenting a couple steps ahead of yourself as you are developing and shipping code.  I mean making a cultural practice of asking each other “how will you know if this is broken?” during code review.  I mean always going and looking at your service through the lens of your instrumentation after every diff you ship.  Like muscle memory.
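Here is one hedged example of what “a couple steps ahead of yourself” can mean.  Say you are introducing a new lookup path alongside the old one; before the diff even merges, you answer “how will I know if this is broken?” by instrumenting both paths so you can compare them, by build id, the moment the deploy rolls out.  The cache example and field names below are mine, not a prescription.

package main

import (
    "errors"
    "fmt"
)

type User struct{ ID, Name string }

// The existing, trusted lookup path.
func lookupUserFromDB(id string) (User, error) {
    return User{ID: id, Name: "from-db"}, nil
}

// The new path being introduced; instrumented before anyone trusts it.
func lookupUserFromCache(id string) (User, error) {
    return User{}, errors.New("cache miss") // placeholder behavior
}

// lookupUser answers "how will I know if this is broken?" up front: it
// records whether the new path errored and whether the two paths agreed,
// on the same wide event that already carries build_id.
func lookupUser(ev map[string]interface{}, id string) (User, error) {
    oldResult, oldErr := lookupUserFromDB(id)

    newResult, newErr := lookupUserFromCache(id)
    ev["lookup.new_path_error"] = newErr != nil
    ev["lookup.results_match"] = newErr == nil && oldErr == nil && oldResult == newResult

    // Keep serving from the old path until the instrumentation says the
    // new one is trustworthy; flipping over is a separate, later change.
    return oldResult, oldErr
}

func main() {
    ev := map[string]interface{}{"build_id": "abc123"}
    user, err := lookupUser(ev, "42")
    fmt.Println(user, err, ev)
}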

3.  Single merge per deploy.

The number one thing you can do to make your deploys intelligible, other than observability and instrumentation, is this: deploy one changeset at a time, as swiftly as possible after it is merged to master.  NEVER glom multiple changesets into a single deploy — that’s how you get into a state where you aren’t sure which change is at fault, or who to escalate to, or if it’s an intersection of multiple changes, or if you should just start bisecting blindly to try and isolate the source of the problem.  THIS is what turns deploys into long, painful marathons.

(Image: headlamps, illuminating whatever’s in front of my face: this is the image in my mind when I think about instrumenting my code.)

And NEVER wait hours or days to deploy after the change is merged.  As a developer, you know full well how this goes.  After you merge to master one of two things will happen.  Either:

  • you promptly pull up a window to watch your changes roll out, checking on your instrumentation to see if it’s doing what you intended it to or if anything looks weird, OR
  • you close the project and open a new one.

When you switch to a new project, your brain starts rapidly evicting all the rich context about what you had intended to do, and overwriting it with all the new details about the new project.

Whereas if you shipped that changeset right after merging, then you can WATCH it roll out.  And 80-90% of all problems can be, and should be, caught right here, before your users ever notice —  before alerts can fire off and page you.  If you have the ability to break down by build id, you can zoom in on any errors that happen to arise, see exactly which dimensions all the errors have in common and how they differ from the healthy requests, and see exactly what the context is for any erroring requests.

Healthy feedback loops == healthy systems.

That tight, short feedback loop of build/ship/observe is the beating heart of a healthy, observable distributed system that can be run and maintained by human beings, without it sucking your life force or ruining your sleep schedule or will to live.

Most engineers have never worked on a system like this.  Most engineers have no idea what a yawning chasm exists between a healthy, tractable system and where they are now.  Most engineers have no idea what a difference observability can make.  Most engineers are far more familiar with spending 40-50% of their week fumbling around in the dark, trying to figure out where in the system the problem they are trying to fix actually lives, and what kind of context they need to reproduce it.

Most engineers are dealing with systems where they blindly shipped bugs with no observability, and reports about those bugs started to trickle in over the next hours, days, weeks, months, or years.  Most engineers are dealing with systems that are obfuscated and obscure, systems which are tangled heaps of bugs and poorly understood behavior for years compounding upon years on end.

That’s why it doesn’t seem like such a big deal to you to break up that tight, short feedback loop.  That’s why it doesn’t fill you with horror to think of merging on Friday morning and deploying on Monday.  That’s why it doesn’t appall you to clump together all the changes that happen to get merged between Friday and Monday and push them out in a single deploy.

It just doesn’t seem that much worse than what you normally deal with.  You think this raging trash fire is, unfortunately … normal.

How realistic is this, though, really?

Maybe you’re rolling your eyes at me now.  “Sure, Charity, that’s nice for you, on your brand new shiny system.  Ours has years of technical debt.  It’s unrealistic to hold us to the same standard.”

Yeah, I know.  It is much harder to dig yourself out of a hole than it is to not create a hole in the first place.  No doubt about that.

Harder, yes.  But not impossible.

I have done it.

Parse in 2013 was a trash fire.  It woke us up every night, we spent a lot of time stabbing around in the dark after every deploy.  But after we got acquired by Facebook, after we started shipping some data sets into Scuba, after (in retrospect, I can say) we had event-level observability for our systems, we were able to start paying down that debt and fixing our deploy systems.

We started hooking up that virtuous feedback loop, step by step.

  1. We reworked our CI/CD system so that it built a new artifact after every single merge.
  2. We put developers at the steering wheel so they could push their own changes out.
  3. We got better at instrumentation, and we made a habit of going to look at it during or after each deploy.
  4. We hooked up the pager so it would alert the person who merged the last diff, if an alert was generated within an hour after that service was deployed.
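That last hookup is worth spelling out, because the routing rule is so cheap relative to how much pain it saves.  Roughly, it looks like this (a sketch, not the actual system we built at Parse): if an alert fires within an hour of a deploy, page whoever merged the diff that triggered it; otherwise page on-call.

package main

import (
    "fmt"
    "time"
)

// Deploy records who triggered a deploy of a service and when it finished.
type Deploy struct {
    Service    string
    MergedBy   string // author of the merge that triggered this deploy
    FinishedAt time.Time
}

// routeAlert decides who gets paged: the person who merged the last diff,
// if the alert fired within an hour of their deploy, otherwise the on-call.
func routeAlert(service string, firedAt time.Time, lastDeploys map[string]Deploy, onCall string) string {
    d, ok := lastDeploys[service]
    if ok && firedAt.After(d.FinishedAt) && firedAt.Sub(d.FinishedAt) <= time.Hour {
        return d.MergedBy
    }
    return onCall
}

func main() {
    deploys := map[string]Deploy{
        "api": {Service: "api", MergedBy: "sam", FinishedAt: time.Now().Add(-20 * time.Minute)},
    }
    fmt.Println(routeAlert("api", time.Now(), deploys, "oncall-engineer")) // sam
    fmt.Println(routeAlert("db", time.Now(), deploys, "oncall-engineer"))  // oncall-engineer
}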

We started finding bugs sooner and fixing them faster, and paying down the tech debt we had amassed from shipping code without observability/visibility for many years.

Developers got in the habit of shipping their own changes, and watching them as they rolled out, and finding/fixing their bugs immediately.

It took some time.  But after a year of this, our formerly flaky, obscure, mysterious, massively multi-tenant service that was going down every day and wreaking havoc on our sleep schedules was tamed.  Deploys were swift and drama-free.  We stopped blocking deploys on Fridays, holidays, or any other days, because we realized our systems were more stable when we always shipped consistently and quickly.  

Allow me to repeat.  Our systems were more stable when we always shipped right after the changes were merged.  Our systems were less stable when we carved out times to pause deployments.  This was not common wisdom at the time, so it surprised me; yet I found it to be true over and over and over again.

This is literally why I started Honeycomb.

When I was leaving Facebook, I suddenly realized that this meant going back to the Dark Ages in terms of tooling.  I had become so accustomed to having the Parse+Scuba tooling and being able to iteratively explore and ask any question without having to predict it in advance.  I couldn’t fathom giving it up.

The idea of going back to a world without observability, a world where one deployed and then stared anxiously at dashboards — it was unthinkable.  It was like I was being asked to give up my five senses for production — like I was going to be blind, deaf, dumb, without taste or touch.

Look, I agree with nearly everything in the author’s piece.  I could have written that piece myself five years ago.

But since then, I’ve learned that systems can be better.  They MUST be better.  Our systems are getting so rapidly more complex, they are outstripping our ability to understand and manage them using the past generation of tools.  If we don’t change our ways, it will chew up another generation of engineering lives, sleep schedules, relationships.

Observability isn’t the whole story.  But it’s certainly where it starts.  If you can’t see where you’re going, you can’t go very far.

Get you some observability.

And then raise your standards for how systems should feel, and how much of your human life they should consume.  Do better. 

Because I couldn’t agree with that other post more: it really is all about people and their real lives.

Listen, if you can swing a four day work week, more power to you (most of us can’t).  Any day you aren’t merging code to master, you have no need to deploy either.  It’s not about Fridays; it’s about the swift, virtuous feedback loop.

And nobody should be shamed for what they need to do to survive, given the state of their systems today.

But things aren’t gonna get better unless you see clearly how you are contributing to your present pain.  And congratulating ourselves for blocking Friday deploys is like congratulating ourselves for swatting ourselves in the face with the flyswatter.  It’s a gross hack.

Maybe you had a good reason.  Sure.  But I’m telling you, if you truly do care about people and their work/life balance: we can do a lot better.

charity.



Love (and Alerting) in the Time of Cholera (and Observability)

I made a vow this year to post one blog post a month, then I didn’t post anything at all from May to September.  I have some catching up to do.  😑   I’ve also been meaning to transcribe some of the twitter rants that I end up linking back to into blog posts, so if there’s anything you especially want me to write about, tell me now while I’m in repentance mode.

This is one request I happened to make a note of because I can’t believe I haven’t already written it up!  I’ve been saying the same thing over and over in talks and on twitter for years, but apparently never a blog post.

The question is: what is the proper role of alerting in the modern era of distributed systems?  Has it changed?  What are the updated best practices for alerting?

It’s a great question.  I want to wax philosophical about some stuff, but first let me briefly outline the way to modernize your alerting best practices:

  1. implement observability
  2. implement SLOs and/or end-to-end checks that traverse key code paths and correlate to user-impacting events
  3. create a secondary channel (tasks, ticketing system, whatever) for “things that on call should look at soon, but are not impacting users yet” which does not page anyone, but which on call is expected to look at (at least) first thing in the morning, last thing in the evening, and midday
  4. move as many paging alerts as possible to the secondary channel, by engineering your services to auto-remediate or run in degraded mode until they can be patched up
  5. wake people up only for SLOs and health checks that correlate to user-impacting events

Or, in an even shorter formulation: delete all your paging alerts, then page only on e2e alerts that mean users are in pain.  Rely on debugging tools for debugging, and reserve paging for actual user pain.
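For the “e2e alerts that mean users are in pain” part, here is a rough sketch of what such a check can look like: it exercises a real user-facing code path (write something through the public API, then read it back), and only a failure here is page-worthy.  The endpoint and payload are placeholders, not anyone’s real API.

package main

import (
    "fmt"
    "net/http"
    "strings"
    "time"
)

// e2eCheck exercises a real user-facing path: write a record through the
// public API, then read it back. Page only when this fails, because that
// is what user pain looks like; internal replication lag does not count.
func e2eCheck(baseURL string) error {
    client := &http.Client{Timeout: 5 * time.Second}

    // Write through the same path real users hit. (Placeholder endpoint.)
    resp, err := client.Post(baseURL+"/v1/records", "application/json",
        strings.NewReader(`{"canary": true}`))
    if err != nil {
        return fmt.Errorf("write failed: %w", err)
    }
    resp.Body.Close()
    if resp.StatusCode >= 400 {
        return fmt.Errorf("write returned %d", resp.StatusCode)
    }

    // Read it back through the critical read path.
    resp, err = client.Get(baseURL + "/v1/records/canary")
    if err != nil {
        return fmt.Errorf("read failed: %w", err)
    }
    resp.Body.Close()
    if resp.StatusCode >= 400 {
        return fmt.Errorf("read returned %d", resp.StatusCode)
    }
    return nil // users can write and read; don't wake anyone up
}

func main() {
    if err := e2eCheck("https://api.example.com"); err != nil {
        fmt.Println("PAGE:", err) // hand this off to your paging pipeline
    }
}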

To understand why I advocate deleting all your paging alerts, and when it’s safe to delete them, first we need to understand why we have accumulated so many crappy paging alerts over the years.

Monoliths, LAMP stacks, and death by pagebomb

Here, let’s crib a couple of slides from one of my talks on observability.  Here are the characteristics of older monolithic LAMP-stack style systems, and best practices for running them:

 

The sad truth is that when all you have is time series aggregates and traditional monitoring dashboards, you aren’t really debugging with science so much as you are relying on your gut and a handful of dashboards, using intuition and scraps of data to try and reconstruct an impossibly complex system state.

This works ok, as long as you have a relatively limited set of failure scenarios that happen over and over again.  You can just pattern match from past failures to current data, and most of the time your intuition can bridge the gap correctly.  Every time there’s an outage, you post mortem the incident, figure out what happened, build a dashboard “to help us find the problem immediately next time”, create a detailed runbook for how to respond to it, and (often) configure a paging alert to detect that scenario.

Over time you build up a rich library of these responses.  So most of the time when you get paged you get a cluster of pages that actually serves to help you debug what’s happening.  For example, at Parse, if the error graph had a particular shape I immediately knew it was a redis outage.  Or, if I got paged about a high % of app servers all timing out in a short period of time, I could be almost certain the problem was due to mysql connections.  And so forth.

Things fall apart; the pagebomb cannot stand

However, this model falls apart fast with distributed systems.  There are just too many failures.  Failure is constant, continuous, eternal.  Failure stops being interesting.  It has to stop being interesting, or you will die.

 

 

 

Instead of a limited set of recurring error conditions, you have an infinitely long list of things that almost never happen … except that one time they do.  If you invest your time into runbooks and monitoring checks, it’s wasted time if that edge case never happens again.

Frankly, any time you get paged about a distributed system, it should be a genuinely new failure that requires your full creative attention.  You shouldn’t just be checking your phone, going “oh THAT again”, and flipping through a runbook.  Every time you get paged it should be genuinely new and interesting.

And thus you should actually have drastically fewer paging alerts than you used to.

A better way: observability and SLOs.

Instead of paging alerts for every specific failure scenario, the technically correct answer is to define your SLOs (service level objectives) and page only on those, i.e. when you are going to run out of budget ahead of schedule.  But most people aren’t yet operating at this level of sophistication.  (SLOs sound easy, but are unbelievably challenging to do well; many great teams have tried and failed.  This is why we have built an SLO feature into Honeycomb that does the heavy lifting for you.  Currently alpha testing with users.)
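For the curious, “going to run out of budget ahead of schedule” is just error-budget arithmetic.  Here is a bare-bones burn-rate sketch; it is the generic math, not Honeycomb’s SLO feature, and the window and thresholds are assumptions you would tune per service.

package main

import "fmt"

// shouldPage implements the basic error-budget burn-rate idea: if errors in
// a recent window are being spent N times faster than the SLO allows, the
// budget will run out ahead of schedule, so a human gets paged.
//
// slo is the target success ratio (e.g. 0.999), so the allowed error rate
// is (1 - slo). burnRateThreshold is how many times faster than "exactly
// on budget" we tolerate before paging (e.g. 14.4 burns a 30-day budget
// in about 2 days).
func shouldPage(windowTotal, windowErrors int, slo, burnRateThreshold float64) bool {
    if windowTotal == 0 {
        return false
    }
    observedErrorRate := float64(windowErrors) / float64(windowTotal)
    allowedErrorRate := 1 - slo
    burnRate := observedErrorRate / allowedErrorRate
    return burnRate >= burnRateThreshold
}

func main() {
    // 99.9% SLO, 1-hour window with 2M requests and 5,000 errors:
    // observed error rate 0.25%, allowed 0.1%, burn rate 2.5x.
    fmt.Println(shouldPage(2_000_000, 5_000, 0.999, 2.0)) // true: page
    fmt.Println(shouldPage(2_000_000, 1_000, 0.999, 2.0)) // false: budget is fine
}

Everything below the paging threshold goes to the secondary, non-paging channel from the list above.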

If you haven’t yet caught the SLO religion, the alternate answer is that “you should only page on high level end-to-end alerts, the ones which traverse the code paths that make you money and correspond to user pain”.  Alert on the three golden signals: request rate, latency, and errors, and make sure to traverse every shard and/or storage type in your critical path.

That’s it.  Don’t alert on the state of individual storage instances, or replication, or anything that isn’t user-visible.

(To be clear: by “alert” I mean “paging humans at any time of day or night”.  You might reasonably choose to page people during normal work hours, but during sleepy hours most errors should be routed to a non-paging address.  Only wake people up for actual user-visible problems.)

Here’s the thing.  The reason we had all those paging alerts was because we depended on them to understand our systems.

Once you make the shift to observability, once you have rich instrumentation and the ability to swiftly zoom in from high level “there might be a problem” to identifying specifically what the errors have in common, or the source of the problem — you no longer need to lean on that scattershot bunch of pagebombs to understand your systems.  You should be able to confidently ask any question of your systems, understand any system state — even if you have never encountered it before.

With observability, you debug by systematically following the trail of crumbs back to their source, whatever that is.  Those paging alerts were a crutch, and now you don’t need them anymore.

Everyone is on call && on call doesn’t suck.

I often talk about how modern systems require software ownership.  The person who is writing the software, who has the original intent in their head, needs to shepherd that code out into production and watch real users use it.  You can’t chop that up into multiple roles, dev and ops.  You just can’t.  Software engineers working on highly available systems need to be on call for their code.

But the flip side of this responsibility belongs to management.  If you’re asking everyone to be on call, it is your sworn duty to make sure that on call does not suck.  People shouldn’t have to plan their lives around being on call.  People shouldn’t have to expect to be woken up on a regular basis.  Every paging alert out of hours should be as serious as a heart attack, and this means allocating real engineering resources to keeping tech debt down and noise levels low.

And the way you get there is first invest in observability, then delete all your paging alerts and start over from scratch.

It works.  It really does. 🌈

 

 


Friday Deploy Freezes Are Exactly Like Murdering Puppies

VOICEOVER: “Previously, on twitter …”

So, that happened.

I hadn’t seen anyone say something like this in quite a while.  I remember saying things like this myself as recently as, oh, 2016, but I thought the zeitgeist had moved on to continuous delivery.

Which is not to say that Friday freezes don’t happen anymore, or even that they shouldn’t; I just thought that this was no longer seen as a badge of responsibility and honor, but rather as a source of mild embarrassment.  (Much like the fact that you still don’t automatically restore your db backups and verify them every night.  Do you.)

So I responded with an equally hyperbolic and indefensible claim:

Now obviously, OBVIOUSLY, reassigning all your developer cycles is probably a terrible idea.  You don’t get 100x parallel efficiency if you put 100 developers on a single problem.  So I thought it was clear that this was said somewhat tongue in cheek, serious-but-not-really.  I was wrong there too.

So let me explain.

There’s nothing morally “wrong” with Friday freezes.  But it is a costly and cumbersome bandage for a problem that you would be better served to address directly.  And if your stated goal is to protect people’s off hours, this strategy is likely to sabotage that goal and cause them to waste far more time and get woken up much more often, and it stunts your engineers’ technical development on top of that.

Fear is the mind-killer.

Fear of deploys is the ultimate technical debt.  How much time does your company waste, between engineers:

  • waiting until it is “safe” to deploy,
  • batching up changes into bigger changes that are decidedly unsafe to deploy,
  • debugging broken deploys that had many changes batched into them,
  • waiting nervously to get paged after a deploy goes out,
  • figuring out if now is a good time to deploy or not,
  • cleaning up terrible deploy-related catastrophuckes

Anxiety related to deploys is the single largest source of technical debt in many, many orgs.  Technical debt, lest we forget, is not the same as “bad code”.  Tech debt hurts your people.

Saying “don’t push to production” is a code smell.  Hearing it once a month at unpredictable intervals is concerning.  Hearing it EVERY WEEK for an ENTIRE DAY OF THE WEEK should be a heartstopper alarm.  If you’ve been living under this policy you may be numb to its horror, but just because you’re used to hearing it doesn’t make it any less noxious.

If you’re used to hearing it and saying it on a weekly basis, you are afraid of your deploys and you should fix that.

If you are a software company, shipping code is your heartbeat.  Shipping code should be as reliable and sturdy and fast and unremarkable as possible, because this is the drumbeat by which value gets delivered to your org.

Deploys are the heartbeat of your company.

Every time your production pipeline stops, it is a heart attack.  It should not be ok to go around nonchalantly telling people to halt the lifeblood of their systems based on something as pedestrian as the day of the week.

Why are you afraid to push to prod?  Usually it boils down to one or more factors:

  • your deploys frequently break, and require manual intervention just to get to a good state
  • your test coverage is not good, your monitoring checks are not good, so you rely on users to report problems back to you and this trickles in over days
  • recovering from deploys gone bad can regularly cause everything to grind to a halt for hours or days while you recover, so you don’t want to even embark on a deploy without 24 hours of working time ahead of you
  • your deploys are painfully slow, and take hours to run tests and go live.

These are pretty darn good reasons.  If this is the state you are in, I totally get why you don’t want to deploy on Fridays.  So what are you doing to actively fix those states?  How long do you think these emergency controls will be in effect?

The answers of “nothing” and “forever” are unacceptable.  These are eminently fixable problems, and the amount of drag they create on your engineering team and its ability to execute is the equivalent of a five-alarm fire.

Fix. That.  Take some cycles off product and fix your fucking deploy pipeline.

If you’ve been paying attention to the DORA report or Accelerate, you know that the way you address the problem of flaky deploys is NOT by slowing down or adding roadblocks and friction, but by shipping more QUICKLY.

Science says: ship fast, ship often.

Deploy on every commit.  Smaller, coherent changesets transform into debuggable, understandable deploys.  If we’ve learned anything from recent research, it’s that velocity of deploys and lowered error rates are not in tension with each other, they actually reinforce each other.  When one gets better, the other does too.

So by slowing down or batching up or pausing your deploys, you are materially contributing to the worsening of your own overall state.

If you block devs from merging on Fridays, then you are sacrificing a fifth of your velocity and overall output.  That’s a lot of fucking output.

If you do not block merges on Fridays, and only block deploys, you are queueing up a bunch of changes to all get shipped days later, long after the engineers wrote the code and have forgotten half of the context.  Any problems you encounter will be MUCH harder to debug on Monday in a muddled blob of changes than they would have been just shipping crisply, one at a time on Friday.  Is it worth sacrificing your entire Monday?  Monday-Tuesday?  Monday-Tuesday-Wednesday?

Good judgment matters more than rules.

I am not saying that you should make a habit of shipping a large feature at 4:55 pm on Friday and then sauntering out the door at 5.  For fucks sake.  Every engineer needs to learn and practice good technical judgment around deploy hygiene.  Like:

  • Don’t ship before you walk out the door on *any* day.
  • Don’t ship big, gnarly features right before the weekend, if you aren’t going to be around to watch them.
  • Instrument your code, and go and LOOK at the damn thing once it’s live.
  • Use feature flags and other tools that separate turning on code paths from deploys.

But you don’t need rules for this; in fact, rules actually inhibit the development of good judgment!

Most deploy-related problems are readily obvious, if the person who has the context for the change in their heads goes and looks at it.

But if you aren’t looking for them, then sure — you probably won’t find out until user reports start to trickle in over the next few days.

So go and LOOK.

Stop shipping blind.  Actually LOOK at what you ship.

I mean, if it takes 48 hours for a bug to show up, then maybe you better freeze deploys on Thursdays too, just to be safe!  🙄

I get why this seems obvious and tempting.  The “safety” of nodeploy Friday is realized immediately, while the costs are felt later.  They’re felt when you lose Monday (and Tuesday) to debugging the big blob deploy.  Or they get amortized out over time.  Or you experience them as sluggish ship rates and a general culture of fear and avoidance, or learned helplessness, and the broad acceptance of fucked up situations as “normal”.

But if recovering from deploys is long and painful and hard, then you should fix that.  If you don’t tend to detect reliability events until long after the event, you should fix that.  If people are regularly getting paged on Saturdays and Sundays, they are probably getting paged throughout the night, too.  You should fix that.

On call paging events should be extremely rare.  There’s no excuse for on call being something that significantly impacts a person’s life on the regular.  None.

I’m not saying that every place is perfect, or that every company can run like a tech startup.  I am saying that deploy tooling is systematically underinvested in, and we abuse people far too much by paging them incessantly and running them ragged, because we don’t actually believe it can be any better.

It can.  If you work towards it.

Devote some real engineering hours to your deploy pipeline, and some real creativity to your processes, and someday you too can lift the Friday ban on deploys and relieve your oncall from burnout and increase your overall velocity and productivity.

On virtue signaling

Finally, I heard from an alarming number of people who admitted that Friday deploy bans were useless or counterproductive, but they supported them anyway as a purely symbolic gesture to show that they supported work/life balance.

This makes me really sad.  I’m … glad they want to support work/life balance, but surely we can come up with some other gestures that don’t work directly counter to their goals of work/life balance.

Recovery: building a healthy deploy culture

Ways to begin recovering from a toxic deploy culture:

  • Have a deploy philosophy, make sure everybody knows what it is.  Be consistent.
  • Build and deploy on every set of committed changes.  Do not batch up multiple people’s commits into a deploy.
  • Train every engineer so they can run their own deploys, if they aren’t fully automated.  Make every engineer responsible for their own deploys.
  • (Work towards fully automated deploys.)
  • Every deploy should be owned by the developer who made the changes that are rolling out.  Page the person who committed the change that triggered the deploy, not whoever is oncall.
  • Set expectations around what “ownership” means.  Provide observability tooling so they can break down by build id and compare the last known stable deploy with the one rolling out.
  • Never accept a diff if there’s no explanation for the question, “how will you know when this code breaks?  how will you know if the deploy is not behaving as planned?”  Instrument every commit so you can answer this question in production.
  • Shipping software and running tests should be fast.  Super fast.  Minutes, tops.
  • It should be muscle memory for every developer to check up on their deploy and see if it is behaving as expected, and if anything else looks “weird”.
  • Practice good deploy hygiene using feature flags.  Decouple deploys from feature releases.  Empower support and other teams to flip flags without involving engineers.
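To illustrate that last point, here is a tiny sketch of a flag check that decouples deploys from releases: the code path ships dark, and turning it on is a flag flip that anyone with access can do, no deploy required.  The in-memory flag store and the flag name are stand-ins for whatever flag service you actually use.

package main

import (
    "fmt"
    "sync"
)

// Flags is a deliberately tiny flag store: deploys ship the code path,
// flipping the flag releases it. In production this would be backed by
// your flag service of choice, not an in-memory map.
type Flags struct {
    mu    sync.RWMutex
    flags map[string]bool
}

func NewFlags() *Flags { return &Flags{flags: map[string]bool{}} }

func (f *Flags) Enabled(name string) bool {
    f.mu.RLock()
    defer f.mu.RUnlock()
    return f.flags[name]
}

// Set is what support (or anyone else with access) calls to turn a code
// path on or off without involving an engineer or a deploy.
func (f *Flags) Set(name string, on bool) {
    f.mu.Lock()
    defer f.mu.Unlock()
    f.flags[name] = on
}

func handleExport(f *Flags) string {
    if f.Enabled("new-export-pipeline") { // shipped dark on Friday, flipped on whenever
        return "new export pipeline"
    }
    return "old export pipeline"
}

func main() {
    flags := NewFlags()
    fmt.Println(handleExport(flags)) // old: the deploy changed nothing user-visible
    flags.Set("new-export-pipeline", true)
    fmt.Println(handleExport(flags)) // new: the release was a flag flip, not a deploy
}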

Each deploy should be owned by the developer who made the code changes.  But your deploy pipeline needs to have a team that owns it too.  I recommend putting your most experienced, senior developers on this problem to signal its high value.

You can find more tips for boring deploys in my piece on why shipping software should not be scary.

Good teams ship often.

Ultimately, I am not dogmatic about Friday deploys.  Truly, I’m not.  If that’s the only lever you have to protect your time, use it.  But call it and treat it like the hack it is.  It’s a gross workaround, not an ideal state.

Don’t let your people settle into the idea that it’s some kind of moral stance instead of a butt-ugly hack.  Because if you do you will never, ever get rid of it.

Remember: a team’s maturity and efficiency can be represented by how long it takes to get their shit into users’ hands after they write it.  Ship it fast, while it’s still fresh in your developers’ heads.  Ship one change set at a time, so you can swiftly debug and revert them.  I promise your lives will be so much better.  Every step helps.  <3

charity.



Outsource Your O11y: Now Roll It Out And Keep Them Happy (part 3/3)

This is part three of a three-part series of guest posts:

  1. How To Be A Champion, on how to choose a third-party vendor and champion them successfully to your security team.  (George Chamales)
  2. Get Aligned With Security, how to work with your security team to find the best possible outcome for all sides (Lilly Ryan)
  3. Now Roll It Out And Keep Them Happy, on how to operationalize your service by rolling out the integration and maintaining it — and the relationship with your security team — over the long run (Andy Isaacson)

All this pain will someday be worth it.  🙏❤️  charity + friends


“Now Roll It Out And Keep Them Happy”

This is the third in a series of blog posts; previously we analyzed the security challenges of using a third party service, and we worked together with the security team to build empathy to deliver the project.  You might want to read those first, since we are going to build on a lot of the ideas there to ship and maintain this integration.

Ready for launch

You’ve convinced the security team and other stakeholders, you’ve gotten the integration running, you’re getting promising results from dev-test or staging environments… now it’s time to move from proof-of-concept to full implementation.  Depending on your situation this might be a transition from staging to production, or it might mean increasing a feature flipper flag from 5% to 100%, or it might mean increasing coverage of an integration from one API endpoint to cover your entire developer footprint.
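For the percentage-flag style of rollout, the usual trick is to hash a stable identifier into a bucket, so the same user stays enabled as you ramp from 5% to 100%.  A rough sketch of that, with the identifier and bucket count as assumptions:

package main

import (
    "fmt"
    "hash/fnv"
)

// inRollout returns true if this identifier falls inside the current
// rollout percentage. Hashing a stable id (user, team, endpoint) means the
// same caller gets a consistent answer as the percentage ramps up.
func inRollout(id string, percent uint32) bool {
    h := fnv.New32a()
    h.Write([]byte(id))
    return h.Sum32()%100 < percent
}

func main() {
    for _, pct := range []uint32{5, 50, 100} {
        on := 0
        for i := 0; i < 1000; i++ {
            if inRollout(fmt.Sprintf("user-%d", i), pct) {
                on++
            }
        }
        fmt.Printf("%d%% rollout: %d of 1000 ids enabled\n", pct, on)
    }
}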

Taking into account Murphy’s Law, we expect that some things will go wrong during the rollout.  Perhaps as coverage expands, a developer realizes that the schema designed to handle the app’s event mechanism can’t represent a scenario, requiring a redesign or a hacky solution.  Or perhaps the metrics dashboard shows elevated error rates from the API frontend, and while there’s no smoking gun, the ops oncall decides to roll back the integration Just In Case it’s causing the incident.

This gives us another chance to practice empathy — while it’s easy, wearing the champion hat, to dismiss any issues found by looking for someone to blame, ultimately this poisons trust within your organization and will hamper success.  It’s more effective, in the long run (and often even in the short run), to find common ground with your peers in other disciplines and teams, and work through to solutions that satisfy everybody.

Keeping the lights on

In all likelihood as integration succeeds, the team will rapidly develop experts and expertise, as well as idiomatic ways to use the product.  Let the experts surprise you; folks you might not expect can step up when given a chance.  Expertise flourishes when given guidance and goals; as the team becomes comfortable with the integration, explicitly recognize a leader or point person for each vendor relationship.  Having one person explicitly responsible for a relationship lets them pay attention to those vendor emails, updates, and avoid the tragedy of the “but I thought *you* were” commons.  This Integration Lead is also a center of knowledge transfer for your organization — they won’t know everything or help every user come up to speed, but they can help empower the local power users in each team to ramp up their teams on the integration.

As comfort grows you will start to consider ways to change your usage, for example growing into new kinds of data.  This is a good time to revisit that security checklist — does the change increase PII exposure to your vendor?  Would the new data lead to additional requirements such as per-field encryption?  Don’t let these security concerns block you from gaining valuable insight using the new tool, but do take the chance to talk it over with your security experts as appropriate.

Throughout this organic growth, the Integration Lead remains core to managing your changing profile of usage of the vendor they shepherd; as new categories of data are added to the integration, the Lead has responsibility to ensure that the vendor relationship and risk profile are well matched to the needs that the new usage (and presumably, business value) is placing on the relationship.

Documenting the Integration Lead role and responsibilities is critical.  The team should know when to check in, and writing it down helps it happen.  When new code has a security implication, or a new use case potentially amplifies the cost of an integration, bringing the domain expert in will avoid unhappy surprises.  Knowing how to find out who to bring in, and when to bring them in, will keep your team getting the right eyes on their changes.

Security threats and other challenges change over time, too.  Collaborating with your security team so that they know what systems are in use helps your team take note of new information that is relevant to your business. A simple example is noting when your vendors publish a breach announcement, but more complex examples happen too — your vendor transitions cloud providers from AWS to Azure and the security team gets an alert about unexpected data flows from your production cluster; with transparency and trust such events become part of a routine process rather than an emergency.

It’s all operational

Monitoring and alerting is a fact of operations life, and this has to include vendor integrations (even when the vendor integration is a monitoring product).  All of your operations best practices are needed here — keep your alerts clean and actionable so that you don’t develop pager fatigue, and monitor performance of the integration so that you don’t get blindsided by a creeping latency monster in your APIs.

Authentication and authorization are changing as the threat landscape evolves and industry moves from SMS verification codes to U2F/WebAuthn.  Does your vendor support your SSO integration?  If they can’t support the same SSO that you use everywhere else and can’t add it — or worse, look confused when you mention SSO — that’s probably a sign you should consider a different vendor.

A beautiful sunset

Have a plan beforehand for what needs to be done should you stop using the service.  Got any mobile apps that depend on APIs that will go away or start returning permission errors?  Be sure to test these scenarios ahead of time.

What happens at contract termination to data stored on the service?  Do you need to explicitly delete data when ceasing use?

Do you need to remove integrations from your systems before ending the commercial relationship, or can the technical shutdown and business shutdown run in parallel?

In all likelihood these are contingency plans that will never be needed, and they don’t need to be fully fleshed out to start, but a little bit of forethought can avoid unpleasant surprises.

Year after year

Industry best practice and common sense dictate that you should revisit the security questionnaire annually (if not more frequently). Use this chance to take stock of the last year and check in — are you getting value from the service?  What has changed in your business needs and the competitive landscape? 

It’s entirely possible that a new year brings new challenges, which could make your current vendor even more valuable (time to negotiate a better contract rate!) or could mean you’d do better with a competing service.  Has the vendor gone through any major changes?  They might have new offerings that suit your needs well, or they may have pivoted away from the features you need. 

Check in with your friends on the security team as well; standards evolve, and last year’s sufficient solution might not be good enough for new requirements.

 

Andy thinks out loud about security, society, and the problems with computers on Twitter.


 

❤️ Thanks so much for reading, folks.  Please feel free to drop any complaints, comments, or additional tips to us in the comments, or direct them to me on twitter.

Have fun!  Stay (a little bit) Paranoid!!

— charity


Outsource Your O11y: Get Aligned With Security (part 2/3)

This is part two of a three-part series of guest posts:

  1. How To Be A Champion, on how to choose a third-party vendor and champion them successfully to your security team.  (George Chamales)
  2. Get Aligned With Security, how to work with your security team to find the best possible outcome for all sides (Lilly Ryan)
  3. Now Roll It Out And Keep Them Happy, on how to operationalize your service by rolling out the integration and maintaining it — and the relationship with your security team — over the long run (Andy Isaacson)

All this pain will someday be worth it.  🙏❤️  charity + friends


“Get Aligned With Security”

by Lilly Ryan

If your team has decided on a third-party service to help you gather data and debug product issues, how do you convince an often overeager internal security team to help you adopt it?

When this service is something that provides a pathway for developers to access production data, as analytics tools often do, making the case for access to that data can screech to a halt at the mention of the word “production”. Progressing past that point will take time, empathy, and consideration.

I have been on both sides of the “adopting a new service” fence: as a developer hoping to introduce something new and useful to our stack, and now as a security professional who spends her days trying to bust holes in other people’s setups. I understand the sometimes-conflicting needs to both ship software and keep systems safe.

This guide has advice to help you solve the immediate problem of choosing and deploying a third-party service with the approval of your security team.  But it also has advice for how to strengthen the working relationship between your security and development teams over the longer term. No two companies are the same, so please adapt these ideas to fit your circumstances.

Understanding the security mindset

The biggest problems in technology are never really about technology, but about people. Seeing your security team as people and understanding where they are coming from will help you to establish empathy with them so that both of you want to help each other get what you want, not block each other.

First, understand where your security team is coming from. Development teams need to build features, improve the product, understand and ship good code. Security teams need to make sure you don’t end up on the cover of the NYT for data breaches, that your business isn’t halted by ransomware, and that you’re not building your product on a vulnerable stack.

This can be an unfamiliar frame of mind for developers.  Software development tends to attract positive-minded people who love creating things and are excited about the possibilities of new technology. Software security tends to attract negative thinkers who are skilled at finding all the flaws in a system.  These are very different mentalities, and the people who occupy them tend to have very different assumptions, vocabularies, and worldviews.   

But if you and your security team can’t share the same worldview, it will be hard to trust each other and come to agreement.  This is where practicing empathy can be helpful.

Before approaching your security team with your request to approve a new vendor, you may want to run some practice exercises for putting yourselves in their shoes and forcing yourselves to deliberately cultivate a negative thinking mindset to experience how they may react — not just in terms of the objective risk to the business, or the compliance headaches it might cause, but also what arguments might resonate with them and what emotional reactions they might have.

My favourite exercise for getting teams to think negatively is what I call the Land Astronaut approach.

The “Land Astronaut” Game

Imagine you are an astronaut on the International Space Station. Literally everything you do in space has death as a highly possible outcome. So astronauts spend a lot of time analysing, re-enacting, and optimizing their reactions to events, until it becomes muscle memory. By expecting and training for failure, astronauts use negative thinking to anticipate and mitigate flaws before they happen. It makes their chances of survival greater and their people ready for any crisis.

Your project may not be as high-stakes as a space mission, and your feet will most likely remain on the ground for the duration of your work, but you can bet your security team is regularly indulging in worst-case astronaut-type thinking. You and your team should try it, too.

The Game:

Pick a service for you and your team to game out.  Schedule an hour, book a room with a whiteboard, put on your Land Astronaut helmets.  Then tell your team to spend half an hour brainstorming about all the terrible things that can happen to that service, or to the rest of your stack when that service is introduced.  Negative thoughts only!

Start brainstorming together. Begin by being as outlandish as possible (what happens if their data centre is suddenly overrun by a stampede of elephants?). Eventually you will tire of the extreme worst case scenarios and come to consider more realistic outcomes — some of which you may not have thought of outside of the structure of the activity.

After half an hour, or whenever you feel like you’re all done brainstorming, take off your Land Astronaut helmets, sift out the most plausible of the worst case scenarios, and try to come up with answers or strategies that will help you counteract them.  Which risks are plausible enough that you should mitigate them?  Which are you prepared to gamble on never happening?  How will this risk calculus change as your company grows and takes on more exposure?

Doing this with your team will allow you all to practice the negative thinking mindset together and get a feel for how your colleagues in the security team might approach this request. (While this may seem similar to threat modelling exercises you might have done in the past, the focus here is on learning to adopt a security mindset and gaining empathy for this thought process, rather than running through a technical checklist of common areas of concern.)

While you still have your helmets within reach, use your negative thinking mindset to fill out the spreadsheet from the first piece in this series.  This will help you anticipate most of the reasonable objections security might raise, and may help you include useful detail the security team might not have known to ask for.

Once you have prepared your list of answers to George’s worksheet and held a team Land Astronaut session together, you will have come most of the way to getting on board with the way your security team thinks.

Preparing for compromise

You’ve considered your options carefully, you’ve learned how to harness negative thinking to your advantage, and you’re ready to talk to your colleagues in security – but sometimes, even with all of these tools at your disposal, you may not walk away with all of the things you are hoping for.

Being willing to compromise and anticipating some of those compromises before you approach the security team will help you negotiate more successfully.

While your Land Astronaut helmets are still within reach, consider using your negative thinking mindset game to identify areas where you may be asked to compromise. If you’re asking for production access to this new service for observability and debugging purposes, think about what kinds of objections may be raised about this and how you might counter them or accommodate them. Consider continuing the activity with half of the team remaining in the Land Astronaut role while the other half advocates from a positive thinking standpoint. This dynamic will get you having conversations about compromise early on, so that when the security team inevitably raises eyebrows, you are ready with answers.

Be prepared to consider compromises you had not anticipated, and enter into discussions with the security team with as open a mind as possible. Remember that the security team is balancing the priorities of not only your team, but of other business and development teams as well.  If you and your security colleagues are doing the hard work to meet each other halfway, then you are more likely to arrive at a solution that satisfies both parties.

Working together for the long term

While the previous strategies we’ve covered focus on short-term outcomes, in this continuous-deployment, shift-left world we now live in, the best way to convince your security team of the benefits of a third-party service – or any other decision – is to have them along from day one, as part of the team.

Roles and teams are increasingly fluid and boundary-crossing, yet security remains one of the roles least likely to be considered for inclusion on a software development team. Even in 2019, the task of ensuring that your product and stack are secure and well-defended is often left until the end of the development cycle.  This contributes a great deal to the combative atmosphere that is common.

Bringing security people into the development process much earlier builds rapport and prevents these adversarial, territorial dynamics. Consider working together to build Disaster Recovery plans and coordinating for shared production ownership.

If your organisation isn’t ready for that kind of structural shift, there are other ways to work together more closely with your security colleagues.

Try having members of your team spend a week or two embedded with the security team. You may even consider a rolling exchange – a developer for a security team member – so that developers build the security mindset, and the security team is able to understand the problems your team is facing (and why you are looking at introducing this new service).

At the very least, you should make regular time to meet with the security team, get to know them as people, and avoid springing things on them late in the project when change is hardest.

Riding off together into the sunset…?

If you’ve taken the time to get to know your security team and how they think, you’ll hopefully be able to get what you want from them – or perhaps you’ll understand why their objections were valid, and come up with a better solution that works well for both of you.

Investing in a strong relationship between your development and security teams will rarely lead to the apocalypse. Instead, you’ll end up with a better product, probably some new work friends, and maybe an exciting idea for a boundary-crossing new career in tech.

But this story isn’t over! Once you get the green light from security, you’ll need to think about how to roll your new service out safely, maintain it, and consider its full lifespan within your company.  Which leads us to part three of this series, on rolling it out and maintaining it … both your integration and your relationship with the security team.


Lilly Ryan is a pen tester, Python wrangler, and recovering historian from Melbourne. She writes and speaks internationally about ethical software, social identities after death, teamwork, and the telegraph. More recently she has researched the domestic use of arsenic in Victorian England, attempted urban camouflage, reverse engineered APIs, wielded the Oxford comma, and baked a really good lemon shortbread.


Outsource Your O11y: How To Be A Champion (part 1/3)

I hear variations on this question constantly: “I’d really like to use a service like Honeycomb for my observability, but I’m told I can’t ship any data off site.  Do you have any advice on how to convince my security team to let me?”

I’ve given lots of answers, most of them unsatisfactory.  “Strip the PII/PHI from your operational data.”  “Validate server side.”  “Use our secure tenancy proxy.”  (I’m not bad at security from a technical perspective, but I am not fluent with the local lingo, and I’ve never actually worked with an in-house security team — I’ve always *been* the security team, de facto as it may be.)

So I’ve invited three experts to share their wisdom in a three-part series of guest posts:

  1. How To Be A Champion, on how to choose a third-party vendor and champion them successfully to your security team.  (George Chamales)
  2. Get Aligned With Security, on how to work with your security team to find the best possible outcome for all sides (Lilly Ryan)
  3. Now Roll It Out And Keep Them Happy, on how to operationalize your service by rolling out the integration and maintaining it — and the relationship with your security team — over the long run (Andy Isaacson)

My ✨first-ever guest posts✨!  Yippee.  I hope these are useful to you, wherever you are in the process of outsourcing your tools.  You are on the right path: outsourcing your observability to a vendor for whom it’s their One Job is almost always the right call, in terms of money and time and focus — and yes, even security. 

All this pain will someday be worth it.  🙏❤️  charity + friends


“How to be a Champion”

by George Chamales

You’ve found a third party service you want to bring into your company, hooray!

To you, it’s an opportunity to deploy new features in a flash, juice your team’s productivity, and save boatloads of money.

To your security and compliance teams, it’s a chance to lose your customers’ data, cause your applications to fall over, and do inordinate damage to your company’s reputation and bottom line.

The good news is, you’re absolutely right.  The bad news is, so are they.

Successfully championing a new service inside your organization will require you to convince people that the rewards of the new service are greater than the risks it will introduce (there’s a guide below to help you).  

You’re convinced the rewards are real. Let’s talk about the risks.

The past year has seen cases of hackers using third party services to target everything from government agencies, to activists, to Target (again).  Not to be outdone, attention-seeking security companies have been actively hunting for companies exposing customer data, then issuing splashy press releases as a means to flog their products and services.

A key feature of these name-and-shame campaigns is to make sure that the headlines are rounded up to the most popular customer – the clickbait lead “MBM Inc. Loses Customer Data” is nowhere near as catchy as “Walmart Jewelry Partner Exposes Personal Data Of 1.3M Customers.”

While there are scary stories out there, in many, many cases the risks will be outweighed by the rewards. Telling the difference between those innumerable good calls and the one career-limiting move requires thoughtful consideration and some up-front risk mitigation.

When choosing a third party service, keep the following in mind:

    • The security risks of a service are highly dependent on how you use it.  
      You can adjust your usage to decrease your risk.  There’s a big difference between sending a third party your server metrics vs. your customer’s personal information.  Operational metrics are categorically less sensitive than, say, PII or PHI (if you have scrubbed them properly).
    • There’s no way to know how good a service’s security really is.  
      History is full of compromised companies who had very pretty security pages and certifications (here’s Equifax circa September 2017).  Security features are a stronger indicator, but there are a lot more moving parts that go into maintaining a service’s security.
    • Always weigh the risks vs. the rewards.


There’s risk no matter what you do – bringing in the service is risky, doing nothing is risky.  You can only mitigate risks up to a point. Beyond that point, it’s the rewards that make risks worthwhile.

Context is critical in understanding the risks and rewards of a new service.  

You can use the following guide to put things in context as you champion a new service through the gauntlet of management, security, and compliance teams.  That context becomes even more powerful when you can think about the approval process from the perspective of the folks you’ll need to win over to get the okay to move forward.

In the next part of this series Lilly Ryan shares a variety of techniques to take on the perspective of your management, security and compliance teams, enabling you to constructively work through responses that can include everything from “We have concerns…” to “No” to “Oh Helllllllll No.”

Championing a new service is hard – it can be equally worthwhile.  Good luck!


George Chamales is a useful person to have around. Please send critiques of this post to george@criticalsec.com

“A Security Guide for Third Party Services” Worksheet

Note to thoughtful service providers:  You may want to fill parts of this out ahead of time and give it to your prospective customers.  It will provide your champion with good fortune in the compliance wars to come.  (Also available as a nicely formatted spreadsheet.)


Our Reasons

  • Why this service?  This is the justification for the service – the compelling rewards that will outweigh the inevitable risks.  What will be true once the service is online?  Good reasons are ones that a fifth grader would understand.

Our Data

  • What data will it / won’t it collect?  Describe the classes or types of data the service will access / store and why that’s necessary for the service to operate.  If there are specific types of sensitive data the service won’t collect (e.g. passwords, Personally Identifiable Information, Patient Health Information), explicitly call them out.
  • How will data be accessed?  Describe the process for getting data to the service.  Do you have to run their code on your servers, or on your customers’ computers?

Our Costs

  • Costs of NOT doing it?  These are the financial risks / liabilities of not going with this service.  What are the worst-case and average costs?  Have you had costly problems in the past that could have been avoided if you were using this service?
  • Costs of doing it?  Include the cost of the service and, if possible, the amount of person-time it’s going to take to operate it.  Ideally this is less than the cost of not doing it.

Our Risk – how mad will important people be…

  • If it’s compromised?  What would happen if hackers or attention-seeking security companies publicly released the data you sent the service?  Is it catastrophic or an annoyance?
  • When it goes down?  When this service goes down (and it will go down), will it be a minor inconvenience or will it take out your primary application and infuriate your most valuable customers?

Their Security – in order of importance

  • SSO & 2FA support?  This is a security smoke test: if a service doesn’t support SSO or 2FA, it’s safe to assume that they don’t prioritize security.  It’s also a good idea to investigate SSO support up front, since some vendors charge extra for it (which is a shame).
  • Fine-grained permissions?  This is another key indicator of the service’s maturity level, since it takes time and effort to build in.  It’s also something else they might make you pay extra for.
  • Security certifications?  These aren’t guarantees of quality, but they do indicate that the company has put some effort and money into its processes.  Check their website for general security compliance merit badges such as SOC 2 and ISO 27001, or industry-specific ones like PCI and HIPAA.
  • Security & privacy pages?  If these exist, it means the company is willing to publicly state that it does something about security.  The more specific and detailed, the better.
  • Vendor’s security history?  Have there been any spectacular breaches that demonstrated a callous disregard for security, gross incompetence, or both?
  • BONUS questions.  Want to really poke and prod the internal security of your vendor?  Ask if they can answer the following:
    • How many known vulnerabilities (CVEs) exist on your production infrastructure right now?
    • At what time (exactly) was the last successful backup of all your customer data completed?
    • What were the last three secrets accessed in the production environment?

Our Decision

  • Is it worth it?  Look back through the previous sections and ask whether it makes sense to:
    • Use the 3rd party service
    • Build it yourself
    • Not do it at all
    Would a thoughtful person agree with you?



Logs vs Structured Events

I got an interesting tweet the other day from @evntdrvn in response to this thread of mine. Paraphrasing,

“So I’ve almost got our group at work up to Step 1 in your observability maturity model, but some of the devs that I work with want to turn OFF our lovely structured logging in prod for informational-level msgs due to their legacy philosophy (‘we only log errors in prod’). The reasons given are mostly philosophical (“I’m a dev and only interested when things error out, I don’t want any other noise in prod logs”, “I don’t want to slow my app down in prod”). Help?!?”

As I was reading this, I was itching to fly out and dive into battle with Eric. I know exactly where his opinionated devs are coming from. I used to say the same things! I even wrote a whole blog post about it.

These developers have internalized a set of rules and best practices for dealing with output data, in the context of “monolith application development in the early 2000s”.

Monolithic systems assumptions

Those systems had many common constraints and assumptions, such as:

  • We have a monolith service, or a very small number of services. We can model the system in our heads.
  • Logging is done to local disk, which can impact performance
  • Disks are expensive
  • Log lines are spat out inline with execution.  A poorly placed printf can take the whole system down.
  • Investigation is rare, and usually means a human reading error logs.
  • Logging is of poor utility for understanding internal states or execution paths; you should just read the code or use a debugger.  (There are few or no network hops between functions.)
  • Logging is mostly useful for detecting certain terminal crash states or connection errors.

Monolithic logging best practices

Therefore:

  • We should be very stingy in what we log
  • Debuggers should be used for understanding internal states of the code
  • Logs are a last resort and record of crash dumps.  We do not expect to use log data in the course of our daily work.  We assume log-related manual investigation will be infrequent and of limited utility.

These were exactly the right lessons to learn in the era of expensive hardware and monolithic repos/artifacts. Many people still work in environments like this, and follow logging best practices like these. God bless, more power to em.

Distributed systems assumptions

But more and more of us face systems that are very different.

  • We have many services, possibly many MANY services. A representative request will have “many” hops across “many” services and routers and proxies and meshes and storage systems.
  • We cannot model the system in our heads; it would be a mistake to try. We rely on tooling as the source of truth for those systems.
  • You may or may not have access to those services, or the systems your code runs on. There may or may not be a logging facility, or a centralized log aggregator. Your only view of the system is through the instrumentation of your code.
  • Disks and system resources are cheap, ephemeral, all but disposable.
  • Data services are similarly cheap.  We can almost entirely silo application performance off from the cost of writing perf data out.
  • Investigation is prohibitively slow and expensive for a human to do by hand. Many of the nodes or processes we need to inspect may no longer even exist, but their past states may still be relevant to us in understanding patterns to the present time.
  • Investigation should usually be done distributedly, across all instantiations of your code, however many there might be — and in real time.
  • Investigation requires computation — not just string search. We need to ask questions on the fly involving math and percentiles and breakdowns and group by’s.  And we need access to the raw requests in order to run accurate computations — no pre-aggregates.
  • The hardest part usually isn’t debugging the code, it’s figuring out where the code you need to debug even is.  Or what the errors or outliers have in common from the perspective of the code.  Fixing the code itself is often comparatively trivial, once found.
  • What even is ‘logging’?
  • What even is ‘local disk’?

Adapting to this reality isn’t optional: at some point of complexity or scale or distributedness, it becomes necessary if you want to work with these systems at all.

Logs can’t help you here.

And you aren’t going to get that kind of explorable data out of loglevel:ERROR, or by chopping up your telemetry into disconnected metrics devoid of context.

You are only going to get this kind of explorable, ad hoc, computation-friendly data if you take a radically new approach to how you output and aggregate telemetry.  You’re going to need to replace your log lines and log levels with a different sort of beast: arbitrarily wide structured events that describe the request and its context, one event per request per service.

If it helps, don’t think of them as log files any more. Think of them as events. Yes, you can stash this stream in a file, but why would you?  on what disk?  will that work for your serverless functions too?  Just stream them over the network to wherever you want to put them.

Log levels are another confusing and unnecessary artifact of yesteryear that you no longer really need. The more you think of structured events as logs, the more tempted you may be to apply the old set of best practices. So just don’t think of them as logs at all.

How to gather and structure your data

Instead of dribbling little pebbles of log effluvia throughout your code, do this.  (If you’re a honeycomb user, our beelines do it all automatically for you *and* pre-populate the blobs with everything we know of your context.)  A rough sketch of the manual version follows the list below.

  1. Initialize an empty blob at the beginning, when the request first enters the service.
  2. Stuff any and all interesting detail about the request into that blob throughout the lifetime of the request.
    • Any unique id, any high-cardinality variable, any headers passed in, every full query, normalized query, and query execution time; every http call out to a remote service, every http execution time; any shopping cart id, first and last name, execution time — literally anything interesting, append to blob.
  3. Then, when the request is about to exit or error, write the blob off to honeycomb or another service or disk somewhere.
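
To make that concrete, here is a minimal sketch of the pattern in Python with Flask.  It is not the beeline code; the service name, the field names, and the send_event transport are all illustrative assumptions.

```python
# A minimal sketch of "one wide event per request per service".
# EVENT_SINK_URL, the service name, and the fields are placeholders.
import time

import flask
import requests

EVENT_SINK_URL = "https://events.example.com/ingest"  # placeholder endpoint

app = flask.Flask(__name__)

def send_event(event: dict) -> None:
    # Real code would batch, retry, and fail open; this is the bare minimum.
    requests.post(EVENT_SINK_URL, json=event, timeout=2)

@app.before_request
def start_event():
    # 1. Initialize the blob when the request enters the service.
    flask.g.event = {
        "service": "shopping-api",            # hypothetical service name
        "request.path": flask.request.path,
        "request.method": flask.request.method,
        "start_time": time.time(),
    }

def add_context(**fields):
    # 2. Call this from anywhere in the request to stuff interesting detail
    #    (user ids, query timings, cart ids...) into the blob as you go.
    flask.g.event.update(fields)

@app.after_request
def finish_event(response):
    # 3. When the request is about to exit, write the one wide event out.
    event = flask.g.event
    event["duration_ms"] = (time.time() - event.pop("start_time")) * 1000
    event["response.status_code"] = response.status_code
    send_event(event)
    return response
```

The point is the shape: one blob per request, appended to throughout execution, written out exactly once at the end.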

You can see immediately how this method has radically different performance implications and risks than the earlier shotgun spray approach. No more “oops i accidentally put a print line INSIDE a for loop”. The write amplification profile is compressed. Most importantly, the incremental cost of capturing more detail about the request per service is nearly zero.

And now you have the kind of structured data that you can feed into something like a columnar store, or honeycomb, and run ad hoc queries to your heart’s delight.
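
Even without a columnar store handy you can demo the payoff.  Here is a hedged sketch using pandas over a local dump of those events, assuming the hypothetical field names from the sketch above and an events.jsonl export.

```python
# Ad hoc analysis over raw wide events (pandas standing in for a columnar
# store or Honeycomb).  "events.jsonl" and the field names are assumptions.
import pandas as pd

# One wide event per request per service, exported as JSON lines.
events = pd.read_json("events.jsonl", lines=True)

# p95 latency broken down by endpoint and status code: real computation on
# raw events, not pre-aggregated averages of averages.
slowest = (
    events.groupby(["request.path", "response.status_code"])["duration_ms"]
          .quantile(0.95)
          .sort_values(ascending=False)
          .head(10)
)
print(slowest)
```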

Distributed systems logging events best practices:

Let’s sum up.  (I’ve ranted about most of this before, at length.)  The short version:

  • Emit one arbitrarily wide structured event per request, per service.
  • Initialize the event when the request enters the service, append context throughout execution, and write it out when the request exits or errors.
  • Favor high-cardinality detail: unique ids, full and normalized queries, timings of calls out to remote services, anything interesting.
  • Stream events over the network to a single source of truth; don’t assume a local disk, a log file, or log levels.

Just think.

No more doing multi-line regexps trying to look for the same request ID or user ID doing five suspicious things in a row.

No more regexps at all, for fuck’s sake.

No more bullshit percentiles that were computed at write time by averaging over a bunch of other averages.

No more having to jump around from dashboards to logs trying to vainly eyeball-correlate one spike with another. No more wondering why no two tools can agree on whether anything even exists or not.

Just gather the detail you need to ask the questions when you need them, and store it in a single source of truth.  It’s that simple.

No need to shame people for having learned best practices that worked perfectly well for a long time.  You can either let them learn the hard way that this transformation is non-optional, or you can help them learn the easy way that it’s simply much better and easier to invest in this telemetry up front.  You seem like a nice enough chap, which is probably why you chose door 2.  (If you wanted to get tougher about it, have a few reformed folks in to tell their horror stories.  Try some ex-twitter engineers.)

The hardest part seems to be getting people to unlearn all the best practices they once learned for dealing with logs.  So just don’t call it logs anymore, if that helps. Call it “structured events”.

– charity.



Shipping Software Should Not Be Scary

On twitter this week, @srhtcn noted that “Many incidents happen during or right after release” and asked for advice on ways to fix this.

And he’s right!  Rolling out new software is the proximate cause for the overwhelming majority of incidents, at companies of all sizes.  Upgrading software is both a necessity and a minor insanity, considering how often it breaks things.

I’m not going to recap the history of continuous integration and delivery; suffice it to say that we now know smaller and more frequent changes are much safer than larger and less frequent ones.

But it’s still risky.  And most issues are still caused by humans and our pesky need for “improvements”.  So what can be done?

It’s not ok for software releases to be scary and hazardous

First of all: If releasing is risky for you, you need to fix that.  Make this a priority.  Track your failures, practice post mortems, evaluate your on call practices and culture.  Know if you’re getting better or worse.  This is a project that will take weeks if not months until you can be confident in the results.

You have to fix it though, because these things are self-reinforcing.  If shipping changes is scary and fraught, people will do it less and it will get even MORE scary and treacherous.

Likewise, if you turn it into a non-cortisol inducing event and set expectations, engineers will ship their code more often in smaller diffs and therefore break the world less.

Fixing deploys isn’t about eliminating errors, it’s about making your pipeline resilient to errors.  It’s fundamentally about detecting common failures and recovering from them, without requiring human intervention.

Value your tools more

As a short-term patch, you should run deploys in the mornings or whenever everyone is around and fresh.  Then take a hard look at your deploy pipeline.

In too many organizations, deploy code is a technical backwater, an accumulation of crufty scripts and glue code, forked gems and interns’ earnest attempts to hack up Capistrano.  It usually gives off a strong whiff of “sloppily evolved from many 2 am patches with no code review”.

This is insane.  Deploy software is the most important software you have.  Treat it that way: recruit an owner, allocate real time for development and testing, bake in metrics and track them over time.

If it doesn’t have an owner, it will never improve.  And you will need to invest in frequent improvements even after you’re over this first hump.

  • Signal high organizational value by putting one of your best engineers on it.
  • Recruit help from the design side of the house as well.  The “right” thing to do must be the fastest, easiest thing to do, with friendly prompts and good docs.  No “shortcuts” for people to reach for at the worst possible time.  You need user research and design here.
  • Track how often deploys fail and why.  Managers should pay close attention to this metric, just like the one for people getting interrupted or woken up, and allocate time to fixing this early whenever it sags.  Before it gets bad.
  • Allocate real time for development, testing, and training — don’t expect the work to get shoved into people’s “spare time” or post mortem cleanup time.  Make sure other managers understand the impact of this work and are on board.  Make this one of your KPIs.

In other words, make deploy tools a first class citizen of your technical toolset.  Make the work prestigious and valued — even aspirational.  If you do performance reviews, recognize the impact there.

(Btw, “how we hardened our deploys” is total Velocity-bait (&& other practitioner conferences) as well as being great for recruiting and general visibility in blog post form.  People love these stories; there definitely aren’t enough of them.)

Turn software engineers into software owners

The canonical CI/CD advice starts with “ship early, ship often, ship smaller change sets”.  That’s great advice: you should definitely do those things.  But they are covered plenty elsewhere.  What’s software ownership?

Software ownership is the natural end state of DevOps.  Software engineers, operations engineers, platform engineers, mobile engineers — everyone who writes code should own the full lifecycle of their software.

Software owners are people who:

  1. Write code
  2. Can deploy and roll back their own code
  3. Are able to debug their own issues in prod (via instrumentation, not ssh)

If you’re lacking any one of those three ingredients, you don’t have ownership.

Why ownership?  Because software ownership makes for better engineers, better software, and a better experience for customers.  It shortens feedback loops and means the person debugging is usually the person with the most context on what has recently changed.

Some engineers might balk at this, but you’ll be doing them a favor.  We are all distributed systems engineers now, and distributed systems require a much higher level of operational literacy.  May as well start today.

Fail fast, fix fast

This is about shifting your mindset from one of brittleness and a tight grip, to one of flexibility where failures are no big deal because they happen all the time, don’t impact users, and give everyone lots of practice at detecting and recovering from them.

Here are a few of the best practices to adopt along with this mindset.

Make operability a high-value skill set.  Never promote someone to “senior engineer” if they can’t deploy and debug their own code.

Software engineers don’t have to become operational experts.  They do need to know the bare basics of instrumentation, deploy/revert, and debugging.

Everyone who puts software in production needs to understand and feel responsible for the full lifecycle of their code, not just how it works in their IDE.

Baking: it’s not just for cookies

Shipping something to production is a process of incrementally gaining confidence, not a switch you can flip.

You can’t trust code until it’s been in prod a while, until you’ve seen it perform under a wide range of load and concurrency scenarios, in lots of partial failure modes.  Only over time can you develop confidence in it not being terrible.

Nothing is production except production.  Don’t rely on never failing; expect failure, embrace failure.  Practice failure!  Build guard rails around your production systems to help you find and fix problems quickly.

The changes you need to make your pipeline more resilient are roughly the same changes you need to safely test in production.  These are a few of your guard rails.

  • Use feature flags to switch new code paths on and off
  • Build canaries for your deploy process, so you can promote releases gracefully and automatically to larger subsets of your traffic as you gain confidence in them
  • Create cohorts.  Deploy to internal users first, then any free tier, etc in order of ascending importance.  Don’t jump from 10% to 25% to 50% and then 100% — some changes are related to saturating backend resources, and the 50%-100% jump will kill you.
  • Have robots check the health of your software as it rolls out to decide whether to promote the canary.  Over time the robot checks will mature and eventually catch a ton of problems and regressions for you.  (A rough sketch follows this list.)
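
That last bullet can start out embarrassingly simple.  Here is a rough Python sketch of a canary health check; the Prometheus endpoint, the metric and label names, and the thresholds are all assumptions standing in for whatever your monitoring system actually exposes.

```python
# A rough sketch of "have robots check the canary", not a drop-in tool.
# Assumes a Prometheus-style query API and an illustrative metric scheme
# (http_requests_total{deployment=...}); adapt to your own stack.
import sys

import requests

PROM_URL = "http://prometheus.internal:9090/api/v1/query"  # hypothetical host

def error_ratio(deployment: str, window: str = "5m") -> float:
    # 5xx responses divided by all responses over the window, per deployment.
    query = (
        f'sum(rate(http_requests_total{{deployment="{deployment}",code=~"5.."}}[{window}]))'
        f' / sum(rate(http_requests_total{{deployment="{deployment}"}}[{window}]))'
    )
    resp = requests.get(PROM_URL, params={"query": query}, timeout=5)
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def canary_is_healthy(max_ratio: float = 1.5) -> bool:
    # Promote only if the canary isn't meaningfully worse than the stable cohort.
    return error_ratio("api-canary") <= max(error_ratio("api-stable") * max_ratio, 0.001)

if __name__ == "__main__":
    # Exit nonzero so the deploy pipeline can halt the rollout or roll back.
    sys.exit(0 if canary_is_healthy() else 1)
```

Run it from the deploy pipeline after each traffic bump: exit 0 promotes the canary, nonzero halts or rolls back.  Over time you add more checks (latency, saturation, key business metrics) and the robot catches regressions before your customers do.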

The quality of code is not knowable before it hits production.  You may be able to spot some problems, but you can never guarantee a lack of them.  It takes time to bake a new release and gain incremental confidence in new code.

In summary.

  1. Get someone to own the deploy software
  2. Value the work
  3. Create a culture of software ownership
  4. LOOK at what you’ve done after you do it
  5. Be suspicious of new versions until they prove themselves


Two blog posts in one weekend!  That’s definitely never happened before.  Thanks to Baron for asking me to draft this up following the weekend’s twitter thread: https://twitter.com/mipsytipsy/status/1030340072741064704.



The Accidental DBA

This morning there was yet another comment thread on hacker news about Yet Another outage involving MongoDB and data loss, this time by some company called “CleverTap”.

Recap

To summarize: the CleverTap engineering team noticed that the WiredTiger storage engine was faster than MMAPv1 for MongoDB.  They decided to … “upgrade the following weekend” (that sentence alone made my eyes bulge).

According to the blog post, they upgraded from 2.6 to 3.0 while simultaneously changing storage engines from MMAPv1 to WiredTiger, leaving zero secondaries or snapshot nodes with the data on MMAPv1.  All over the course of 3 days.

(They are also running sharded mongo, with a mere 300 ops/sec on each primary, which RAISES A LOT OF QUESTIONS but I already feel like I’m beating up on these kids so I won’t pursue that.)

Questions …

(But seriously what the *hell* can you be doing to have such a low request rate, that you need to shard at an infinitesimal volume?  Why did you specify it in req/min instead of req/sec?  What is the breakdown of reads/writes?  What is the lock percentage?  What is the avg object size??  Are these like multi-MB documents????  Why did you pause all incoming traffic and process it after the upgrade?  If the primary can’t take the extra load, why not rs.syncFrom() a secondary?   If that doesn’t work, don’t you have other, bigger problems??)

Most bafflingly of all: why wait only a few minutes after electing a new WiredTiger primary for the first time ever, and then immediately DELETE your only known-good copies of the data on MMAPv1 and re-sync over them with WiredTiger?

Accidental DBAs

Okay.  So here’s the thing: you are clearly a team of accidental DBAs.  You are operations and software engineers who have found yourselves in charge of the data.

It’s cool.  I am too!  It’s a really neat and fun place to be in.  DBAs and network admins are kind of the last remaining priesthoods in our industry.

There’s a lot of powerful and fun stuff to be done for generalists who pick up specialty knowledge in one of those areas, or specialists (like my neteng friend Leslie) who start bringing their skills back to the generalist side and merging the two.

(Oh Right, We Wrote A Book About This!!!)

My friend Laine and I are writing a book for people on the data side, called “Database Reliability Engineering”, which is aimed at generalist engineers who want to learn how to deal with data responsibly and effectively.

(Actually that’s a good point, I am supposed to be pitching this book! — which is really mostly Laine with a smidgen of me, but it’s going to be super awesome.  Consider this your sales pitch.)

So first, as an accidental DBA, you should obviously buy this book  :).  Second: stateful services require a different mindset[*].  It’s cool that you are running your own databases!  But reading post mortems like this where the conclusion is “MongoDB sucks” makes me fucking grind my teeth.

Stop treating your databases like stateless services.

There are lots of ways that MongoDB (and every other database on the planet) really sucks.  Mongo set themselves up for special rage by overpromising too much early on, and seeming tone deaf to criticism from real database engineers.

But *I* can criticize Mongo all day long.  You children on hacker news who have never run it don’t get to. 😛  If you don’t know what the fuck you’re talking about, if you’re cargo culting other people’s years-old complaints, just shut up already.

Managing stateful services like databases means that you need to be more paranoid than you had to be with stateless services.  With stateless services the best practices are to roll early, roll fast, roll often, roll back.  When you’re dealing with state, you need to be careful.

With stateful services you can’t play it fast and loose like that.  You’re going to have data loss, corruption, unpredictable results, catastrophic failures that you can’t simply roll back from.  Data loss can be ruinous to your company.  (This can also be true for stateless services that sit close to your data and mutate it a lot.)

But that’s what makes it fun.  🙂

Be paranoid.

When we were moving from MMAPv1 to RocksDB at Parse, we ran hybrid replica sets for 6-9 months.  We were paranoid.  It was justified!  We spent half a year capturing production workloads and replaying them, electing Rocks primaries and rolling back, and even then keeping snapshots and secondaries of both storage engines for *months*.

This isn’t because MongoDB sucks.  It’s the nature of the game, it’s the difference between stateful and stateless services.

Do you know that there was a total query engine rewrite in 2.6?  We spent months flushing out tons of crazy bugs.  Do you know about the index intersection changes?  We helped chase down bugs in those too.  (You’re welcome.)

You can’t just go “dudes it’s faster” and jump off a cliff.  This shit is basic.  Test real production workloads. Have a rollback plan.  (Not for *10 days* … try a month or two.)

Lessons

If CleverTap had run their plan past anyone experienced with data, they would have called out all of those completely predictable failures, and advised them to change it:

  1. Make one change at a time.  Do a major version upgrade separately from the storage engine upgrade.
  2. Delay between each change.  Two weeks is absolutely minimal; anything less is careless.  Let them bake.
  3. Storage engine changes are scary.  It takes years to gain confidence in a new way of laying bits down on disk.  (Whenever people bitch and moan about mongo, I remind them that I’ve still lost WAY more data to MyISAM, InnoDB, and MySQL overall than Mongo.)
  4. You can run lots and lots of replicas per replica set in Mongo (up to 7 voting members, and even more non-voting nodes).  This is a killer feature.  Why didn’t you use it?
  5. Keep backups around for months in the new storage engine *and* the old storage engine, just in case.  Have two hidden snapshot nodes (rough sketch below).  The only cost is in dollars, which is fucking cheap compared to data or engineering time.
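
To make lessons 4 and 5 concrete, here is a hedged pymongo sketch of adding a hidden, priority-0 snapshot member to a replica set.  Hostnames are placeholders, and you would obviously run this against a test cluster long before production.

```python
# A rough sketch of lesson 5: add a hidden "snapshot" member to the replica
# set so you always have a non-serving copy of the data (e.g. still on the
# old storage engine).  Hostnames are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://primary.internal:27017")

cfg = client.admin.command("replSetGetConfig")["config"]
cfg["version"] += 1
cfg["members"].append({
    "_id": max(m["_id"] for m in cfg["members"]) + 1,
    "host": "snapshot-mmapv1.internal:27017",  # node still running the old engine
    "hidden": True,    # invisible to clients, never serves reads
    "priority": 0,     # never eligible to become primary
    "votes": 0,        # optional: keep it out of elections entirely
})
client.admin.command("replSetReconfig", cfg)
```

A hidden member never serves client reads and can never be elected primary, so it can sit on the old storage engine (or hold delayed snapshots) purely as insurance.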

If you are a new accidental DBA, you have to make a point of learning things.  Go to conferences.  Read books.  Buy bottles of whiskey for your data friends and pick their brains.  Remember that they know things you do not.  Don’t blame the vendors when you fucked up.

Network engineering is the same way, but mistakes tend to be a lot less … permanent.  You drop some packets… like grains of sand. ^_^

Remember that you’re in charge of keeping people’s data safe and secure.  You have much to learn.  Learn it.

And get off my fucking lawn.  <3

Some slides from a couple of relevant talks I’ve given on the subject:


[*] P.S.:  “Stop treating your stateful services like stateless services” … this is a fact, but it’s not the aspiration.  DB folks should all be leaning in to the model of learning to treat our stateful services like stateless services, with the same casual disregard for individual nodes.  This is hard, and it’s going to take some time, but it’s clearly where the world is heading and it’s definitely a good thing.  🙂  The learning goes both ways!


