Friday Deploy Freezes Are Exactly Like Murdering Puppies

VOICEOVER: “Previously, on twitter …”

So, that happened.

I hadn’t seen anyone say something like this in quite a while.  I remember saying things like this myself as recently as, oh, 2016, but I thought the zeitgeist had moved on to continuous delivery.

Which is not to say that Friday freezes don’t happen anymore, or even that they shouldn’t; I just thought that this was no longer seen as a badge of responsibility and honor, but rather a source of mild embarrassment.  (Much like the fact that you still don’t automatically restore your db backups and verify them every night.  Do you.)

So I responded with an equally hyperbolic and indefensible claim:

Now obviously, OBVIOUSLY, reassigning all your developer cycles is probably a terrible idea.  You don’t get 100x parallel efficiency if you put 100 developers on a single problem.  So I thought it was clear that this was said somewhat tongue in cheek, serious-but-not-really.  I was wrong there too.

So let me explain.

There’s nothing morally “wrong” with Friday freezes.  But it is a costly and cumbersome bandage for a problem that you would be better served to address directly.  And if your stated goal is to protect people’s off hours, this strategy is likely to sabotage that goal and cause them to waste far more time and get woken up much more often, and it stunts your engineers’ technical development on top of that.

Fear is the mind-killer.

Fear of deploys is the ultimate technical debt.  How much time does your company waste, between engineers:

  • waiting until it is “safe” to deploy,
  • batching up changes into bigger changes that are decidedly unsafe to deploy,
  • debugging broken deploys that had many changes batched into them,
  • waiting nervously to get paged after a deploy goes out,
  • figuring out if now is a good time to deploy or not,
  • cleaning up terrible deploy-related catastrophuckes

Anxiety related to deploys is the single largest source of technical debt in many, many orgs.  Technical debt, lest we forget, is not the same as “bad code”.  Tech debt hurts your people.

Saying “don’t push to production” is a code smell.  Hearing it once a month at unpredictable intervals is concerning.  Hearing it EVERY WEEK for an ENTIRE DAY OF THE WEEK should be a heartstopper alarm.  If you’ve been living under this policy you may be numb to its horror, but just because you’re used to hearing it doesn’t make it any less noxious.

If you’re used to hearing it and saying it on a weekly basis, you are afraid of your deploys and you should fix that.

If you are a software company, shipping code is your heartbeat.  Shipping code should be as reliable and sturdy and fast and unremarkable as possible, because this is the drumbeat by which value gets delivered to your org.

Deploys are the heartbeat of your company.

Every time your production pipeline stops, it is a heart attack.  It should not be ok to go around nonchalantly telling people to halt the lifeblood of their systems based on something as pedestrian as the day of the week.

Why are you afraid to push to prod?  Usually it boils down to one or more factors:

  • your deploys frequently break, and require manual intervention just to get to a good state
  • your test coverage is not good, your monitoring checks are not good, so you rely on users to report problems back to you and this trickles in over days
  • recovering from deploys gone bad can regularly cause everything to grind to a halt for hours or days while you recover, so you don’t want to even embark on a deploy without at least 24 hours of the work week ahead of you
  • your deploys are painfully slow, and take hours to run tests and go live.

These are pretty darn good reasons.  If this is the state you are in, I totally get why you don’t want to deploy on Fridays.  So what are you doing to actively fix those states?  How long do you think these emergency controls will be in effect?

The answers of “nothing” and “forever” are unacceptable.  These are eminently fixable problems, and the amount of drag they create on your engineering team and its ability to execute is the equivalent of a five-alarm fire.

Fix. That.  Take some cycles off product and fix your fucking deploy pipeline.

If you’ve been paying attention to the DORA report or Accelerate, you know that the way you address the problem of flaky deploys is NOT by slowing down or adding roadblocks and friction, but by shipping more QUICKLY.

Science says: ship fast, ship often.

Deploy on every commit.  Smaller, coherent changesets transform into debuggable, understandable deploys.  If we’ve learned anything from recent research, it’s that velocity of deploys and lowered error rates are not in tension with each other, they actually reinforce each other.  When one gets better, the other does too.

So by slowing down or batching up or pausing your deploys, you are materially contributing to the worsening of your own overall state.

If you block devs from merging on Fridays, then you are sacrificing a fifth of your velocity and overall output.  That’s a lot of fucking output.

If you do not block merges on Fridays, and only block deploys, you are queueing up a bunch of changes to all get shipped days later, long after the engineers wrote the code and have forgotten half of the context.  Any problems you encounter will be MUCH harder to debug on Monday in a muddled blob of changes than they would have been just shipping crisply, one at a time on Friday.  Is it worth sacrificing your entire Monday?  Monday-Tuesday?  Monday-Tuesday-Wednesday?

Good judgment matters more than rules.

I am not saying that you should make a habit of shipping a large feature at 4:55 pm on Friday and then sauntering out the door at 5.  For fuck’s sake.  Every engineer needs to learn and practice good technical judgment around deploy hygiene.  Like:

  • Don’t ship before you walk out the door on *any* day.
  • Don’t ship big, gnarly features right before the weekend, if you aren’t going to be around to watch them.
  • Instrument your code, and go and LOOK at the damn thing once it’s live.
  • Use feature flags and other tools that separate turning on code paths from deploys.  (A minimal sketch follows this list.)
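To make that last point concrete, here’s a rough sketch of what decoupling deploy from release can look like.  The flag store is just environment variables so the example stays self-contained, and the checkout functions are invented for illustration; in real life you’d reach for a proper flag service or a flags table.

```python
import os

def flag_enabled(name: str, default: bool = False) -> bool:
    """Read a feature flag.  The 'flag store' here is just environment
    variables to keep the sketch self-contained; swap in your real
    flag service or database lookup."""
    value = os.environ.get(f"FLAG_{name.upper()}")
    if value is None:
        return default
    return value.lower() in ("1", "true", "on")

def new_checkout_flow(cart):      # hypothetical new code path
    return {"path": "new", "items": cart}

def legacy_checkout_flow(cart):   # hypothetical existing code path
    return {"path": "legacy", "items": cart}

def checkout(cart):
    # The new path ships dark: deployed to production, but not released
    # until someone flips the flag -- no redeploy required.
    if flag_enabled("new_checkout_flow"):
        return new_checkout_flow(cart)
    return legacy_checkout_flow(cart)
```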

But you don’t need rules for this; in fact, rules actually inhibit the development of good judgment!

Most deploy-related problems are readily obvious, if the person who has the context for the change in their heads goes and looks at it.

But if you aren’t looking for them, then sure — you probably won’t find out until user reports start to trickle in over the next few days.

So go and LOOK.

Stop shipping blind.  Actually LOOK at what you ship.
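What does “go and LOOK” mean mechanically?  At minimum, something like the sketch below: compare the error rate of the build you just shipped against the last known good build.  The query function is a stand-in for whatever your observability tool’s API actually looks like (the numbers are hard-coded so the sketch runs); the shape of the check is the point, not the vendor.

```python
def error_rate_for_build(build_id: str, window_minutes: int = 15) -> float:
    """Placeholder for a query against your observability tooling:
    'errors / total requests for this build_id over the last N minutes'.
    Hard-coded values here just so the sketch is runnable."""
    fake_results = {"build-1041": 0.002, "build-1042": 0.014}
    return fake_results.get(build_id, 0.0)

def check_deploy(new_build: str, last_good_build: str, tolerance: float = 2.0) -> bool:
    baseline = error_rate_for_build(last_good_build)
    current = error_rate_for_build(new_build)
    healthy = current <= max(baseline * tolerance, 0.001)
    verdict = "looks fine" if healthy else "go look at it NOW"
    print(f"{new_build}: {current:.2%} errors vs baseline {baseline:.2%} -> {verdict}")
    return healthy

if __name__ == "__main__":
    check_deploy(new_build="build-1042", last_good_build="build-1041")
```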

I mean, if it takes 48 hours for a bug to show up, then maybe you better freeze deploys on Thursdays too, just to be safe!  🙄

I get why this seems obvious and tempting.  The “safety” of nodeploy Friday is realized immediately, while the costs are felt later.  They’re felt when you lose Monday (and Tuesday) to debugging the big blob deploy.  Or they get amortized out over time.  Or you experience them as sluggish ship rates and a general culture of fear and avoidance, or learned helplessness, and the broad acceptance of fucked up situations as “normal”.

But if recovering from deploys is long and painful and hard, then you should fix that.  If you don’t tend to detect reliability events until long after the event, you should fix that.  If people are regularly getting paged on Saturdays and Sundays, they are probably getting paged throughout the night, too.  You should fix that.

On call paging events should be extremely rare.  There’s no excuse for on call being something that significantly impacts a person’s life on the regular.  None.

I’m not saying that every place is perfect, or that every company can run like a tech startup.  I am saying that deploy tooling is systematically underinvested in, and we abuse people far too much by paging them incessantly and running them ragged, because we don’t actually believe it can be any better.

It can.  If you work towards it.

Devote some real engineering hours to your deploy pipeline, and some real creativity to your processes, and someday you too can lift the Friday ban on deploys and relieve your oncall from burnout and increase your overall velocity and productivity.

On virtue signaling

Finally, I heard from an alarming number of people who admitted that Friday deploy bans were useless or counterproductive, but they supported them anyway as a purely symbolic gesture to show that they supported work/life balance.

This makes me really sad.  I’m … glad they want to support work/life balance, but surely we can come up with some other gestures that don’t work directly counter to that very goal.

Recovery: building a healthy deploy culture

Ways to begin recovering from a toxic deploy culture:

  • Have a deploy philosophy, make sure everybody knows what it is.  Be consistent.
  • Build and deploy on every set of committed changes.  Do not batch up multiple people’s commits into a deploy.
  • Train every engineer so they can run their own deploys, if they aren’t fully automated.  Make every engineer responsible for their own deploys.
  • (Work towards fully automated deploys.)
  • Every deploy should be owned by the developer who made the changes that are rolling out.  Page the person who committed the change that triggered the deploy, not whoever is oncall.  (See the paging sketch after this list.)
  • Set expectations around what “ownership” means.  Provide observability tooling so they can break down by build id and compare the last known stable deploy with the one rolling out.
  • Never accept a diff if there’s no explanation for the question, “how will you know when this code breaks?  how will you know if the deploy is not behaving as planned?”  Instrument every commit so you can answer this question in production.
  • Shipping software and running tests should be fast.  Super fast.  Minutes, tops.
  • It should be muscle memory for every developer to check up on their deploy and see if it is behaving as expected, and if anything else looks “weird”.
  • Practice good deploy hygiene using feature flags.  Decouple deploys from feature releases.  Empower support and other teams to flip flags without involving engineers.
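And to make “page the person who committed the change” concrete, here’s a rough sketch of the routing logic.  The git lookup assumes a local checkout, and `page()` stands in for whatever your paging provider’s API actually is; the point is that the deploy event carries the commit, and the commit tells you who to wake up.

```python
import subprocess

def commit_author(commit_sha: str) -> str:
    """Look up the author email for a commit in the local checkout.
    A real pipeline would more likely ask your git host's API."""
    return subprocess.check_output(
        ["git", "show", "-s", "--format=%ae", commit_sha], text=True
    ).strip()

def page(target: str, message: str) -> None:
    """Stand-in for your paging provider's API (PagerDuty, Opsgenie, ...)."""
    print(f"PAGE -> {target}: {message}")

def handle_unhealthy_deploy(deploy_id: str, commit_sha: str, oncall: str) -> None:
    # Page the person who owns the change; fall back to the oncall only
    # if the author can't be resolved (e.g. the commit came from a bot).
    try:
        target = commit_author(commit_sha)
    except (subprocess.CalledProcessError, FileNotFoundError):
        target = oncall
    page(target, f"deploy {deploy_id} ({commit_sha[:8]}) is misbehaving, please go look")
```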

Each deploy should be owned by the developer who made the code changes.  But your deploy pipeline needs to have a team that owns it too.  I recommend putting your most experienced, senior developers on this problem to signal its high value.

You can find more tips for boring deploys in my piece on why shipping software should not be scary.

Good teams ship often.

Ultimately, I am not dogmatic about Friday deploys.  Truly, I’m not.  If that’s the only lever you have to protect your time, use it.  But call it and treat it like the hack it is.  It’s a gross workaround, not an ideal state.

Don’t let your people settle into the idea that it’s some kind of moral stance instead of a butt-ugly hack.  Because if you do you will never, ever get rid of it.

Remember: a team’s maturity and efficiency can be represented by how long it takes to get their shit into users’ hands after they write it.  Ship it fast, while it’s still fresh in your developers’ heads.  Ship one change set at a time, so you can swiftly debug and revert them.  I promise your lives will be so much better.  Every step helps.  <3

charity.



Outsource Your O11y: Now Roll It Out And Keep Them Happy (part 3/3)

This is part three of a three-part series of guest posts:

  1. How To Be A Champion, on how to choose a third-party vendor and champion them successfully to your security team.  (George Chamales)
  2. Get Aligned With Security, how to work with your security team to find the best possible outcome for all sides (Lilly Ryan)
  3. Now Roll It Out And Keep Them Happy, on how to operationalize your service by rolling out the integration and maintaining it — and the relationship with your security team — over the long run (Andy Isaacson)

All this pain will someday be worth it.  🙏❤️  charity + friends


“Now Roll It Out And Keep Them Happy”

This is the third in a series of blog posts; previously we analyzed the security challenges of using a third party service, and we worked together with the security team to build empathy to deliver the project.  You might want to read those first, since we are going to build on a lot of the ideas there to ship and maintain this integration.

Ready for launch

You’ve convinced the security team and other stakeholders, you’ve gotten the integration running, you’re getting promising results from dev-test or staging environments… now it’s time to move from proof-of-concept to full implementation.  Depending on your situation this might be a transition from staging to production, or it might mean increasing a feature flipper flag from 5% to 100%, or it might mean increasing coverage of an integration from one API endpoint to cover your entire developer footprint.
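For the “5% to 100%” style of rollout, the usual trick is to hash a stable identifier and compare it to the current rollout percentage, so any given user stays consistently in or out as you turn the dial.  A rough sketch (the flag name is invented, and you’d store the percentage somewhere central rather than hard-coding it):

```python
import hashlib

def in_rollout(user_id: str, flag: str, percent: int) -> bool:
    """Deterministically bucket a user into 0..99 for this flag, so ramping
    5% -> 25% -> 100% only ever adds users and never reshuffles them."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent

# e.g. ramping a (hypothetical) vendor integration flag:
for pct in (5, 25, 100):
    print(f"at {pct}%: user-42 enabled = {in_rollout('user-42', 'vendor_integration', pct)}")
```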

Taking into account Murphy’s Law, we expect that some things will go wrong during the rollout.  Perhaps as coverage expands, a developer realizes that the schema designed to handle the app’s event mechanism can’t represent a scenario, requiring a redesign or a hacky solution.  Or perhaps the metrics dashboard shows elevated error rates from the API frontend, and while there’s no smoking gun, the ops oncall decides to roll back the integration Just In Case it’s causing the incident.

This gives us another chance to practice empathy — while it’s easy, wearing the champion hat, to dismiss any issues found by looking for someone to blame, ultimately this poisons trust within your organization and will hamper success.  It’s more effective, in the long run (and often even in the short run), to find common ground with your peers in other disciplines and teams, and work through to solutions that satisfy everybody.

Keeping the lights on

In all likelihood as integration succeeds, the team will rapidly develop experts and expertise, as well as idiomatic ways to use the product.  Let the experts surprise you; folks you might not expect can step up when given a chance.  Expertise flourishes when given guidance and goals; as the team becomes comfortable with the integration, explicitly recognize a leader or point person for each vendor relationship.  Having one person explicitly responsible for a relationship lets them pay attention to those vendor emails, updates, and avoid the tragedy of the “but I thought *you* were” commons.  This Integration Lead is also a center of knowledge transfer for your organization — they won’t know everything or help every user come up to speed, but they can help empower the local power users in each team to ramp up their teams on the integration.

As comfort grows you will start to consider ways to change your usage, for example growing into new kinds of data.  This is a good time to revisit that security checklist — does the change increase PII exposure to your vendor?  Would the new data lead to additional requirements such as per-field encryption?  Don’t let these security concerns block you from gaining valuable insight using the new tool, but do take the chance to talk it over with your security experts as appropriate.

Throughout this organic growth, the Integration Lead remains core to managing your changing profile of usage of the vendor they shepherd; as new categories of data are added to the integration, the Lead has responsibility to ensure that the vendor relationship and risk profile are well matched to the needs that the new usage (and presumably, business value) is placing on the relationship.

Documenting the Integration Lead role and responsibilities is critical. The team should know when to check in, and writing it down helps it happen.  When new code has a security implication, or a new use case potentially amplifies the cost of an integration, bringing the domain expert in will avoid unhappy surprises.  Knowing how to find out who to bring in, and when to bring them in, will keep your team getting the right eyes on their changes.

Security threats and other challenges change over time, too.  Collaborating with your security team so that they know what systems are in use helps your team take note of new information that is relevant to your business. A simple example is noting when your vendors publish a breach announcement, but more complex examples happen too — your vendor transitions cloud providers from AWS to Azure and the security team gets an alert about unexpected data flows from your production cluster; with transparency and trust such events become part of a routine process rather than an emergency.

It’s all operational

Monitoring and alerting is a fact of operations life, and this has to include vendor integrations (even when the vendor integration is a monitoring product.)  All of your operations best practices are needed here — keep your alerts clean and actionable so that you don’t develop pager fatigue, and monitor performance of the integration so that you don’t get blindsided by a creeping latency monster in your APIs.
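One lightweight way to do that is to wrap every outbound call to the vendor so you always record latency and success/failure, whatever the vendor is.  A sketch, with `emit_metric` standing in for your real metrics or event pipeline and the vendor call itself faked out:

```python
import time
from functools import wraps

def emit_metric(name: str, value: float, tags: dict) -> None:
    """Placeholder: send to your metrics system or structured event pipeline."""
    print(f"{name}={value * 1000:.1f}ms {tags}")

def instrument_vendor_call(vendor: str):
    """Decorator recording latency and outcome for every call to a vendor."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            ok = True
            try:
                return fn(*args, **kwargs)
            except Exception:
                ok = False
                raise
            finally:
                emit_metric("vendor.call.duration",
                            time.monotonic() - start,
                            {"vendor": vendor, "ok": ok})
        return wrapper
    return decorator

@instrument_vendor_call("example-vendor")
def send_batch(events):
    time.sleep(0.01)          # pretend network call to the vendor
    return {"accepted": len(events)}
```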

Authentication and authorization are changing as the threat landscape evolves and industry moves from SMS verification codes to U2F/WebAuthn.  Does your vendor support your SSO integration?  If they can’t support the same SSO that you use everywhere else and can’t add it — or worse, look confused when you mention SSO — that’s probably a sign you should consider a different vendor.

A beautiful sunset

Have a plan beforehand for what needs to be done should you stop using the service.  Got any mobile apps that depend on APIs that will go away or start returning permission errors?  Be sure to test these scenarios ahead of time.

What happens at contract termination to data stored on the service?  Do you need to explicitly delete data when ceasing use?

Do you need to remove integrations from your systems before ending the commercial relationship, or can the technical shutdown and business shutdown run in parallel?

In all likelihood these are contingency plans that will never be needed, and they don’t need to be fully fleshed out to start, but a little bit of forethought can avoid unpleasant surprises.

Year after year

Industry best practice and common sense dictate that you should revisit the security questionnaire annually (if not more frequently). Use this chance to take stock of the last year and check in — are you getting value from the service?  What has changed in your business needs and the competitive landscape? 

It’s entirely possible that a new year brings new challenges, which could make your current vendor even more valuable (time to negotiate a better contract rate!) or could mean you’d do better with a competing service.  Has the vendor gone through any major changes?  They might have new offerings that suit your needs well, or they may have pivoted away from the features you need. 

Check in with your friends on the security team as well; standards evolve, and last year’s sufficient solution might not be good enough for new requirements.

 

Andy thinks out loud about security, society, and the problems with computers on Twitter.


 

❤️ Thanks so much for reading, folks.  Please feel free to drop any complaints, comments, or additional tips to us in the comments, or direct them to me on twitter.

Have fun!  Stay (a little bit) Paranoid!!

— charity


Logs vs Structured Events

I got an interesting tweet the other day from @evntdrvn in response to this thread of mine. Paraphrasing,

“So I’ve almost got our group at work up to Step 1 in your observability maturity model, but some of the devs that I work with want to turn OFF our lovely structured logging in prod for informational-level msgs due to their legacy philosophy (‘we only log errors in prod’). The reasons given are mostly philosophical (“I’m a dev and only interested when things error out, I don’t want any other noise in prod logs”, “I don’t want to slow my app down in prod”). Help?!?”

As I was reading this, I was itching to fly out and dive into battle with Eric. I know exactly where his opinionated devs are coming from. I used to say the same things! I even wrote a whole blog post about it.

These developers have internalized a set of rules and best practices for dealing with output data, in the context of “monolith application development in the early 2000s”.

Monolithic systems assumptions

Those systems had many common constraints and assumptions, such as:

  • We have a monolith service, or a very small number of services. We can model the system in our heads.
  • Logging is done to local disk, which can impact performance
  • Disks are expensive
  • Log lines are spat out inline with execution.  A poorly placed printf can take the whole system down.
  • Investigation is rare, and usually means a human reading error logs.
  • Logging is of poor utility for understanding internal states or execution paths; you should just read the code or use a debugger.  (There are few or no network hops between functions.)
  • Logging is mostly useful for detecting certain terminal crash states or connection errors.

Monolithic logging best practices

Therefore:

  • We should be very stingy in what we log
  • Debuggers should be used for understanding internal states of the code
  • Logs are a last resort and record of crash dumps.  We do not expect to use log data in the course of our daily work.  We assume log-related manual investigation will be infrequent and of limited utility.

These were exactly the right lessons to learn in the era of expensive hardware and monolithic repos/artifacts. Many people still work in environments like this, and follow logging best practices like these. God bless, more power to em.

Distributed systems assumptions

But more and more of us face systems that are very different.

  • We have many services, possibly many MANY services. A representative request will have “many” hops across “many” services and routers and proxies and meshes and storage systems.
  • We cannot model the system in our heads; it would be a mistake to try. We rely on tooling as the source of truth for those systems.
  • You may or may not have access to those services, or the systems your code runs on. There may or may not be a logging facility, or a centralized log aggregator. Your only view of the system is through the instrumentation of your code.
  • Disks and system resources are cheap, ephemeral, all but disposable.
  • Data services are similarly cheap.  We can almost entirely silo application performance off from the cost of writing perf data out.
  • Investigation is prohibitively slow and expensive for a human to do by hand. Many of the nodes or processes we need to inspect may no longer even exist, but their past states may still be relevant to us in understanding patterns to the present time.
  • Investigation should usually be done distributedly, across all instantiations of your code, however many there might be — and in real time
  • Investigation requires computation — not just string search. We need to ask questions on the fly involving math and percentiles and breakdowns and group by’s.  And we need access to the raw requests in order to run accurate computations — no pre-aggregates.
  • The hardest part isn’t usually debugging the code, it’s figuring out where is the code you need to debug. Or what the errors or outliers have in common from the perspective of the code.  Fixing the code itself is often comparatively trivial, once found.
  • What even is ‘logging’?
  • What even is ‘local disk’?

This isn’t optional: at some point of complexity or scale or distributedness, it becomes necessary if you want to work with these systems.

Logs can’t help you here.

And you aren’t going to get that kind of explorable data out of loglevel:ERROR, or by chopping up your telemetry into disconnected metrics devoid of context.

You are only going to get this kind of explorable, ad hoc, computation-friendly data if you take a radically new approach to how you output and aggregate telemetry.  You’re going to need to replace your log lines and log levels with a different sort of beast: arbitrarily wide structured events that describe the request and its context, one event per request per service.

If it helps, don’t think of them as log files any more. Think of them as events. Yes, you can stash this stream in a file, but why would you?  on what disk?  will that work for your serverless functions too?  Just stream them over the network to wherever you want to put them.

Log levels are another confusing and unnecessary artifact of yesteryear that you no longer really need. The more you think of structured events as logs, the more tempted you may be to apply the old set of best practices. So just don’t think of them as logs at all.

How to gather and structure your data

Instead of dribbling little pebbles of log effluvia throughout your code, do this.  (There’s a minimal sketch after the steps below.  And if you’re a honeycomb user, our beelines do it all automatically for you *and* pre-populate the blobs with everything we know of your context.)

  1. Initialize an empty blob at the beginning, when the request first enters the service.
  2. Stuff any and all interesting detail about the request into that blob throughout the lifetime of the request.
    • Any unique id, any high-cardinality variable, any headers passed in, every full query, normalized query, and query execution time; every http call out to a remote service, every http execution time; any shopping cart id, first and last name, execution time — literally anything interesting, append to blob.
  3. Then, when the request is about to exit or error, write the blob off to honeycomb or another service or disk somewhere.
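A minimal sketch of that pattern, with no particular vendor SDK: `send_event` is a placeholder for “write the finished blob to honeycomb / a columnar store / wherever”, and in a real service you’d hang the blob off your request context rather than building it inline like this.

```python
import json
import time
import uuid

def send_event(event: dict) -> None:
    """Placeholder: stream the finished blob over the network to your event store."""
    print(json.dumps(event))

def handle_request(request: dict) -> dict:
    # 1. Initialize one wide event per request, per service.
    event = {
        "request_id": request.get("request_id") or str(uuid.uuid4()),
        "service": "shopping-cart",
        "user_id": request.get("user_id"),
        "endpoint": request.get("endpoint"),
    }
    start = time.monotonic()
    try:
        # 2. Append anything interesting as the request executes.
        event["cart_id"] = request.get("cart_id")
        event["db.normalized_query"] = "SELECT ... FROM carts WHERE id = ?"
        event["db.duration_ms"] = 3.2            # example timing
        response = {"status": 200}
        event["status_code"] = response["status"]
        return response
    except Exception as exc:
        event["error"] = repr(exc)
        event["status_code"] = 500
        raise
    finally:
        # 3. Write the blob out exactly once, as the request exits or errors.
        event["duration_ms"] = (time.monotonic() - start) * 1000
        send_event(event)

handle_request({"user_id": "user-42", "endpoint": "/cart", "cart_id": "c-17"})
```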

You can see immediately how this method has radically different performance implications and risks than the earlier shotgun spray approach. No more “oops i accidentally put a print line INSIDE a for loop”. The write amplification profile is compressed. Most importantly, the incremental cost of capturing more detail about the request per service is nearly zero.

And now you have the kind of structured data that you can feed into something like a columnar store, or honeycomb, and run ad hoc queries to your heart’s delight.

Distributed systems logging events best practices:

Let’s sum up.  (I’m including links to other past rants on this topic):

Just think.

No more doing multi-line regexps trying to look for the same request ID or user ID doing five suspicious things in a row.

No more regexps at all, for fuck’s sake.

No more bullshit percentiles that were computed at write time by averaging over a bunch of other averages

No more having to jump around from dashboards to logs trying to vainly eyeball correlate one spike with another. No more wondering why no two tools can agree if anything even exists or not

Just gather the detail you need to ask the questions when you need them, and store it in a single source of truth.  It’s that simple.

No need to shame people for having learned best practices that worked perfectly well for a long time.  You can either let them learn the hard way that this transformation is non-optional, or you can help them learn the easy way that it’s simply much better and easier to invest in this telemetry up front.  You seem like a nice enough chap, which is probably why you chose door 2.  (If you wanted to get tougher about it, have a few reformed folks in to tell their horror stories.  Try some ex-twitter engineers.)

The hardest part seems to be getting people to unlearn all the best practices they once learned for dealing with logs.  So just don’t call it logs anymore, if that helps. Call it “structured events”.

– charity.



Software Sprawl, The Golden Path, and Scaling Teams With Agency


Stop me if you’ve heard this one before.

The company is growing like crazy, your engineering team keeps rising to the challenge, and you are ferociously proud of them.  But some cracks are beginning to show, and frankly you’re a little worried.  You have always advocated for engineers to have broad latitude in technical decisions, including choosing languages and tools.  This autonomy and culture of ownership is part of how you have successfully hired and retained top talent despite the siren song of the Faceboogles.

But recently you saw something terrifying that you cannot unsee: your company is using all the languages, all the environments, all the databases, all the build tools.  Shit!!!  Your ops team is in full revolt and you can’t really blame them.  It’s grown into an unsupportable nightmare and something MUST be done, but you don’t know what or how — let alone how to solve it while retaining the autonomy and personal agency that you all value so highly.

I hear a version of this everywhere I’ve gone for the past year or two.  It’s crazy how often.  I’ve been meaning to write my answer up for ages, and here it (finally) is.

First of all: you aren’t alone.  This is extremely common among high-performing teams, so congratulations.  Really!

There actually seems to be a direct link between teams that give engineers lots of leeway to own their technical decisions and that team’s ability to hire and retain top-tier talent, particularly senior talent.   Everything is a tradeoff, obviously, but accepting somewhat more chaos in exchange for a stronger sense of individual ownership is usually the right one, and leads to higher-performing teams in the long run.

Second, there is actually already a well-trod path out of this hole to a better place, and it doesn’t involve sacrificing developer agency.  It’s fairly simple!  Just five short steps, which I will describe to you now.

How to build a golden path and reverse software sprawl

  1. Assemble a small council of trusted senior engineers.
  2. Task them with creating a recommended list of default components for developers to use when building out new services.  This will be your Golden Path, the path of convergence (and the path of least resistance).
  3. Tell all your engineers that going forward, the Golden Path will be fully supported by the org.  Upgrades, patches, security fixes; backups, monitoring, build pipeline; deploy tooling, artifact versioning, development environment, even tier 1 on call support.  Pave the path with gold.  Nobody HAS to use these components … but if they don’t, they’re on their own.  They will have to support it themselves.
  4. Work with team leads to draw up an umbrella plan for adopting the Golden Path for their current projects as well as older production services, as much as is reasonable or possible or desirable.  Come up with a timeline for the whole eng org to deprecate as many other tools as possible.  Allocate real engineering time to the effort.  Hell, make a party out of it!
  5. After the cutoff date (and once things have stabilized), establish a regular process for reviewing and incorporating feedback about the blessed Path and considering any proposed changes, additions or removals.

There you go.  That’s it.  Easy, right??

(It’s not easy.  I never said it was easy, I said it was simple.  👼🏼)

Your engineers are currently used to picking the best tool for the job by optimizing locally.  What data store has a data model that is easiest for them to fit to their needs?  Which language is fastest for I/O throughput?  What are they already proficient in?  What you need to do is start building your muscles for optimizing globally.  Not in isolation of other considerations, but in conjunction with them.  It will always be a balancing act between optimizing locally for the problem at hand and optimizing globally for operability and general sanity.

(Oh, incidentally, requiring an engineer to write up a proposal any time they want to use a non-standard component, and then defend their case while the council grills them in person — this will be nothing but good for them, guaran-fucking-teed.)

Let’s go into a bit more detail on each of the five points.  But quick disclaimer: this is not a prescription.  I don’t know your system, your team, your cultural land mines or technical interdependencies or anything else about your situation.  I am just telling stories here.

1. Assemble your council

Three is a good number for a council.  More than that gets unwieldy, and may have trouble reaching consensus.  Less than three and you run into SPOFs.  You never want to have a single person making unilateral decisions because a) the decision-making process will be weaker, b) it sets that person up for too much interpersonal friction, and c) it denies your other engineers the opportunity to practice making these kinds of decisions.

  • Your council members need technical breadth more than depth, and should be widely respected by engineers.
  • At least one member should have a long history with the company so they know lots of stupid little details about what’s been tried before and why it failed.
  • At least one member should be deeply versed in practical data and operability concerns.
  • They should all have enough patience and political skill to drive consensus for their decisions.  Absolutely no bombthrowers.

If you’re super lucky, you just tap the three senior technologists who immediately come to mind … your mind and everyone else’s.  If you don’t have this kind of automatic consensus, you may want to let teams or orgs nominate their own representative so they feel they have some say.

 

2.  Task the council with defining a Golden Path

Your council cannot vanish for a week and then descend from the mountain lugging lists engraved on stone tablets.  The process of discovery and consensus is what validates the result.

The process must include talking to and gathering feedback from your engineers, talking to experts outside the company, talking to teams at other companies who are farther along using that technology, coming up with detailed pro/con lists and reasons for their choices.  Maybe sometimes it includes prototyping something or investigating the technical depths … but yeah no mostly it’s just the talking.

You need your council members to have enough political skill to handle these conversations deftly, building support and driving consensus through the process.  Everybody doesn’t have to love the outcome, but it shouldn’t be a *surprise* to anyone by the end.

3.  Know where you’re going

Your council should create a detailed written plan describing which technologies are going to be supported … and a stab at what “supported” means.  (Ask the experts in each component what the best practices are for backups, versioning, dependency management, etc.)

You might start with something like this:

* Backend lang: Go 1.11           ## we will no longer be supporting backend scripting languages
* Frontend lang: ReactJS v 16.5
* Primary db: Aurora v 2.0        ## Yes, we know postgres is "better", but we have many mysql experts and 0 pg experts except the one guy who is going to complain about this.  You know who you are.
* Deploy pipeline: github -> jenkins + docker -> S3 -> custom k8s deploy tooling
* Message broker: kafka v 2.10, confluent build
* Mail: SES
* .... etc

Circulate the draft regularly for feedback, especially with eng managers.  Some team reorganization will probably be necessary to bear the new weight of your support specifications, and managers will need some lead time to wrangle this.

This is also a great time to reconceive of the way on call works at your company.  But I am not going to go into all that here.

4. Set a date, draft a plan: go!

Get approval from leadership to devote a certain amount of time to consolidating your stack and paying down a lump sum of tech debt.  It depends on your stage of decay, but a reasonable amount of time might be “25% of engineering time for three months“.  Whatever you agree to, make sure it’s enough to make the world demonstrably better for the humans who run it; you don’t want to leave them with a tire fire or you’ll blow your credibility.

The council and team leads should come up with a rough outer estimate for how long it would take to rewrite everything and move the whole stack on to the Golden Stack.  (It’s probably impossible and/or would take years, but that’s okay.)  Next, look for the quick wins or swollen, inflamed pain points.

  • If you are running two pieces of functionally similar software, like postgres and mysql, can you eliminate one?
  • If you are managing something yourself that AWS could manage for you (e.g. postfix instead of SES, or kafka instead of kinesis), can you migrate that?
  • If you are managing anything yourself that is not core to your business value, in fact, you should try to not manage it.
  • If you are running any services by hand on an AWS instance somewhere, could you try using a service?
  • If you are running your own monitoring software, etc … can you not?
  • If you have multiple versions of a piece of software, can you upgrade or consolidate on one version?

The hardest parts are always going to be the ones around migrating data or rewriting components.  Not everything is worth doing or can afford to be done in the time span of your project time, and that’s okay.

Next, brainstorm up some carrots.  Can you write templates so that anybody who writes a service using your approved library, magically gets monitoring checks without having to configure anything?  Can you write a wrapper so they get a bunch of end-to-end tests for free?  Anything you can do to delight people or save them time and effort by using your preferred components is worth considering.
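As a sketch of what such a carrot might look like: one blessed constructor that hands every new service the golden-path defaults (structured logging, a health check, a place to hang metrics) with zero per-team configuration.  Everything here is invented for illustration; your version would live in an internal library and wire up your actual stack.

```python
import logging

def create_service(name: str):
    """Hypothetical internal helper: any service built through this gets the
    golden-path defaults for free, without configuring anything."""
    logging.basicConfig(
        level=logging.INFO,
        format='{"service": "' + name + '", "level": "%(levelname)s", "msg": "%(message)s"}',
    )
    logger = logging.getLogger(name)

    routes = {"/healthz": lambda: ("200 OK", "ok")}   # standard health check, for free

    def route(path):
        def register(handler):
            routes[path] = handler
            return handler
        return register

    logger.info("initialized with golden-path defaults")
    return routes, route, logger

# A team's service then only adds its own endpoints:
routes, route, log = create_service("billing")

@route("/invoice")
def invoice():
    log.info("invoice requested")
    return ("200 OK", "invoice data")
```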

(By the way, if you don’t have any engineers devoted to internal tooling, you’re probably way overdue at this point.)

Pay down as much debt as you can, but be pragmatic: it’s better to get rid of five small things than one large thing, from a support perspective.  Your main goal is to shrink the number of types of software your team has to support, particularly databases.

Do look for ways to make it fun, like … running a competition to see who can move the most tools to AWS in a week, or throwing a hack week party, or giving dorky prizes like trophies that entitle you to put your manager on call instead of you for a day, etc.


5. Make the process sustainable

After your target date has come and gone, you probably want to hold a post mortem retrospective and do lots of listening.  (Well — first might I recommend a bubble bath and a bottle of champagne?  But then a post mortem.)

Nothing is ever fixed forever.  The company’s needs are going to expand and contract, and people will come and go, because change is the only constant.  So you need to bake some flex into your system.  How are you going to handle the need for changes to the Golden Path?  Monthly discussions?  An email list?  Quarterly meetings with a formal agenda?  I’ve seen people do all of these and more, it doesn’t really matter afaict.

Nobody likes a cabal, though, so the original council should gradually rotate out.  I recommend replacing one person at a time, one per quarter, and rotating in another senior engineer in their place.  This provides continuity while giving others a chance to learn these technical and political skills.

In the end, engineers are still free to use any tool or component at any time, just like before, only now they are solely responsible for it, which puts pressure on them not to do it unless REALLY necessary.  So if someone wants to propose adding a new tool to the default golden path, they can always add it themselves and gain some experience in it before bringing it to the council to discuss a formal place for it.

That’s all folks

See, wasn’t that simple?

(It’s never simple.)

I dearly wish more people would write up their experiences with this sort of thing in detail.  I think engineering teams are too reluctant to show their warts and struggles to the world — or maybe it’s their executives who are afraid?  Dunno.

Regardless, I think it’s actually a highly effective recruiting tool when teams aren’t afraid to share their struggles.  The companies that brag about how awesome they are are the ones who come off looking weak and fragile.  Whereas you can always trust the ones who are willing to laugh about all the ways they screwed up.  Right?

In conclusion, don’t feel like an asshole for insisting on some process here.  There should be friction around adding new components to your stack.  (Add in haste, repent at leisure, as they say.)  Anybody who argues with you probably needs to be exposed to way, way more of the support load for that software.  That’s my professional opinion.

Anyway.  You win or you die.  Good luck with your sprawl.

charity

 

 


DevOps vs SRE: delayed coverage of the dumbest war

Last week was the West Coast Velocity conference.  I had a terrific time — I think it’s the best Velocity I’ve been to yet.  I also slipped in quite late, the evening before last, to catch Gareth’s session on DevOps vs SRE.

I had to catch it, because Gareth Rushgrove (of DevOps Weekly glory) was taunting @lusis and me about it on the Internet.

And it was worth it!   Holy crap, this was such a fun barnburner of a talk, with Gareth schizophrenically arguing both for and against the key premise of the talk, which was about “Google Infrastructure for Everyone Else (GIFEE)” and whether SRE is a) the highest, noblest goal that we should all aspire towards, or b) mostly irrelevant to anyone outside the Google confines.

Which Gareth won?  Check out the slides and judge for yourself.  🙃

 

At some point in his talk, though, Gareth tossed out something like “Charity probably already has a blog post on this drafted up somewhere.”  And I suddenly remembered, “Fuck!  I DO!”  It’s been sitting in my Drafts for months, god dammit.

So this is actually a thing I dashed off back in April, after CraftConf.  Somebody asked me for my opinion on the internet — always a dangerous proposition — and I went off on a bit of a rant about the differences and similarities between DevOps and SRE, as philosophies and practices.

Time passed and I forgot about it, and then decided it was too stale.  I mean who really wants to read a rehash of someone’s tweetstorm from two months ago?

Well Gareth, apparently.

Anyway: enjoy.


SRE vs DevOps: TWO PHILOSOPHIES ENTER, BOTH ARE PHENOMENALLY SUCCESSFUL AND MUTUALLY DUBIOUS OF ONE ANOTHER


 

So in case you haven’t noticed, Google recently published a book about Site Reliability Engineering: How Google Runs Production Systems.  It contains some really terrific wisdom on how to scale both systems and orgs.  It contains chapters written by dear friends of mine.  It’s a great book, and you should buy it and read it!

It also has some really fucking obnoxious blurbs.  Things about how “ONLY GOOGLE COULD HAVE DONE THIS”, and a whiff of snobbery throughout the book as though they actually believe this (which is far worse if true).

You can’t really blame the poor blurb’ers, but you can certainly look askance at a massive systems engineering org when it seems as though they’ve never heard of DevOps, or considered how it relates to SRE practices, and may even be completely unaware of what the rest of the industry has been up to for the past 10-plus years.  It’s just a little weird.

So here, for the record, is what I said about it.

 

Google is a great company with lots of terrific engineers, but you can only say they are THE BEST at what they do if you’re defining what they do tautologically, i.e. “they are the best at making Google run.”  Etsyans are THE BEST at running Etsy, Chefs are THE BEST at building Chef, because … that’s what they do with their lives.

Context is everything here.  People who are THE BEST at Googling often flail and flame out in early startups, and vice versa.  People who are THE BEST at early-stage startup engineering are rarely as happy or impactful at large, lumbering, more bureaucratic companies like Google.  People who can operate equally well and be equally happy at startups and behemoths are fairly rare.

And large companies tend to get snobby and forget this.  They stop hiring for unique strengths and start hiring for lack of weaknesses or “Excellence in Whiteboard Coding Techniques,” and congratulate themselves a lot about being The Best.  This becomes harmful when it translates into less innovation, abysmal diversity numbers, and a slow but inexorable drift into dinosaurdom.

Everybody thinks their problems are hard, but to a seasoned engineer, most startup problems are not technically all that hard.  They’re tedious, and they are infinite, but anyone can figure this shit out.  The hard stuff is the rest of it: feverish pace, the need to reevaluate and reprioritize and reorient constantly, the total responsibility, the terror and uncertainty of trying to find product/market fit and perform ten jobs at once and personally deliver to your promises to your customers.

At a large company, most of the hardest problems are bureaucratic.  You have to come to terms with being a very tiny cog in a very large wheel where the org has a huge vested interest in literally making you as replicable and replaceable as possible.  The pace is excruciatingly slow if you’re used to a startup.  The autonomy is … well, did I mention the politics?  If you want autonomy, you have to master the politics.

 

Everyone.  Operational excellence is everyone’s job.  Dude, if you have a candidate come in and they’re a jerk to your office manager or your cleaning person, don’t fucking hire that person because having jerks on  your team is an operational risk (not to mention, you know, like moral issues and stuff).

But the more engineering-focused your role is, the more direct your impact will be on operational outcomes.

As a software engineer, developing strong ops chops makes you powerful.  It makes you better at debugging and instrumentation, building resiliency and observability into your own systems and interdependent systems, and building systems that other people can come along and understand and maintain long after you’re gone.

As an operations engineer, those skills are already your bread and butter.  You can increase your power in other ways, like by leveling up at software engineering skills like test coverage and automation, or DBA stuff like query optimization and storage engine internals, or by helping the other teams around you level up on their skills (communication and persuasion are chronically underrecognized as core operations engineering skills).

This doesn’t mean that everyone can or should be able to do everything.  (I can’t even SAY the words “full stack engineer” without rolling my eyes.)  Generalists are awesome!  But past a certain inflection point, specialization is the only way an org can scale.

It’s the only way you make room for those engineering archetypes who only want to dive deep, or who really really love refactoring, or who will save the world then disappear for weeks.  Those engineers can be incredibly valuable as part of a team … but they are most valuable in a large org where you have enough generalists to keep the oars rowing along in the meantime.

So, back to Google.  They’ve done, ahem, rather  well for themselves.  Made shitbuckets of money, pushed the boundaries of tech, service hardly ever goes down.  They have operational demands that most of us never have seen and never will, and their engineers are definitely to be applauded for doing a lot of hard technical and cultural labor to get there.

So why did this SRE book ruffle a few feathers?

Mostly because it comes off a little tone deaf in places.  I’m not personally pissed off by the google SRE book, actually, just a little bemused at how legitimately unaware they seem to be about … anything else that the industry has been doing over the past 10 years, in terms of cultural transformation, turning sysadmins into better engineers, sharing on-call rotations, developing processes around empathy and cross-functionality, engineering best practices, etc.


If you try and just apply Google SRE principles to your own org according to their prescriptive model, you’re gonna be in for a really, really bad time.

However, it happens that Jen Davis and Katherine Daniels just published a book called Effective DevOps, which covers a lot of the same ground with a much more varied and inclusive approach.  And one of the things they return to over and over again is the power of context, and how one-size-fits-all solutions simply don’t exist, just like unisex OSFA t-shirts are a dirty fucking lie.

Google insularity is … a thing.  On the one hand it’s great that they’re opening up a bit!  On the other hand it’s a little bit like when somebody barges onto a mailing list and starts spouting without skimming any of the archives.  And don’t even get me started on what happens when you hire long, long-term ex-Googlers back into the real world.

So, so many of us have had this experience of hiring ex-Googlers who automatically assume that the way Google does a thing is CORRECT, not just contextually appropriate.   Not just right for Google, but right for everyone, always.  Which is just obviously untrue.  But the reassimilation process can be quite long and exhausting when the Kool-Aid is so strong.

Because yeah, this is a conversation and a transformation that the industry has been having for a long time now.  Compared with the SRE manifesto, the DevOps philosophy is much more crowd-sourced, more flexible, and more adaptable to organizations at all stages of development, with all different requirements and key business differentiators, because it has benefited from loud, mouthy contributors who haven’t all been working in the same bubble all along.

And it’s like Google isn’t even aware this was happening, which is weird.

Orrrrrr, maybe I’m just a wee bit annoyed that I’ve been drawn into this position of having to defend “DevOps”, after many excellent years spent being grumpy about the word and the 10000010101 ways it is used and abused.

(Tell me again about your “DevOps Engineering Team”, I dare you.)

P.S. I highly encourage you to go read the epic hours-long rant by @matthiasr that kicked off the whole thing.  Some of which I definitely endorse and some of which I don’t, but I think we could go drink whiskey and yell about this for a week or two easy breezy  <3

Anyway what the fuck do I know, I’ve never worked in the Google lair, so maybe I am just under-equipped to grasp the true glory, majesty and superiority of their achievements over us all.

Or maybe they should go read Katherine and Jen’s book and interact with the “UnGoogled” once in a while.  ☺️


 


Operational Best Practices #serverless

This post is part two of my recap of last week’s terrific Serverless conference.  If you feel like getting bitchy with me about what serverless means or #NoOps or whatever, please refer back to the prequel post, where I talked about operations engineering in the modern world.

*Then* you can get bitchy with me.  (xoxoxxooxo)

The title of my talk was “Operational Best Practices #serverless”.

The theme of my talk was basically: what should software engineers know and care about when it comes to operations in a world where we are outsourcing more and more core functionality?

If you care about running a quality service or product, or providing your customers with a reasonable level of service, you have to care about operational concerns like design, resiliency, instrumentation and debuggability.  No matter how many abstractions there are between you and the bare metal.

If you chose a provider, you do not get to just point your finger at them in the post mortem and say it’s their fault.  You chose them, it’s on you.  It’s tacky to blame the software or the service, and besides your customers don’t give a shit whose “fault” it is.

So given an infinite number of things to care about, where do you start?

What is your mission, and what are your differentiators?

The first question must always be: what is your mission?  Your mission is not writing software.  Your mission is delivering whatever it is your customers are paying you for, and you use software to get there.  (Code is kind of a liability so you should write as little of it as necessary.  hey!! sounds like a good argument for #serverless!)

Second: what are your core differentiators?  What are the things that you are doing that are unique, and difficult to replicate, or the things where you have to actually be world class experts in those things?

Those are the things that you will have the hardest time outsourcing, or that you should think about very carefully before outsourcing.


Facts

You can outsource labor, but you can’t outsource caring.  And nobody but you is in the position to think about your core differentiators and your product in a holistic way.

If you’re a typical early startup, you’re probably using somewhere between 5 and 20 SaaS products to get rid of some of the crap work and offload it to dedicated teams who can do it better than you can, much more cheaply, so you are freed up to work on your core value proposition.

GOOD.

But you still have to think about things like reliability, your security model, your persistent storage models, your query performance, how all these lovely services talk to each other, how you’re going to debug them, how you’re going to repro when things go wrong, etc.  You still own these things, even if you don’t run them.

For example, take AWS Lambda.  It’s a pretty great service on many dimensions.  It’s an early version of the future.  It is also INCREDIBLY irritating and challenging to debug in a practically infinite number of insanity-inducing ways.

** Important side note — I’m talking about actual production systems.  Parse, Heroku, Lambda, etc are GREAT for prototyping and can take you a long, long way.  Early stage startups SHOULD optimize for agility and rapid developer iteration, not reliability.  Thx to @joeemison for reminding me that i left that out of the recap.


Focus on the critical path

Your users don’t care if your internal jenkins builds are broken.  They don’t care about a whole lot of things that you have to care about … eventually.  They do care a lot if your product isn’t actually functional.  Which means you have to think through the behavioral and failure characteristics of the providers you’re relying on in any user visible fashion.

Ask lots of questions if you can.  (AWS often won’t tell you much, but smaller providers will.)  Find out as much as you can about their cotenancy model (shared hardware or isolation?), their typical performance variance (run your own tests, don’t trust their claims), and the underlying storage systems.
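“Run your own tests” can be as crude as hammering the call path you actually depend on and looking at the spread, not the average.  A quick sketch (the URL is a placeholder; point it at your real dependency):

```python
import statistics
import time
import urllib.request

def measure(url: str, samples: int = 50) -> None:
    """Time repeated requests and report the spread; it's the p95 and the
    worst case that will bite you in production, not the mean."""
    timings_ms = []
    for _ in range(samples):
        start = time.monotonic()
        with urllib.request.urlopen(url, timeout=10) as resp:
            resp.read()
        timings_ms.append((time.monotonic() - start) * 1000)
    timings_ms.sort()
    p95 = timings_ms[int(0.95 * (samples - 1))]
    print(f"min {timings_ms[0]:.1f}ms  median {statistics.median(timings_ms):.1f}ms  "
          f"p95 {p95:.1f}ms  max {timings_ms[-1]:.1f}ms")

measure("https://example.com/")   # placeholder endpoint
```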

Think about how you can bake in resiliency from the user’s perspective, that doesn’t rely on provider guarantees.  If you’re on mobile, can you provide a reasonable offline experience?  Parse, for example, did a lot of magic here in the APIs, where it would back off and retry saves if there were any errors.
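That kind of client-side resiliency often boils down to retrying transient failures with exponential backoff and jitter instead of surfacing the very first error to the user.  A sketch (the `save` callable is hypothetical; the pattern is the point):

```python
import random
import time

def save_with_retry(save, payload, attempts: int = 5):
    """Call save(payload), retrying transient failures with exponential
    backoff plus jitter so a brief provider blip doesn't lose the user's data."""
    for attempt in range(attempts):
        try:
            return save(payload)
        except ConnectionError:
            if attempt == attempts - 1:
                raise                      # out of retries, let the caller decide
            delay = min(2 ** attempt, 30) * random.uniform(0.5, 1.5)
            time.sleep(delay)
```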

Can you fail over to another provider if one is down?  Is it even worth it at your company’s stage of maturity and engineering resources to invest in this?

How willing are you to be locked into a vendor or provider, and what is the story if you find yourself forced to switch?  Or if that service goes away, as so many, many, many of them have done and will do.  (RIP, parse.com.)


Tradeoffs

Listen, outsourcing is awesome.  I do it as much as I can.  I’m literally helping build a service that provides outsourced metrics, I believe in this version of the future!  It’s basically the latest iteration of capitalism in a nutshell: increased complexity –> increased specialization –> you pay other people to do the job better than you –> everybody wins.

But there are tradeoffs, so let’s be real.

The service, if it is smart, will put strong constraints on how you are able to use it, so they are more likely to deliver on their reliability goals.  When users have flexibility and options it creates chaos and unreliability.  If the platform has to choose between your happiness vs thousands of other customers’ happiness, they will choose the many over the one every time — as they should.

Limits may mysteriously change or be invented as they are discovered, esp with fledgling services.  You may be desperate for a particular feature, but you can’t build it.  (This is why I went for Kafka over Kinesis.)

You need to think way more carefully and more deeply about visibility and introspection up front than you would if you were running your own services, because you have no ability to log in and use strace or gdb or tail a logfile or run any system profiling commands when things go dark.

In the best case, you’re giving up some control and quality in exchange for experts doing the work better than you could for cheaper (e.g. i’m never running a fucking physical data center again, jesus.  EC24lyfe).  In a common worse case, it’s less reliable than what you would build AND it’s also opaque AND you can’t tell if it’s down for you or for everyone because frankly it’s just massively harder to build a service that works for thousands/millions of use cases than for any one of them individually.


Stateful services

Ohhhh and let’s just briefly talk about state.

The serverless utopia mostly ignores the problems of stateful services.  If pressed they will usually say DynamoDB, or Firebase, or RDS or Aurora or something.

This is a big, huge, deep, wide lake of crap to wade in to so all I’m going to say is that there is no such thing as having the luxury of not having to understand how your storage systems work.  Queries will get slow, and you’ll need to be able to figure out why and fix them.  You’ll hit scaling cliffs where suddenly a perfectly-usable app just starts timing everything out because of that extra second of latency coming from …

¯\_(ツ)_/¯

The hardware underlying your instance will degrade (there’s a server somewhere under all those abstractions, don’t forget).  The provider will have mysterious failures.  They will be better than you, probably, but less inclined to give you satisfactory progress updates because there are hundreds or thousands or millions of you all clamoring.

The more you understand about your storage system (and the more you stay in the lane of how it was intended to be used), the happier you’ll be.


In conclusion

These trends are both inevitable and, for the most part, very good news for everyone.

Operations engineering is becoming a more fascinating and specialized skill set.  The best engineers are flocking to solve category problems — instead of building the same system at company after company, they are building SaaS solutions to solve it for the internet at large.  Just look at the massive explosion in operational software offerings over the past 5-6 years.

This means that the era of the in-house dedicated ops team, which serves as an absorbent buffer for all the pain of software development, is mostly on its way out the door.  (And good riddance.)

People are waking up to the fact that software quality improves when feedback loops are tighter for software engineers, which means being on call and owning services end to end.  The center of gravity is shifting towards engineering teams owning the services they built.

This is awesome!  You get to rent engineers from Google, AWS, Pagerduty, Pingdom, Heroku, etc for much cheaper than if you hired them in-house — if you could even get them, which you probably can’t because talent is scarce.


But the flip side of this is that application engineers need to get better at thinking in traditionally operations-oriented ways about reliability, architecture, instrumentation, visibility, security, and storage.  Figure out what your core differentiators are, and own the shit out of those.

Nobody but you can care about your mission as much as you can.  Own it, do it.  Have fun.

 
