Friday Deploy Freezes Are Exactly Like Murdering Puppies

VOICEOVER: “Previously, on twitter …”

So, that happened.

I hadn’t seen anyone say something like this in quite a while.  I remember saying things like this myself as recently as, oh, 2016, but I thought the zeitgeist had moved on to continuous delivery.

Which is not to say that Friday freezes don’t happen anymore, or even that they shouldn’t; I just thought that this was no longer seen as a badge of responsibility and honor, but rather a source of mild embarrassment.  (Much like the fact that you still don’t automatically restore your db backups and verify them every night.  Do you.)

So I responded with an equally hyperbolic and indefensible claim:

Now obviously, OBVIOUSLY, reassigning all your developer cycles is probably a terrible idea.  You don’t get 100x parallel efficiency if you put 100 developers on a single problem.  So I thought it was clear that this was said somewhat tongue in cheek, serious-but-not-really.  I was wrong there too.

So let me explain.

There’s nothing morally “wrong” with Friday freezes.  But it is a costly and cumbersome bandage for a problem that you would be better served to address directly.  And if your stated goal is to protect people’s off hours, this strategy is likely to sabotage that goal and cause them to waste far more time and get woken up much more often, and it stunts your engineers’ technical development on top of that.

Fear is the mind-killer.

Fear of deploys is the ultimate technical debt.  How much time does your company waste, between engineers:

  • waiting until it is “safe” to deploy,
  • batching up changes into bigger changes that are decidedly unsafe to deploy,
  • debugging broken deploys that had many changes batched into them,
  • waiting nervously to get paged after a deploy goes out,
  • figuring out if now is a good time to deploy or not,
  • cleaning up terrible deploy-related catastrophuckes

Anxiety related to deploys is the single largest source of technical debt in many, many orgs.  Technical debt, lest we forget, is not the same as “bad code”.  Tech debt hurts your people.

Saying “don’t push to production” is a code smell.  Hearing it once a month at unpredictable intervals is concerning.  Hearing it EVERY WEEK for an ENTIRE DAY OF THE WEEK should be a heartstopper alarm.  If you’ve been living under this policy you may be numb to its horror, but just because you’re used to hearing it doesn’t make it any less noxious.

If you’re used to hearing it and saying it on a weekly basis, you are afraid of your deploys and you should fix that.

If you are a software company, shipping code is your heartbeat.  Shipping code should be as reliable and sturdy and fast and unremarkable as possible, because this is the drumbeat by which value gets delivered to your org.

Deploys are the heartbeat of your company.

Every time your production pipeline stops, it is a heart attack.  It should not be ok to go around nonchalantly telling people to halt the lifeblood of their systems based on something as pedestrian as the day of the week.

Why are you afraid to push to prod?  Usually it boils down to one or more factors:

  • your deploys frequently break, and require manual intervention just to get to a good state
  • your test coverage is not good, your monitoring checks are not good, so you rely on users to report problems back to you, and this trickles in over days
  • recovering from deploys gone bad can regularly grind everything to a halt for hours or days, so you don’t want to even embark on a deploy without a full working day ahead of you
  • your deploys are painfully slow, and take hours to run tests and go live.

These are pretty darn good reasons.  If this is the state you are in, I totally get why you don’t want to deploy on Fridays.  So what are you doing to actively fix those states?  How long do you think these emergency controls will be in effect?

The answers of “nothing” and “forever” are unacceptable.  These are eminently fixable problems, and the drag they create on your engineering team and its ability to execute is the equivalent of a five-alarm fire.

Fix. That.  Take some cycles off product and fix your fucking deploy pipeline.

If you’ve been paying attention to the DORA report or Accelerate, you know that the way you address the problem of flaky deploys is NOT by slowing down or adding roadblocks and friction, but by shipping more QUICKLY.

Science says: ship fast, ship often.

Deploy on every commit.  Smaller, coherent changesets transform into debuggable, understandable deploys.  If we’ve learned anything from recent research, it’s that velocity of deploys and lowered error rates are not in tension with each other; they actually reinforce each other.  When one gets better, the other does too.

So by slowing down or batching up or pausing your deploys, you are materially contributing to the worsening of your own overall state.

If you block devs from merging on Fridays, then you are sacrificing a fifth of your velocity and overall output.  That’s a lot of fucking output.

If you do not block merges on Fridays, and only block deploys, you are queueing up a bunch of changes to all get shipped days later, long after the engineers wrote the code and have forgotten half of the context.  Any problems you encounter will be MUCH harder to debug on Monday in a muddled blob of changes than they would have been just shipping crisply, one at a time on Friday.  Is it worth sacrificing your entire Monday?  Monday-Tuesday?  Monday-Tuesday-Wednesday?

Good judgment matters more than rules.

I am not saying that you should make a habit of shipping a large feature at 4:55 pm on Friday and then sauntering out the door at 5.  For fuck’s sake.  Every engineer needs to learn and practice good technical judgment around deploy hygiene.  Like:

  • Don’t ship before you walk out the door on *any* day.
  • Don’t ship big, gnarly features right before the weekend, if you aren’t going to be around to watch them.
  • Instrument your code, and go and LOOK at the damn thing once it’s live.
  • Use feature flags and other tools that separate turning on code paths from deploys.

But you don’t need rules for this; in fact, rules actually inhibit the development of good judgment!

Most deploy-related problems are readily obvious, if the person who has the context for the change in their head goes and looks at it.

But if you aren’t looking for them, then sure — you probably won’t find out until user reports start to trickle in over the next few days.

So go and LOOK.

Stop shipping blind.  Actually LOOK at what you ship.

I mean, if it takes 48 hours for a bug to show up, then maybe you better freeze deploys on Thursdays too, just to be safe!  🙄

I get why this seems obvious and tempting.  The “safety” of nodeploy Friday is realized immediately, while the costs are felt later.  They’re felt when you lose Monday (and Tuesday) to debugging the big blob deploy.  Or they get amortized out over time.  Or you experience them as sluggish ship rates and a general culture of fear and avoidance, or learned helplessness, and the broad acceptance of fucked up situations as “normal”.

But if recovering from deploys is long and painful and hard, then you should fix that.  If you don’t tend to detect reliability events until long after the event, you should fix that.  If people are regularly getting paged on Saturdays and Sundays, they are probably getting paged throughout the night, too.  You should fix that.

On call paging events should be extremely rare.  There’s no excuse for on call being something that significantly impacts a person’s life on the regular.  None.

I’m not saying that every place is perfect, or that every company can run like a tech startup.  I am saying that deploy tooling is systematically underinvested in, and we abuse people far too much by paging them incessantly and running them ragged, because we don’t actually believe it can be any better.

It can.  If you work towards it.

Devote some real engineering hours to your deploy pipeline, and some real creativity to your processes, and someday you too can lift the Friday ban on deploys and relieve your oncall from burnout and increase your overall velocity and productivity.

On virtue signaling

Finally, I heard from an alarming number of people who admitted that Friday deploy bans were useless or counterproductive, but they supported them anyway as a purely symbolic gesture to show that they supported work/life balance.

This makes me really sad.  I’m … glad they want to support work/life balance, but surely we can come up with some other gestures that don’t work directly counter to that goal.

Recovery: building a healthy deploy culture

Ways to begin recovering from a toxic deploy culture:

  • Have a deploy philosophy, make sure everybody knows what it is.  Be consistent.
  • Build and deploy on every set of committed changes.  Do not batch up multiple people’s commits into a deploy.
  • Train every engineer so they can run their own deploys, if they aren’t fully automated.  Make every engineer responsible for their own deploys.
  • (Work towards fully automated deploys.)
  • Every deploy should be owned by the developer who made the changes that are rolling out.  Page the person who committed the change that triggered the deploy, not whoever is oncall.
  • Set expectations around what “ownership” means.  Provide observability tooling so they can break down by build id and compare the last known stable deploy with the one rolling out.
  • Never accept a diff without an answer to the question: “how will you know when this code breaks?  how will you know if the deploy is not behaving as planned?”  Instrument every commit so you can answer this question in production.
  • Shipping software and running tests should be fast.  Super fast.  Minutes, tops.
  • It should be muscle memory for every developer to check up on their deploy and see if it is behaving as expected, and if anything else looks “weird”.
  • Practice good deploy hygiene using feature flags.  Decouple deploys from feature releases.  Empower support and other teams to flip flags without involving engineers.
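That last bullet deserves a concrete illustration.  Here’s a minimal sketch in Python of what decoupling deploys from releases can look like; the flag store and function names are hypothetical stand-ins, since in practice you’d back this with a config service or vendor SDK so that support can flip flags without involving engineers:

```python
# Minimal sketch of a feature flag check (hypothetical names throughout).
# In real life FLAGS would be backed by a config service or vendor SDK
# that non-engineers can flip at runtime, not a hardcoded dict.

FLAGS = {"new_checkout_flow": False}  # code ships dark; flip to True to release

def is_enabled(flag: str) -> bool:
    """Return whether a feature flag is currently turned on."""
    return FLAGS.get(flag, False)

def legacy_checkout(cart_id: str) -> str:
    return f"legacy checkout for cart {cart_id}"

def new_checkout(cart_id: str) -> str:
    return f"new checkout for cart {cart_id}"

def checkout(cart_id: str) -> str:
    # The new code path is already deployed to production, but stays
    # dormant until someone flips the flag.  Turning it on (and rolling
    # it back) takes seconds and never touches the deploy pipeline.
    if is_enabled("new_checkout_flow"):
        return new_checkout(cart_id)
    return legacy_checkout(cart_id)
```

The deploy puts the code on the servers; flipping the flag is the release.  If the new path misbehaves, you flip it back off in seconds instead of rolling back a deploy.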

Each deploy should be owned by the developer who made the code changes.  But your deploy pipeline needs to have a team that owns it too.  I recommend putting your most experienced, senior developers on this problem to signal its high value.

You can find more tips for boring deploys in my piece on why shipping software should not be scary.

Good teams ship often.

Ultimately, I am not dogmatic about Friday deploys.  Truly, I’m not.  If that’s the only lever you have to protect your time, use it.  But call it and treat it like the hack it is.  It’s a gross workaround, not an ideal state.

Don’t let your people settle into the idea that it’s some kind of moral stance instead of a butt-ugly hack.  Because if you do you will never, ever get rid of it.

Remember: a team’s maturity and efficiency can be represented by how long it takes to get their shit into users’ hands after they write it.  Ship it fast, while it’s still fresh in your developers’ heads.  Ship one change set at a time, so you can swiftly debug and revert them.  I promise your lives will be so much better.  Every step helps.  ❤

charity.


On pain, careers, and doing things the hard way.

Part 1

Seven years ago I was working on backend infra for mobile apps at Parse, resenting MongoDB and its accursed single write lock per replica with all my dirty, blackened soul.  That’s when Miles Ward asked me to give a customer testimonial for MongoDB at AWS re:Invent.

It was my first time EVER speaking in public, and I had never been more terrified.  I have always been a writer, not a talker, and I was pathologically afraid of speaking in public, or even having groups of people look at me.  I scripted every word, memorized my lines, even printed it all out just in case my laptop didn’t work.  I had nightmares every night.  For three months I woke up every night in a cold sweat, shaking.

And I bombed, completely and utterly.  The laptop DIDN’T work, my limbs and tongue froze, I was shaking so badly I could hardly read my printout, and after I rushed through the last sentences I turned and stumbled robotically off the stage, fully unaware that people were raising their hands and asking questions.  I even tripped over the microphone cord in my haste to escape the stage.

Afterwards I burned with unpleasantries — fear, anger, humiliation, rage at being so bad at anything.  It was excruciating.  For the next two years I sought out every opportunity I could get to talk at a meetup, conference, anything.  I got a prescription for propranolol to help manage the physical symptoms of panic.   I gave 17 more talks that year, spending most nights and weekends working on them or rehearsing, and 21 the year after that.  I hated every second of it.

I hated it, but I burned up my fear and aversion as fuel.  Until around 18 months later, when I realized that I no longer had nightmares and had forgotten to pack my meds for a conference.  I brute forced my way through to the other side, and public speaking became just an ordinary skill or a tool like any other.

Part 2

I was on a podcast last week where the topic was career journeys.  They asked me what piece of career advice I would like to give to people.  I promptly said that following your bliss is nice, but I think it’s important to learn to lean into pain.

“Pain is nature’s teacher,” I said.  Feedback loops train us every day, mostly unconsciously.  We feel aversion to pain, and we enjoy dopamine hits, and out of those and other brain chemicals our habits are made.  All it takes is a little tolerance for discomfort and some conscious tweaking of those feedback loops, and you can train yourself to achieve big things without even really trying.

But then I hesitated.  Yes, leaning in to pain has served me well in my career.  But that is not the whole story; it leaves off some important truths.  It has also hurt me and held me back.

Misery is not a virtue.  Pain is awful.  That’s why it’s so powerful and primal.  It’s a pre-conscious mechanism, an acute response that kicks in long before your conscious mind.  Even just the suggestion of pain (or memory of past trauma) will train you to twist and contort around to avoid it.

When you are in pain, your horizons shrink.  Your vision narrows, you curl inward. You have to expend enormous amounts of energy just moving forward through the day inch by inch.

Everything is hard when you’re in pain.  Your creative brain shuts down.  Basic life functions become impossible tests.  You have to spend so much time compensating for your reduced capacity that learning new things is nearly impossible.  You can’t pick up on subtle signals when your nerves are screaming in agony.  And you grow numb over time, as they die off from sheer exhaustion.

Part 3

I am no longer the CEO of honeycomb.

I never wanted to be CEO; I always fiercely wanted a technical role.  But it was a matter of company survival, and I did my best.  I wasn’t a great CEO, although we did pretty well at the things I am good at or care about.  But I couldn’t expand past them.

I hated every second of it.  I cried every single day for the first year and a half.  I tried to will myself into loving a role I couldn’t stand, tried to brute force my way to success like I always do.  It didn’t get better.  My ability to be present and curious and expansive withered.  I got numb.

Turns out not every problem can be powered through on a high pain tolerance.  The collateral damage starts to rack up.  Sometimes the only way to succeed is to redefine success.

Pain is a terrific teacher, but pain is an acute response.  Chronic pain will hijack your reward pathways, your perspective, your relationships, and every other productive system and leave them stunted.

Leaning in to pain can be powerful if you have the agency and ability to change it, or practice it to mastery, or even just adapt your own emotional responses to it.  If you don’t or you can’t, leaning in to pain will kill you.  Having the wisdom to know the difference is everything.  Or so I’m learning.

From here on out I’ll be in the CTO seat.  I don’t know what that even means yet, but I guess we’ll find out.  Stay tuned.  ❤

charity


Outsource Your O11y: Now Roll It Out And Keep Them Happy (part 3/3)

This is part three of a three-part series of guest posts:

  1. How To Be A Champion, on how to choose a third-party vendor and champion them successfully to your security team.  (George Chamales)
  2. Get Aligned With Security, how to work with your security team to find the best possible outcome for all sides (Lilly Ryan)
  3. Now Roll It Out And Keep Them Happy, on how to operationalize your service by rolling out the integration and maintaining it — and the relationship with your security team — over the long run (Andy Isaacson)

All this pain will someday be worth it.  🙏❤️  charity + friends


“Now Roll It Out And Keep Them Happy”

This is the third in a series of blog posts; previously we analyzed the security challenges of using a third party service, and we worked together with the security team to build empathy to deliver the project.  You might want to read those first, since we are going to build on a lot of the ideas there to ship and maintain this integration.

Ready for launch

You’ve convinced the security team and other stakeholders, you’ve gotten the integration running, you’re getting promising results from dev-test or staging environments… now it’s time to move from proof-of-concept to full implementation.  Depending on your situation this might be a transition from staging to production, or it might mean increasing a feature flipper flag from 5% to 100%, or it might mean increasing coverage of an integration from one API endpoint to cover your entire developer footprint.

Taking into account Murphy’s Law, we expect that some things will go wrong during the rollout.  Perhaps while expanding coverage, a developer realizes that the schema designed to handle the app’s event mechanism can’t represent a scenario, requiring a redesign or a hacky workaround.  Or perhaps the metrics dashboard shows elevated error rates from the API frontend, and while there’s no smoking gun, the ops oncall decides to roll back the integration Just In Case it’s causing the incident.

This gives us another chance to practice empathy — while it’s easy, wearing the champion hat, to dismiss any issues found by looking for someone to blame, ultimately this poisons trust within your organization and will hamper success.  It’s more effective, in the long run (and often even in the short run), to find common ground with your peers in other disciplines and teams, and work through to solutions that satisfy everybody.

Keeping the lights on

In all likelihood, as the integration succeeds, the team will rapidly develop experts and expertise, as well as idiomatic ways to use the product.  Let the experts surprise you; folks you might not expect can step up when given a chance.  Expertise flourishes when given guidance and goals; as the team becomes comfortable with the integration, explicitly recognize a leader or point person for each vendor relationship.  Having one person explicitly responsible for a relationship lets them pay attention to vendor emails and updates, and avoids the tragedy of the “but I thought *you* were” commons.  This Integration Lead is also a center of knowledge transfer for your organization — they won’t know everything or be able to help every user come up to speed, but they can help empower the local power users in each team to ramp up their teams on the integration.

As comfort grows you will start to consider ways to change your usage, for example growing into new kinds of data.  This is a good time to revisit that security checklist — does the change increase PII exposure to your vendor?  Would the new data lead to additional requirements such as per-field encryption?  Don’t let these security concerns block you from gaining valuable insight using the new tool, but do take the chance to talk it over with your security experts as appropriate.

Throughout this organic growth, the Integration Lead remains core to managing your changing profile of usage of the vendor they shepherd; as new categories of data are added to the integration, the Lead has responsibility to ensure that the vendor relationship and risk profile are well matched to the needs that the new usage (and presumably, business value) is placing on the relationship.

Documenting the Integration Lead role and responsibilities is critical.  The team should know when to check in, and writing it down helps it happen.  When new code has a security implication, or a new use case potentially amplifies the cost of an integration, bringing the domain expert in will avoid unhappy surprises.  Knowing how to find out who to bring in, and when to bring them in, will keep your team getting the right eyes on their changes.

Security threats and other challenges change over time, too.  Collaborating with your security team so that they know what systems are in use helps your team take note of new information that is relevant to your business. A simple example is noting when your vendors publish a breach announcement, but more complex examples happen too — your vendor transitions cloud providers from AWS to Azure and the security team gets an alert about unexpected data flows from your production cluster; with transparency and trust such events become part of a routine process rather than an emergency.

It’s all operational

Monitoring and alerting is a fact of operations life, and this has to include vendor integrations (even when the vendor integration is a monitoring product).  All of your operations best practices are needed here — keep your alerts clean and actionable so that you don’t develop pager fatigue, and monitor performance of the integration so that you don’t get blindsided by a creeping latency monster in your APIs.

Authentication and authorization are changing as the threat landscape evolves and industry moves from SMS verification codes to U2F/WebAuthn.  Does your vendor support your SSO integration?  If they can’t support the same SSO that you use everywhere else and can’t add it — or worse, look confused when you mention SSO — that’s probably a sign you should consider a different vendor.

A beautiful sunset

Have a plan beforehand for what needs to be done should you stop using the service.  Got any mobile apps that depend on APIs that will go away or start returning permission errors?  Be sure to test these scenarios ahead of time.

What happens at contract termination to data stored on the service?  Do you need to explicitly delete data when ceasing use?

Do you need to remove integrations from your systems before ending the commercial relationship, or can the technical shutdown and business shutdown run in parallel?

In all likelihood these are contingency plans that will never be needed, and they don’t need to be fully fleshed out to start, but a little bit of forethought can avoid unpleasant surprises.

Year after year

Industry best practice and common sense dictate that you should revisit the security questionnaire annually (if not more frequently). Use this chance to take stock of the last year and check in — are you getting value from the service?  What has changed in your business needs and the competitive landscape? 

It’s entirely possible that a new year brings new challenges, which could make your current vendor even more valuable (time to negotiate a better contract rate!) or could mean you’d do better with a competing service.  Has the vendor gone through any major changes?  They might have new offerings that suit your needs well, or they may have pivoted away from the features you need. 

Check in with your friends on the security team as well; standards evolve, and last year’s sufficient solution might not be good enough for new requirements.

 

Andy thinks out loud about security, society, and the problems with computers on Twitter.


 

❤️ Thanks so much for reading, folks.  Please feel free to drop any complaints, comments, or additional tips to us in the comments, or direct them to me on twitter.

Have fun!  Stay (a little bit) Paranoid!!

— charity


Outsource Your O11y: Get Aligned With Security (part 2/3)

This is part two of a three-part series of guest posts:

  1. How To Be A Champion, on how to choose a third-party vendor and champion them successfully to your security team.  (George Chamales)
  2. Get Aligned With Security, how to work with your security team to find the best possible outcome for all sides (Lilly Ryan)
  3. Now Roll It Out And Keep Them Happy, on how to operationalize your service by rolling out the integration and maintaining it — and the relationship with your security team — over the long run (Andy Isaacson)

All this pain will someday be worth it.  🙏❤️  charity + friends


“Get Aligned With Security”

by Lilly Ryan

If your team has decided on a third-party service to help you gather data and debug product issues, how do you convince an often overeager internal security team to help you adopt it?

When this service is something that provides a pathway for developers to access production data, as analytics tools often do, making the case for access to that data can screech to a halt at the mention of the word “production”. Progressing past that point will take time, empathy, and consideration.

I have been on both sides of the “adopting a new service” fence: as a developer hoping to introduce something new and useful to our stack, and now as a security professional who spends her days trying to bust holes in other people’s setups. I understand both sides of the sometimes-conflicting needs to both ship software and to keep systems safe.  

This guide has advice to help you solve the immediate problem of choosing and deploying a third-party service with the approval of your security team.  But it also has advice for how to strengthen the working relationship between your security and development teams over the longer term. No two companies are the same, so please adapt these ideas to fit your circumstances.

Understanding the security mindset

The biggest problems in technology are never really about technology, but about people. Seeing your security team as people and understanding where they are coming from will help you to establish empathy with them so that both of you want to help each other get what you want, not block each other.

First, understand where your security team is coming from. Development teams need to build features, improve the product, understand and ship good code. Security teams need to make sure you don’t end up on the cover of the NYT for data breaches, that your business isn’t halted by ransomware, and that you’re not building your product on a vulnerable stack.

This can be an unfamiliar frame of mind for developers.  Software development tends to attract positive-minded people who love creating things and are excited about the possibilities of new technology. Software security tends to attract negative thinkers who are skilled at finding all the flaws in a system.  These are very different mentalities, and the people who occupy them tend to have very different assumptions, vocabularies, and worldviews.   

But if you and your security team can’t share the same worldview, it will be hard to trust each other and come to agreement.  This is where practicing empathy can be helpful.

Before approaching your security team with your request to approve a new vendor, you may want to run some practice exercises for putting yourselves in their shoes and forcing yourselves to deliberately cultivate a negative thinking mindset to experience how they may react — not just in terms of the objective risk to the business, or the compliance headaches it might cause, but also what arguments might resonate with them and what emotional reactions they might have.

My favourite exercise for getting teams to think negatively is what I call the Land Astronaut approach.

The “Land Astronaut” Game

Imagine you are an astronaut on the International Space Station. Literally everything you do in space has death as a highly possible outcome. So astronauts spend a lot of time analysing, re-enacting, and optimizing their reactions to events, until it becomes muscle memory. By expecting and training for failure, astronauts use negative thinking to anticipate and mitigate flaws before they happen. It makes their chances of survival greater and their people ready for any crisis.

Your project may not be as high-stakes as a space mission, and your feet will most likely remain on the ground for the duration of your work, but you can bet your security team is regularly indulging in worst-case astronaut-type thinking. You and your team should try it, too.

The Game:

Pick a service for you and your team to game out.  Schedule an hour, book a room with a whiteboard, put on your Land Astronaut helmets.  Then tell your team to spend half an hour brainstorming about all the terrible things that can happen to that service, or to the rest of your stack when that service is introduced.  Negative thoughts only!

Start brainstorming together. Start out by being as outlandish as possible (what happens if their data centre is suddenly overrun by a stampede of elephants?). Eventually you will tire of the extreme worst case scenarios and come to consider more realistic outcomes — some of which you may not have thought of outside the structure of the activity.

After half an hour, or whenever you feel like you’re all done brainstorming, take off your Land Astronaut helmets, sift out the most plausible of the worst case scenarios, and try to come up with answers or strategies that will help you counteract them.  Which risks are plausible enough that you should mitigate them?  Which are you prepared to gamble on never happening?  How will this risk calculus change as your company grows and takes on more exposure?

Doing this with your team will allow you all to practice the negative thinking mindset together and get a feel for how your colleagues in the security team might approach this request. (While this may seem similar to threat modelling exercises you might have done in the past, the focus here is on learning to adopt a security mindset and gaining empathy for this thought process, rather than running through a technical checklist of common areas of concern.)

While you still have your helmets within reach, use your negative thinking mindset to fill out the spreadsheet from the first piece in this series.  This will help you anticipate most of the reasonable objections security might raise, and may help you include useful detail the security team might not have known to ask for.

Once you have prepared your list of answers to George’s worksheet and held a team Land Astronaut session together, you will have come most of the way to getting on board with the way your security team thinks.

Preparing for compromise

You’ve considered your options carefully, you’ve learned how to harness negative thinking to your advantage, and you’re ready to talk to your colleagues in security – but sometimes, even with all of these tools at your disposal, you may not walk away with all of the things you are hoping for.

Being willing to compromise and anticipating some of those compromises before you approach the security team will help you negotiate more successfully.

While your Land Astronaut helmets are still within reach, consider using your negative thinking mindset game to identify areas where you may be asked to compromise. If you’re asking for production access to this new service for observability and debugging purposes, think about what kinds of objections may be raised about this and how you might counter them or accommodate them. Consider continuing the activity with half of the team remaining in the Land Astronaut role while the other half advocates from a positive thinking standpoint. This dynamic will get you having conversations about compromise early on, so that when the security team inevitably raises eyebrows, you are ready with answers.

Be prepared to consider compromises you had not anticipated, and enter into discussions with the security team with as open a mind as possible. Remember the team is balancing priorities of not only your team, but other business and development teams as well.  If you and your security colleagues are doing the hard work to meet each other halfway then you are more likely to arrive at a solution that satisfies both parties.

Working together for the long term

While the previous strategies we’ve covered focus on short-term outcomes, in this continuous-deployment, shift-left world we now live in, the best way to convince your security team of the benefits of a third-party service – or any other decision – is to have them along from day one, as part of the team.

Roles and teams are increasingly fluid and boundary-crossing, yet security remains one of the roles least likely to be considered for inclusion on a software development team. Even in 2019, the task of ensuring that your product and stack are secure and well-defended is often left until the end of the development cycle.  This contributes a great deal to the combative atmosphere that is common.

Bringing security people into the development process much earlier builds rapport and prevents these adversarial, territorial dynamics. Consider working together to build Disaster Recovery plans and coordinating for shared production ownership.

If your organisation isn’t ready for that kind of structural shift, there are other ways to work together more closely with your security colleagues.

Try having members of your team spend a week or two embedded with the security team. You may even consider a rolling exchange – a developer for a security team member – so that developers build the security mindset, and the security team is able to understand the problems your team is facing (and why you are looking at introducing this new service).

At the very least, you should make regular time to meet with the security team, get to know them as people, and avoid springing things on them late in the project when change is hardest.

Riding off together into the sunset…?

If you’ve taken the time to get to know your security team and how they think, you’ll hopefully be able to get what you want from them – or perhaps you’ll understand why their objections were valid, and come up with a better solution that works well for both of you.

Investing in a strong relationship between your development and security teams will rarely lead to the apocalypse. Instead, you’ll end up with a better product, probably some new work friends, and maybe an exciting idea for a boundary-crossing new career in tech.

But this story isn’t over! Once you get the green light from security, you’ll need to think about how to roll your new service out safely, maintain it, and consider its full lifespan within your company.  Which leads us to part three of this series, on rolling it out and maintaining it … both your integration and your relationship with the security team.

 

Lilly Ryan is a pen tester, Python wrangler, and recovering historian from Melbourne. She writes and speaks internationally about ethical software, social identities after death, teamwork, and the telegraph. More recently she has researched the domestic use of arsenic in Victorian England, attempted urban camouflage, reverse engineered APIs, wielded the Oxford comma, and baked a really good lemon shortbread.


Outsource Your O11y: How To Be A Champion (part 1/3)

I hear variations on this question constantly: “I’d really like to use a service like Honeycomb for my observability, but I’m told I can’t ship any data off site.  Do you have any advice on how to convince my security team to let me?”

I’ve given lots of answers, most of them unsatisfactory.  “Strip the PII/PHI from your operational data.”  “Validate server side.”  “Use our secure tenancy proxy.”  (I’m not bad at security from a technical perspective, but I am not fluent with the local lingo, and I’ve never actually worked with an in-house security team — I’ve always *been* the security team, de facto as it may be.)

So I’ve invited three experts to share their wisdom in a three-part series of guest posts:

  1. How To Be A Champion, on how to choose a third-party vendor and champion them successfully to your security team.  (George Chamales)
  2. Get Aligned With Security, how to work with your security team to find the best possible outcome for all sides (Lilly Ryan)
  3. Now Roll It Out And Keep Them Happy, on how to operationalize your service by rolling out the integration and maintaining it — and the relationship with your security team — over the long run (Andy Isaacson)

My ✨first-ever guest posts✨!  Yippee.  I hope these are useful to you, wherever you are in the process of outsourcing your tools.  You are on the right path: outsourcing your observability to a vendor for whom it’s their One Job is almost always the right call, in terms of money and time and focus — and yes, even security. 

All this pain will someday be worth it.  🙏❤️  charity + friends


“How to be a Champion”

by George Chamales

You’ve found a third party service you want to bring into your company, hooray!

To you, it’s an opportunity to deploy new features in a flash, juice your team’s productivity, and save boatloads of money.

To your security and compliance teams, it’s a chance to lose your customers’ data, cause your applications to fall over, and do inordinate damage to your company’s reputation and bottom line.

The good news is, you’re absolutely right.  The bad news is, so are they.

Successfully championing a new service inside your organization will require you to convince people that the rewards of the new service are greater than the risks it will introduce (there’s a guide below to help you).  

You’re convinced the rewards are real. Let’s talk about the risks.

The past year has seen cases of hackers using third party services to target everything from government agencies, to activists, to Target (again).  Not to be outdone, attention-seeking security companies have been actively hunting for companies exposing customer data, then issuing splashy press releases as a means to flog their products and services.

A key feature of these name-and-shame campaigns is to make sure that the headlines are rounded up to the most popular customer – the clickbait lead “MBM Inc. Loses Customer Data” is nowhere near as catchy as “Walmart Jewelry Partner Exposes Personal Data Of 1.3M Customers.”

While there are scary stories out there, in many, many cases the risks will be outweighed by the rewards. Telling the difference between those innumerable good calls and the one career-limiting move requires thoughtful consideration and some up-front risk mitigation.

When choosing a third party service, keep the following in mind:

    • The security risks of a service are highly dependent on how you use it.  
      You can adjust your usage to decrease your risk.  There’s a big difference between sending a third party your server metrics vs. your customer’s personal information.  Operational metrics are categorically less sensitive than, say, PII or PHI (if you have scrubbed them properly).
    • There’s no way to know how good a service’s security really is.  
      History is full of compromised companies who had very pretty security pages and certifications (here’s Equifax circa September 2017).  Security features are a stronger indicator, but there are a lot more moving parts that go into maintaining a service’s security.
    • Always weigh the risks vs. the rewards.

 

 

There’s risk no matter what you do – bringing in the service is risky, doing nothing is risky.  You can only mitigate risks up to a point. Beyond that point, it’s the rewards that make risks worthwhile.

Context is critical in understanding the risks and rewards of a new service.  

You can use the following guide to put things in context as you champion a new service through the gauntlet of management, security, and compliance teams.  That context becomes even more powerful when you can think about the approval process from the perspective of the folks you’ll need to win over to get the okay to move forward.

In the next part of this series Lilly Ryan shares a variety of techniques to take on the perspective of your management, security and compliance teams, enabling you to constructively work through responses that can include everything from “We have concerns…” to “No” to “Oh Helllllllll No.”

Championing a new service is hard – it can be equally worthwhile.  Good luck!

 

George Chamales is a useful person to have around. Please send critiques of this post to george@criticalsec.com

“A Security Guide for Third Party Services” Worksheet

Note to thoughtful service providers:  You may want to fill parts of this out ahead of time and give it to your prospective customers.  It will provide your champion with good fortune in the compliance wars to come.  (Also available as a nicely formatted spreadsheet.)

 

Our Reasons

  • Why this service?  This is the justification for the service – the compelling rewards that will outweigh the inevitable risks.  What will be true once the service is online?  Good reasons are ones that a fifth grader would understand.

Our Data

  • What data will it / won’t it collect?  Describe the classes or types of data the service will access / store, and why that’s necessary for the service to operate.  If there are specific types of sensitive data the service won’t collect (e.g. passwords, Personally Identifiable Information, Patient Health Information), explicitly call them out.
  • How is data accessed?  Describe the process for getting data to the service.  Do you have to run their code on your servers, or on your customers’ computers?

Our Costs

  • Costs of NOT doing it?  These are the financial risks / liabilities of not going with this service.  What’s the worst and average cost?  Have you had costly problems in the past that could have been avoided if you were using this service?
  • Costs of doing it?  Include the cost of the service and, if possible, the amount of person-time it’s going to take to operate it.  Ideally less than the cost of not doing it.

Our Risk – how mad will important people be…

  • If it’s compromised?  What would happen if hackers or attention-seeking security companies publicly released the data you sent the service?  Is it catastrophic or an annoyance?
  • When it goes down?  When this service goes down (and it will go down), will it be a minor inconvenience, or will it take out your primary application and infuriate your most valuable customers?

Their Security – in order of importance

  • SSO & 2FA support?  This is a security smoke test: if a service doesn’t support SSO or 2FA, it’s safe to assume that they don’t prioritize security.  It’s also a good idea to investigate SSO support up front, since some vendors charge extra for it (which is a shame).
  • Fine-grained permissions?  This is another key indicator of the service’s maturity level, since it takes time and effort to build in.  It’s also something else they might make you pay extra for.
  • Security certifications?  These aren’t guarantees of quality, but they do indicate that the company has put some effort and money into its processes.  Check their website for general security compliance merit badges such as SOC2 and ISO 27001, or industry-specific ones like PCI or HIPAA.
  • Security & privacy pages?  If these exist, it means the company is willing to publicly state that they do something about security.  The more specific and detailed, the better.
  • Vendor’s security history?  Have there been any spectacular breaches that demonstrated a callous disregard for security, gross incompetence, or both?
  • BONUS questions.  Want to really poke and prod the internal security of your vendor?  Ask if they can answer the following:
      • How many known vulnerabilities (CVEs) exist on your production infrastructure right now?
      • At what time (exactly) was the last successful backup of all your customer data completed?
      • What were the last three secrets accessed in the production environment?

Our Decision

  • Is it worth it?  Look back through the previous sections and ask whether it makes sense to use the third-party service, build it yourself, or not do it at all.  Would a thoughtful person agree with you?

 

 

 


Logs vs Structured Events

I got an interesting tweet the other day from @evntdrvn in response to this thread of mine. Paraphrasing,

“So I’ve almost got our group at work up to Step 1 in your observability maturity model, but some of the devs that I work with want to turn OFF our lovely structured logging in prod for informational-level msgs due to their legacy philosophy (‘we only log errors in prod’). The reasons given are mostly philosophical (“I’m a dev and only interested when things error out, I don’t want any other noise in prod logs”, “I don’t want to slow my app down in prod”). Help?!?”

As I was reading this, I was itching to fly out and dive into battle with Eric. I know exactly where his opinionated devs are coming from. I used to say the same things! I even wrote a whole blog post about it.

These developers have internalized a set of rules and best practices for dealing with output data, in the context of “monolith application development in the early 2000s”.

Monolithic systems assumptions

Those systems had many common constraints and assumptions, such as:

  • We have a monolith service, or a very small number of services. We can model the system in our heads.
  • Logging is done to local disk, which can impact performance
  • Disks are expensive
  • Log lines are spat out inline with execution.  A poorly placed printf can take the whole system down.
  • Investigation is rare, and usually means a human reading error logs.
  • Logging is of poor utility for understanding internal states or execution paths; you should just read the code or use a debugger.  (There are few or no network hops between functions.)
  • Logging is mostly useful for detecting certain terminal crash states or connection errors.

Monolithic logging best practices

Therefore:

  • We should be very stingy in what we log
  • Debuggers should be used for understanding internal states of the code
  • Logs are a last resort and record of crash dumps.  We do not expect to use log data in the course of our daily work.  We assume log-related manual investigation will be infrequent and of limited utility.

These were exactly the right lessons to learn in the era of expensive hardware and monolithic repos/artifacts. Many people still work in environments like this, and follow logging best practices like these. God bless, more power to em.

Distributed systems assumptions

But more and more of us face systems that are very different.

  • We have many services, possibly many MANY services. A representative request will have “many” hops across “many” services and routers and proxies and meshes and storage systems.
  • We cannot model the system in our heads; it would be a mistake to try. We rely on tooling as the source of truth for those systems.
  • You may or may not have access to those services, or the systems your code runs on. There may or may not be a logging facility, or a centralized log aggregator. Your only view of the system is through the instrumentation of your code.
  • Disks and system resources are cheap, ephemeral, all but disposable.
  • Data services are similarly cheap.  We can almost entirely silo application performance off from the cost of writing perf data out.
  • Investigation is prohibitively slow and expensive for a human to do by hand. Many of the nodes or processes we need to inspect may no longer even exist, but their past states may still be relevant to us in understanding patterns to the present time.
  • Investigation should usually be done distributedly, across all instantiations of your code, however many there might be — and in real time.
  • Investigation requires computation — not just string search.  We need to ask new questions on the fly, involving math and percentiles and breakdowns and group-bys (see the sketch after this list).  And we need access to the raw requests in order to run accurate computations — no pre-aggregates.
  • The hardest part usually isn’t debugging the code, it’s figuring out where the code you need to debug even lives.  Or what the errors or outliers have in common from the perspective of the code.  Fixing the code itself is often comparatively trivial, once found.
  • What even is ‘logging’?
  • What even is ‘local disk’?
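To make “computation, not just string search” concrete, here’s a small sketch of the kind of question your tooling needs to answer on the fly: p95 latency broken down by endpoint, computed from raw events rather than from pre-aggregated metrics.  The event fields here are hypothetical, and a real columnar store would run this over billions of events; the shape of the query is the point.

```python
# Sketch: ad hoc computation over raw structured events.
# The event fields are hypothetical; the point is that questions like
# "p95 latency by endpoint" get computed from the raw requests on the
# fly, not read back from pre-aggregated metrics.
import math
from collections import defaultdict

events = [
    {"endpoint": "/cart", "duration_ms": 12.0, "status": 200},
    {"endpoint": "/cart", "duration_ms": 840.0, "status": 500},
    {"endpoint": "/login", "duration_ms": 31.0, "status": 200},
    # ... one wide event per request per service
]

def p95(values):
    """Nearest-rank 95th percentile of a list of numbers."""
    ordered = sorted(values)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]

durations_by_endpoint = defaultdict(list)
for event in events:
    durations_by_endpoint[event["endpoint"]].append(event["duration_ms"])

for endpoint, durations in sorted(durations_by_endpoint.items()):
    print(f"{endpoint}  p95: {p95(durations)} ms")
```

No amount of grepping log lines gets you this; you need the raw, structured requests.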

This isn’t optional: at some point of complexity or scale or distributedness, it becomes necessary if you want to work with these systems.

Logs can’t help you here.

And you aren’t going to get that kind of explorable data out of loglevel:ERROR, or by chopping up your telemetry into disconnected metrics devoid of context.

You are only going to get this kind of explorable, ad hoc, computation-friendly data if you take a radically new approach to how you output and aggregate telemetry.  You’re going to need to replace your log lines and log levels with a different sort of beast: arbitrarily wide structured events that describe the request and its context, one event per request per service.  (Remember kids: you either have a single source of truth, or multiple sources of lies.)

If it helps, don’t think of them as log files any more. Think of them as events. Yes, you can stash this stream in a file, but why would you?  on what disk?  will that work for your serverless functions too?  Just stream them over the network to wherever you want to put them.

 

Log levels are another confusing and unnecessary artifact of yesteryear that you no longer really need. The more you think of structured events as logs, the more tempted you may be to apply the old set of best practices. So just don’t think of them as logs at all.

How to gather and structure your data

Instead of dribbling little pebbles of log effluvia throughout your code, do this.  (If you’re a honeycomb user, our beelines do it all automatically for you *and* pre-populate the blobs with everything we know of your context.)

  1. Initialize an empty blob at the beginning, when the request first enters the service.
  2. Stuff any and all interesting detail about the request into that blob throughout the lifetime of the request.
    • Any unique id, any high-cardinality variable, any headers passed in, every full query, normalized query, and query execution time; every http call out to a remote service, every http execution time; any shopping cart id, first and last name, execution time — literally anything interesting, append to blob.
  3. Then, when the request is about to exit or error, write the blob off to honeycomb or another service or disk somewhere.
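Here’s a minimal sketch of that lifecycle in Python.  To be clear, this is not the beeline API; the handler shape and field names are hypothetical stand-ins.  It just shows the bare pattern: initialize one blob, append to it throughout, write it out exactly once on exit.

```python
import json
import time

def send_event(event: dict) -> None:
    # Stand-in for streaming the event over the network to honeycomb,
    # another service, or disk: one wide event per request per service.
    print(json.dumps(event))

def handle_request(request: dict) -> dict:
    # 1. Initialize an empty blob when the request enters the service.
    event = {"request_id": request.get("id"), "start": time.time()}
    try:
        # 2. Stuff any and all interesting detail into the blob as you go:
        # user ids, normalized queries, timings of outbound calls, etc.
        event["user_id"] = request.get("user_id")
        event["normalized_query"] = "SELECT * FROM carts WHERE id = ?"
        result = {"cart_id": request.get("cart_id")}
        event["status"] = "ok"
        return result
    except Exception as exc:
        event["status"] = "error"
        event["error"] = str(exc)
        raise
    finally:
        # 3. Write the blob off exactly once, as the request exits or errors.
        event["duration_ms"] = (time.time() - event.pop("start")) * 1000
        send_event(event)

# Example: handle_request({"id": "req-1", "user_id": 42, "cart_id": "c-9"})
```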

You can see immediately how this method has radically different performance implications and risks than the earlier shotgun spray approach. No more “oops i accidentally put a print line INSIDE a for loop”. The write amplification profile is compressed. Most importantly, the incremental cost of capturing more detail about the request per service is nearly zero.

And now you have the kind of structured data that you can feed into something like a columnar store, or honeycomb, and run ad hoc queries to your heart’s delight.

Distributed systems logging events best practices:

Let’s sum up.  (I’m including links to other past rants on this topic):

Just think.

No more doing multi-line regexps trying to look for the same request ID or user ID doing five suspicious things in a row.

No more regexps at all, for fuck’s sake.

No more bullshit percentiles that were computed at write time by averaging over a bunch of other averages.

No more having to jump around from dashboards to logs trying to vainly eyeball correlate one spike with another.  No more wondering why no two tools can agree if anything even exists or not.

Just gather the detail you need to ask the questions when you need them, and store it in a single source of truth.  It’s that simple.

No need to shame people for learning best practices that worked perfectly well for a long time.  You can either let them learn the hard way that this transformation is non-optional, or you can help them learn the easy way that it’s simply much better and easier to invest in this telemetry up front.  You seem like a nice enough chap, which is probably why you chose door 2.  (If you wanted to get tougher about it, have a few reformed folks in to tell their horror stories.  Try some ex-twitter engineers.)

The hardest part seems to be getting people to unlearn all the best practices they once learned for dealing with logs.  So just don’t call it logs anymore, if that helps. Call it “structured events”.

– charity.


Engineering Management: The Pendulum Or The Ladder

Last night I was out with a dear friend who has been an engineering manager for a year now, and by two drinks in I was rattling off a long list of things I always say to newer engineering managers.

Then I remembered: I should write a post! It’s one of my goals this year to write more long form instead of just twittering off into the abyss.

There’s a piece I wrote two years ago, The Engineer/Manager Pendulum,  which is probably my all time favorite.  It was a love letter to a friend who I desperately wanted to see go back to engineering, for his own happiness and mental health.  Well, this piece is a sequel to that one.

It’s primarily aimed at new managers, who aren’t sure what their career options look like or how to evaluate the opportunities that come their way, or how it may expand or shrink their future opportunities.

The first fork in the manager’s path

Every manager reaches a point where they need to choose: do they want to manage engineers (a “line manager”), or do they want to try to climb the org chart? — manage managers, managers of other managers, even other divisions; while being “promoted” from manager to senior manager, director to senior director, all the way up to VP and so forth.   Almost everyone’s instinct is to say “climb the org chart”, but we’ll talk about why you should be critical of this instinct.

They also face a closely related question: how technical do they wish to stay, and how badly do they care?

Are you an “engineering MANAGER” or an “ENGINEERING manager”?

These are not unlike the decisions every engineer ends up making about whether to go deep or go broad, whether to specialize or be a generalist.  The problem is that both engineers and managers often make these career choices with very little information — or even awareness that they are doing it.

And managers in particular then have a tendency to look up ten years later and realize that those choices, witting or unwitting, have made them a) less employable  and b) deeply unhappy.

Lots of people have the mindset that once they become an engineering manager, they should just go from gig to gig as an engineering manager who manages other engineers: that’s who they are now.  But this is actually a very fragile place to sit long-term, as we’ll discuss further on in this piece.

But let’s start at the beginning, so I can speak to those of you who are considering management for the very first time.

“So you want to try engineering management.”

COOL! I think lots of senior engineers should try management, maybe even most senior engineers.  It’s so good for you, it makes you better at your job. (If you aren’t a senior engineer, and by that I mean at least 7+ years of engineering experience, be very wary; know this isn’t usually in your best interest.)

Hopefully you have already gathered that management is a career change, not a promotion, and you’re aware that nobody is very good at it when they first start.

That’s okay! It takes a solid year or two to find new rhythms and reward mechanisms before you can even begin to find your own voice or trust your judgment. Management problems look easy, deceptively so.  Reasons this is hard include:

  1. Most tech companies are absolutely abysmal at providing any sort of training or structure to help you learn the ropes and find your feet.
  2. Even if they do, you still have to own your own career development.  If learning to be a good engineer was sort of like getting your bachelor’s, learning to be a good manager is like getting your PhD — much more custom to who you are.
  3. It will exhaust you mentally and emotionally in the weirdest ways for much longer than you think it should.  You’ll be tired a lot, and you’ll miss feeling like you’re good at something (anything).

This is because you need to change your habits and practices, which in turn will actually change who you are.  This takes time.  Which is why …

The minimum tour of duty as a new manager is two years.

If you really want to try being a manager, and the opportunity presents itself, do it!  But only if you are prepared to fully commit to a two-year-long experiment.

Commit to it like a proper career change. Seek out new peers, find new heroes. Bring fresh eyes and a beginner’s mindset. Ask lots of questions. Re-examine every one of your patterns and habits and priorities: do they still serve you? your team?

Don’t even bother thinking about it in terms of whether you “enjoy managing” for a while, or trying to figure out if you are any good at it. Of course you aren’t any good at it yet.  And even if you are, you don’t know how to recognize when you’ve succeeded at something, and you haven’t yet connected your brain’s reward systems to your successes.  A long stretch of time without satisfying brain drugs is just the price of admission if you want to earn these experiences, sadly.

It takes more than one year to learn management skills and wire up your brain to like it.  If you are waffling over the two year commitment, maybe now is not the time.  Switching managers too frequently is disruptive to the team, and it’s not fair to make them report to someone who would rather be doing something else or isn’t trying their ass off.

It takes about 3-5 years for your skills to deteriorate.

So you’ve been managing a team for a couple years, and it’s starting to feel … comfortable?  Hey, you’re pretty good at this!  Yay!

With a couple of years under your belt as a line manager, you now have TWO powerful skill sets.  You can build things, AND you can organize people into teams to build even bigger things. Right now, both sets are sharp.  You could return to engineering pretty easily, or keep on as a manager — your choice.

But this state of grace doesn’t last very long. Your technical skills stop advancing when you become a manager, and instead begin eroding.  Two years in, you aren’t the effective tech lead you once were; your information is out of date and full of gaps, and the hard parts are led by other people these days.

More critically, your patterns of mind and habits shift over time, and become those of a manager, not an engineer.  Consider how excited an engineer becomes at the prospect of a justifiable greenfield project; now compare that to her manager’s glum reaction as she instinctively winces at having to plan for something so reprehensibly unpredictable and difficult to estimate.  It takes time to rewire yourself back.

If you like engineering management, your tendency is to go “cool, now I’m a manager”, and move from job to job as an engineering manager, managing team after team of engineers.  But this is a trap.  It is not a sound long term plan.  It leads too many people off to a place they never wanted to end up: technically sidelined.

Why can’t I just make a career out of being a combo tech lead+line manager?

One of the most common paths to management is this: you’re a tech lead, you’re directing ever larger chunks of technical work, doing 1x1s and picking up some of the people stuff, when your boss asks if you’d like to manage the team.  “Sure!”, you say, and voila — you are an engineering manager with deep domain expertise.

But if you are doing your job, you begin divesting yourself of technical leadership responsibilities immediately.  Your own technical development should screech to a halt once you become a manager, because you have a whole new career to focus on learning.

Your job is to leverage that technical expertise to grow your engineers into great senior engineers and tech leads themselves.  Your job is not to hog the glory and squat on the hard problems yourself, it’s to empower and challenge and guide your team.  Don’t suck up all the oxygen: you’ll stunt the growth of your team.

But your technical knowledge gets dated, and your skills atrophy.  The longer it’s been since you worked as an engineer, the harder it will be to switch back.  It gets real hard around three years, and five years seems like a tipping point.[1]

And because so much of your credibility and effectiveness as an engineering leader comes from your expertise in the technology your team uses every day, you will ultimately no longer be capable of technical leadership, only people management.

On being an “engineering manager” who only does people management

I mean, there’s a reason we don’t lure good people managers away from Starbucks to run engineering teams.  It’s the intersection and juxtaposition of skill sets that gives engineering managers such outsize impact.

The great ones can make a large team thrum with energy.  The great ones can break down a massive project into smaller projects that challenge (but do not overwhelm) a dozen or more engineers, from new grads to grizzled veterans, pushing everyone to grow.  The great ones can look ahead and guess which rocks you are going to die on if you don’t work to avoid them right now.

The great ones are a treasure: and they are rare.  And in order to stay great, they regularly need to go back to the well to refresh their own hands-on technical abilities.

There is an enormous demand for technical engineering leaders — far more demand than supply.  The most common hackaround is to pair a people manager (who can speak the language and knows the concepts, but stopped engineering ages ago) with a tech lead, and make them collaborate to co-lead the team.  This unwieldy setup often works pretty well.

But most of those people managers didn’t want or expect to end up sidelined in this way when they were told to stop engineering.

If you want to be a pure people manager, not doing engineering work, and you don’t want to climb the ladder or can’t find a ladder to climb, more power to you.  But I don’t know that I’ve met many people who chose this deliberately.  I have met a lot of people who landed in this situation by accident, and they are always kinda angsty and unhappy about it.  Don’t let yourself become this person by accident.  Please.

Which brings me to my next point.

You will be advised to stop writing code or engineering.

Fuck

That.

 ✨

Everybody’s favorite hobby is hassling new managers about whether or not they’ve stopped writing code yet, and not letting up until they say that they have.  This is a terrible, horrible, no-good VERY bad idea that seems like it must originally have been a botched retelling of the correct advice, which is:

Stop writing code and engineering

in the critical path

Can you spot the difference?  It’s very subtle.  Let’s run a quick test:

  • Authoring a feature?  ⛔️
  • Covering on-call when someone needs a break?  ✅
  • Diving on the biggest project after a post mortem?  ⛔️
  • Code reviews?  ✅
  • Picking up a p2 bug that’s annoying but never seems to become top priority?  ✅
  • Insisting that all commits be gated on their approval?  ⛔️
  • Cleaning up the monitoring checks and writing a library to generate coverage?  ✅

The more you can keep your hands warm, the more effective you will be as a coach and a leader.  You’ll have a richer instinct for what people need and want from you and each other, which will help you keep a light touch.  You will write better reviews and resolve technical disputes with more authority.  You will also slow the erosion and geriatric creep of your own technical chops.

I firmly believe every line manager should either be in the on call rotation or pinch hit liberally and regularly, but that’s a different post.

Technical Leadership Track

If you love technology and want to remain a subject-matter expert in designing, building, and shipping cutting-edge technical products and systems, you cannot afford to let yourself drift too far or too long away from hands-on engineering work.  You need to consciously cultivate your path, probably by practicing some form of the engineer/manager pendulum.

If you love managing engineers — if being a technical leader is a part of your identity that you take great pride in — then you must keep up your technical skills and periodically invest in your practice and renew your education.  Again: this is simply the price of admission.  You need to renew your technical abilities, your habits of mind, and your visceral senses around creating and maintaining systems.  There is no way to do this besides doing it.  If management isn’t a promotion, then returning to hands-on work isn’t a demotion, either.  Right?

One warning: Your company may be great, but it doesn’t exist for your benefit.  You and only you can decide what your needs are and advocate for them.  Remember that next time your boss tries to guilt you into staying on as manager because you’re so badly needed, when you can feel your skills getting rusty and your effectiveness dwindling.  You owe it to yourself to figure out what makes you happy and build a portfolio of experiences that liberate you to do what you love.  Don’t sacrifice your happiness at the altar of any company.  There are always other companies.

Honestly, I would try not to think of yourself as a manager at all: you are an “engineering leader” performing a tour of duty in management.  You’re pursuing a long term strategy towards being a well-respected technologist, someone who can sling code, give informed technical guidance, and explain things in detail, customized to anyone at any level of sophistication.

Organizational Leadership Track

Most managers assume they want to climb the ladder.  Leveling up feels like an achievement, and that pull can be impossible to resist.

Resist it.  Or at least, resist doing it unthinkingly.  Don’t do it because the ladder is there and must be climbed.  Know as much as you can about what you’re in for before you decide it’s what you want.

Here are a few reasons to think critically about climbing the ladder to director and executive roles.

  1. Your choices shrink. There are fewer jobs, with more competition, mostly at bigger companies.  (Do you even like big companies?)
  2. You basically need to do real time at a big company where they teach effective management skills, or you’ll start at a disadvantage.
  3. Bureaucracies are highly idiosyncratic; skills and relationships may or may not transfer with you between companies.  As an engineer you could skip off to greener pastures every year or two if you landed a crap gig.  An engineer has … about 2-3x more leeway in this regard than an exec does.  A string of short director/exec gigs is a career ender or a coach seat straight to consultant life.
  4. You are going to become less employable overall.  The ever-higher continuous climb almost never happens, usually for reasons you have no control over.  This can be a very bitter pill.
  5. Your employability becomes more about your “likability” and other problematic things.  Your company’s success determines the shape of your career much more than your own performance.  (Actually, this probably begins the day you start managing people.)
  6. Your time is not your own. Your flaws are no longer cute. You will see your worst failings ripple outward and be magnified and reflected.  (Ditto, applies to all leaders but intensifies as you rise.)
  7. You may never feel the dopamine hit of “i learned something, i fixed something, i did something” that comes so freely as an I.C.  Some people learn to feel satisfaction from managery things, others never do.  Most describe it as a very subdued version of the thrill you get from building things.
  8. You will go home tired every night, unable to articulate what you did that day. You cannot compartmentalize or push it aside. If the project failed for reasons outside your control, you will be identified with the failure anyway.
  9. Nobody really thinks of you as a person anymore, you turn into a totem for them to project shit on. (Things will only get worse if you hit back.)  Can you handle that?  Are you sure?
  10. It’s pretty much a one-way trip.

Sure, there are compensating rewards.  Money, power, impact.  But I’m pointing out the negatives because most people don’t stop to consider them when they start saying they want to try managing managers.  Every manager says that.

The mere existence of a ladder compels us all to climb.

I know people who have climbed, gotten stuck, and wished they hadn’t. I know people who never realized how hard it would be for them to go back to something they loved doing after 5+ years climbing the ladder farther and farther away from tech.  I know some who are struggling their way back, others who have no idea how or where to start.  For those who try, it is hard.  

You can’t go back and forth from engineering to executive, or even director to manager, in the way you can traverse freely between management and engineering as a technologist.

I just want more of you entering management with eyes wide open.  That’s all I’m saying.

If you don’t know what you want, act to maximize your options.

Engineering is a creative act. Managing engineers will require your full, attentive, and authentic self. You will be more successful if you figure out what that self is, and honor its needs.  Try to resist the default narratives about promotions and titles and roles; they have nothing to do with what satisfies your soul.  If you have influence, use it to lean hard against things like paying managers more than ICs of the same level.[2]

It’s totally normal not to know who you want to be, or have some passionate end goal.  It’s great to live your life and work your work and keep an eye out for interesting opportunities, and see what resonates.  It’s awesome when you get asked to step up and opportunistically build on your successes.

If you want a sustainable career in tech, you are going to need to keep learning your whole life. The world is changing much faster than humans evolved to naturally adapt, so you need to stay a little bit restless and unnaturally hungry to succeed in this industry.

The best way to do that is to make sure you a) know yourself and what makes you happy, and b) spend your time mostly in alignment with that. Doing things that make you happy gives you energy. Doing things that drain you is antithetical to your success. Find out what those things are, and don’t do them.

Don’t be a martyr, don’t let your spending habits shackle you, and don’t build things that trouble your conscience.

And have fun.

Yours in inverting $(allthehierarchies),
charity.


[1] Important point: I am not saying you can’t pick up the skills and patience to practice engineering again.  You probably can!  But employers are extremely reluctant to pay you a salary as an engineer if you haven’t been paid to ship code recently.  The tipping point for hireability comes long before the tipping point for learning ability, in my experience.

[2] It is in no one’s best interest for money to factor into the decision of whether to be a manager or not.  Slack pays their managers LESS than engineers of the same level, and I think this is incredibly smart: sends a strong signal of servant leadership.

 
