Every Achievement Has A Denominator

One of the classic failure modes of management is the empire-builder — the manager who measures their own status, rank, or value by the number of teams and people “under” them.

Everyone knows you aren’t supposed to do this, but most of us secretly, sheepishly do it anyway to some extent. After all, it’s not untrue — the more teams and people that roll up to you, the wider your influence and the more impact you have on more people, by definition.

The other reason is, well, it’s what we’ve got. How else are we supposed to gauge our influence and impact, or our skill as a leader? We don’t really have any other language, metrics or metaphors readily available to us. 😖

Well… Here’s one:

✨Every achievement has a denominator✨

Organization size can be a liability

Let’s say you have 1,000 people in your org and you collectively achieve something remarkable. Good for you!

What if you achieved the same thing with 10,000 people, instead? What would that say about your leadership?

What if you achieved the same thing with 100 people?

Or even 10 people?

Lots of people take pride in their ability to manage large organizations. And with thousands of people in your org, you kinda better do something fucking great. But what if you instead took pride in your ability to deliver outsize results with a small denominator?

What if comp didn’t automatically bloat with the size of your org, but instead scaled with the impact of your work divided by the number of contributors — rewarding leaders for leaner teams, not larger ones?

Bigness itself is costly. There’s the cost of the engineers, managers, product folks, designers, etc., of course. But the bigger it gets, the more coordination costs are incurred — and those are the worst costs of all, because they do not accrue to any user benefit, and they often lead to lack of focus and product surface sprawl.

Constraints fuel creativity. Having “enough” engineers for a project is usually a terrible idea; you want to be constrained, you want to have to make hard decisions about where to spend your time and where to invest development cycles.

More often than not, scope is the enemy

As Ben Darfler wrote earlier this year about our approach to engineering levels at Honeycomb:

There are times when broad scope may be unavoidable, but at Honeycomb, we try to cultivate a healthy skepticism toward scope. More often than not, scope is the enemy. We would rather reward engineers who find clever ways to limit scope by decomposing problems in both time and size. We also want to reward engineers who work on the most important problems for the business, regardless of the size of the project. We don’t want to reward people for gaming out their work based on what will get them promoted.

The same is true for engineering managers, directors and VPs. We would rather reward them for getting things done with small, nimble teams, not for empire building and team sprawl. We want to reward them for working on the most important problems for the business, regardless of what size their teams are.

What was the denominator of the last big project you landed? Could you have done it with fewer people? How will you apply those learnings to the next big initiative?

Can we find more language and ways to talk about, or take pride in how efficiently we do big things? At the very least, perhaps we can start paying attention to the denominator of our achievements, and factor that into how we level and reward our leaders.

charity.

P.S. I did not invent this phrase, but I am unfortunately unable to credit the person I heard it from (a senior Googler). I simply think it’s brilliant, and so helpful.


The Future of Ops is Platform Engineering

First published on 2022-09-30 at https://www.honeycomb.io/blog/future-ops-platform-engineering.

Two years ago I wrote a piece in The New Stack about the Future of Ops Careers. Towards the end, I wrote:

The reality is that jack-of-all-trades systems infrastructure jobs are slowly vanishing: the world doesn’t need thousands of people who can expertly tune postfix, SpamAssassin, and ClamAV—the world has Gmail. (…)

Building infrastructure and operational expertise used to be bundled together into a single role. But the industry is now bifurcating along an infrastructure fault line, and the overlap between infrastructure-oriented engineers and operationally-minded engineers is swiftly eroding. Engineers who love this work increasingly have a choice to make. Either you can 1) go deep on infrastructure by joining a company that does infrastructure as a service, or 2) go broad on operability by joining a company to help them do as little infrastructure as possible.

I described the second category as “operations engineering minus the infrastructure,” dedicated to evaluating and assembling a production stack of third-party platform providers, enabling software engineers to self-serve their services and own their own code in production. I said:

  • Your job will be to aggressively minimize the cycles your org devotes to infrastructure by finding effective ways to outsource or minimize infra labor. Your job is to NOT go deep if there is any workable alternative.
  • Your job will be to work cross-functionally with all the other software engineering teams, looking for ways to speed up their time to value and helping them own their own code in production.
  • Your job will be to move past the kludgey old models of “outsourcing” to sophisticated understandings of how and where to leverage abstractions that can radically accelerate development.

That second category I was describing now has a name. We call those teams “platform engineering.”

The fifty-year arc of software careers

In the beginning, there were people who wrote and ran software. At some point, we spun away ops skills from dev skills into two different professions, but that turned out to be a ginormous mistake, so along came DevOps to reunify them. Nowadays, ops as an independent profession is in the process of fading out. Companies are spinning down their ops teams left and right. Engineers who formerly identified as sysadmins or operations have turned into DevOps engineers, and soon there will just be “software people” again. This is the way of things.

Please note that this is NOT the same thing as saying “ops is dead,” or “ops skills are no longer valuable or needed.”[1] Our systems are only getting more complex, more difficult to operate, and simultaneously more critical to life on earth, which means that operational excellence has never been more desperately needed (and if you don’t respect that, 🌈 you deserve to suffer 🌈).

The industry story of the past three to five years has been us trying to figure out how to help software engineers own their own code in production,[2] phasing out dedicated ops teams, and aggressively outsourcing as much infrastructure as possible.

As we should. Developer cycles are the scarcest resource in your company, and you want to spend as many of those as possible on your core product: the crown jewel, the code that makes you a business. Money is cheaper than engineering cycles, and teams that are focused on their core business will always outperform teams whose focus is spread across dozens of non-revenue-generating projects. Let someone else build and run all the dependencies and adjacencies.

Before: some engineers wrote code, and some engineers ran code.

Now: all engineers write code, and all engineers run the code they write.

Platform engineering is what stands between you and darkness

When you start talking about putting software engineers on call for their own code, and generally being more involved in production, some percentage of the time you will hear back a guttural wail of despair: “You can’t expect me to know EVERYTHING about EVERYTHING!”

Quite right; we can’t. Platform engineering teams are part of the answer to this perfectly reasonable complaint. It’s not that you’re being asked to do or understand more in toto, but the distribution of labor and responsibility is shifting:

Before: some engineers wrote code, and some engineers ran code.

Now: all engineers write code, and all engineers run the code they write—but we divide the areas of responsibility by layer or function.

The emergence of a minimum viable self-serve tier

In the earliest days of a company, your first few engineers end up bootstrapping an infrastructure by reading AWS docs or blog posts, or asking a friend for recommendations to get started. They might start by setting up a managed container service, or configuring Terraform, and for a while everybody deploys and owns their own code, just as god intended.

But cognitive limits kick in pretty quickly. The maze of APIs and SDKs and components out there is simply bewildering, even for an experienced ops hand. Before long, it becomes someone’s job to make good decisions, pick a suite of compute and storage options that serve the team’s needs, and write some tooling that pulls everything into a coherent whole—which, at a minimum, lets you:

  1. Run tests and generate new artifacts
  2. Deploy artifacts, version them, and roll back
  3. Instrument, monitor, and debug
  4. Store data somewhere, manage schemas and migrations
  5. Adjust capacity as needed
  6. Define and commit all components (and their relationships) as code

Once these are built, it should be trivial for an engineer to come along and spin up a new service using templates and components from existing services. It should be much simpler and easier to use the blessed paths than anything else, and there should be friction if you go off the beaten path.

Congratulations! You’ve just been platformed 🎉. One of the key principles of any developer platform is that it should be easy to do the right things, and hard to do the wrong things.
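To make “the blessed path should be the easiest path” concrete, here is a tiny, purely hypothetical sketch of a golden-path service definition as code. None of these names come from a real platform; it just illustrates the idea that one small, declarative spec committed next to the service can drive CI, deploys, telemetry, and capacity defaults.

// Hypothetical sketch only: a "golden path" service definition committed as
// code. Every type and field name here is invented for illustration.
package main

import "fmt"

// ServiceSpec is everything the platform needs to test, build, deploy,
// instrument, and scale a new service, checked in next to the service code.
type ServiceSpec struct {
	Name      string
	Runtime   string // a blessed container base image
	Replicas  int
	Datastore string // a managed, pre-approved database option
	OnCall    string // the team that gets paged
}

func main() {
	spec := ServiceSpec{
		Name:      "search-api",
		Runtime:   "go-1.22-base",
		Replicas:  3,
		Datastore: "managed-postgres",
		OnCall:    "search-team",
	}
	// A real platform would turn this into pipelines, deploys, rollbacks,
	// telemetry, and capacity settings; here we just print it.
	fmt.Printf("registering %s with the platform\n", spec.Name)
}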

The differences between platform engineering and traditional ops

Platform teams are typically staffed by engineers who are comfortable writing software. Not just scripting and automation, but writing tests and doing code reviews. Platform teams also operate much more like product development teams do, with product managers (and occasionally, designers, developer advocates, or UX researchers).

This doesn’t mean that everybody on a platform team has to have originally been a software engineer; in fact, a super common failure condition for platform teams is simply thinking all they need to do is hire software engineers to build developer tools. A strong platform team has an equally deep grounding in operations experience and software development. Individuals who are experts in both areas are fairly rare, but you can pull together a strong, well-rounded team by assembling a mix of SWEs (with some ops experience) and ops or DevOps engineers (with some software experience) and having them learn and grow from each other.

Platform teams are decidedly cloud-native; the platforms they build mostly sit atop the cloud itself—PaaS, IaaS, everything-aaS, serverless, and so forth.

Ops/DevOps teams are oriented around managing infrastructure, often several generations of infrastructure. Their turf is everything from data centers and bare metal up through virtualization, containers, and the cloud (they aren’t so much cloud-native as cloud-enabled). They measure themselves on things like SLOs and the DORA metrics. You know they’re doing a good job if the system is up/available and users are happy.

Platform teams are oriented around providing a good experience for developers to self-serve and self-manage their code. The more swiftly and easily developers can move, the better your platform team. Operational excellence, in the platform model, is actually more the responsibility of the other engineering teams (and/or an adjacent SRE team) than that of the platform team.

Platform teams typically work higher up the stack than operations, DevOps, or SRE teams do, and they involve a great deal less infrastructure. On the contrary, platform teams are bent on paying other people to run as much shit as possible, preserving their own scarce development cycles for their core product.

Here is a somewhat tongue-in-cheek table of the similarities and differences between the archetypes.

Platform engineers vs. DevOps engineers

  • % of job spent writing code. Platform engineer: more than 50%. Ops (or DevOps) engineer: less than 50%.
  • Rest of time spent. Platform engineer: gathering product requirements, doing user research, architecture discussions, optimizing internal workflows, researching new tools and developer productivity ideas, reviewing other teams’ diffs for impact, performance tuning, helping other engineers own & scale their code, fixing CI/CD pipelines. Ops (or DevOps) engineer: fixing cron jobs, automating old setup docs, converting PXE/rsync to Chef/Puppet, converting Chef/Puppet to Terraform, converting VMs to containers, deploying software, debugging broken deploys, writing monitoring checks, doing retros, building out new services, pairing with software engineers to understand and debug their code, investigating weird shit, documentation, etc.
  • Responsible for. Platform engineer: enabling internal teams to self-serve their ability to run and own their code in production; creating standard, reusable components and processes; defining golden paths. Ops (or DevOps) engineer: infrastructure capacity planning, scaling, performance tuning, upgrading; reliability and resiliency, SLOs and monitoring/alerting; delivering a quality experience to customers.
  • Builds for. Platform engineer: internal developer teams. Ops (or DevOps) engineer: customers.
  • Development style. Platform engineer: infrastructure as a product. Ops (or DevOps) engineer: infrastructure as code.
  • Works with product managers. Platform engineer: yes. Ops (or DevOps) engineer: no.
  • Works with UX researchers or designers. Platform engineer: sometimes. Ops (or DevOps) engineer: no.
  • Dashboards & graphs. Platform engineer: uses APM, observability, tracing; cares a lot about instrumentation and OpenTelemetry. Ops (or DevOps) engineer: uses metrics, logs, dashboards; monitoring, alerting, and agent/sidecar/blackbox telemetry.
  • What “coding” means to them. Platform engineer: developing new features & services, writing tests; these are (primarily) software people who do systems. Ops (or DevOps) engineer: automation, configuration, DSLs, extending and debugging existing code; these are systems people who do software.
  • Preferred language. Platform engineer: Go, Rust. Ops (or DevOps) engineer: Python, Ruby.
  • Time spent in Linux. Platform engineer: hardly any. Ops (or DevOps) engineer: a lot.
  • Succeeds when. Platform engineer: developers can easily choose good defaults, self-serve their infra, and own their own code in production. Ops (or DevOps) engineer: infrastructure is scalable, secure, cost-effective, reliable, and customers are happy.
  • Native terrain. Platform engineer: serverless, *aaS, APIs for everything (cloud-native and above). Ops (or DevOps) engineer: instances, VMs, containers, regions, multi-cloud (everything “below,” but up to and including the cloud).
  • Databases. Platform engineer: uses hosted DBs. Ops (or DevOps) engineer: runs their own, blending automation & DBA expertise.
  • SSH. Platform engineer: no. Ops (or DevOps) engineer: yes.
  • Shell. Platform engineer: a REPL. Ops (or DevOps) engineer: bash/zsh.
  • Mantra. Platform engineer: “Run Less Software.” Ops (or DevOps) engineer: “Cattle, Not Pets.”

What about DevOps vs. SRE?

Countless words have been spilled on the difference between DevOps and SRE,[3] which I won’t rehash.

Here’s what I’ll say: DevOps, to me, feels like a relevant concept for companies that have a lot of infrastructure to wrangle. Companies that do in fact have dev teams and ops teams, or dev teams and DevOps teams (🙄), tend to have a lot of operational shit to automate, test, and run. They use config management, virtualization, and containers, often managing several generations worth of technology, possibly even down to data centers and bare metal. DevOps is for companies that have some combination of bare metal, VMs, regions, AZs, multi-cloud, networking devices, self-managed databases, etc.

DevOps is capacious. It contains multitudes. DevOps writes code, and DevOps has a fuckload of code to manage.

It is also on its way to becoming irrelevant. We are swiftly entering a post-DevOps world.

SRE, to me, feels different. I associate SRE with very large companies, where they mostly have software engineers owning their own code in production, but maybe still struggle with it a bit. SREs are often embedded within software engineering teams or product groups, and they focus a lot on, well, reliability, as the name suggests.

This means they do less infrastructure jockeying or automating (although they still do some coding). They typically have a lot to say about instrumentation, monitoring and observability, and cross-functional coordination. They run incident response and do blameless retros, and they tend to be experts at scaling.

If a company has both a DevOps team and SRE, typically I expect to see the SRE team more on the frontlines, involved with incidents, telemetry, etc., and DevOps teams more on the backburner, slinging pipes and plumbing.

Observability engineering as a case study

In the same piece I referenced earlier, I also wrote about the role of observability teams. I said they should largely no longer be running their own monitoring and graphing software in-house. Yet there is still a place for observability teams to exist: they remain a critical link between outsourced solutions and internal developer needs.

That team should write libraries, generate examples, and drive standardization; ushering in consistency, predictability, and usability. They should partner with internal teams to evaluate use cases. They should partner with your vendors as roadmap stakeholders. They might also write glue code and helper modules to connect disparate data sources and create cohesive visualizations. Basically, that team becomes an integration point between your organization and the outsourced work.

I originally wrote this about observability, but it could just as easily be used to describe platform engineering as a whole. This is the role—being the bridge between other vendors and your own core software. It’s a very high-leverage place to sit.

Ops is dead, long live ops

I’ve spent a lot of time thinking about this because we’ve had such a hard time nailing down exactly who the Honeycomb customer is. Sometimes our buyer is an ops team buying it for their SWEs, sometimes it’s SREs in the midst of an outage, sometimes it’s a VP or director of engineering, or an architect, or a CTO, or a “full stack” engineering team, or even a product manager. It is hard to form a snappy answer out of that list.

The first couple of questions every new go-to-market candidate asks us are “who is your buyer?” and “how do we help them?” To which I respond with a five-minute ramble where I list every persona above and each of their pain points. Hardly the concrete answer they would like to receive.

As it goes, sociotechnical trends come and go. A year ago, Christine and I were speculating that platform engineering might be on the verge of consolidating the necessary ingredients that make up our ideal buyer:

  1. Writing and shipping code, and needing to understand their own code
  2. Positioned to help other teams with their instrumentation patterns and tooling
  3. Firmly cloud-native+ and untethered to hardware or traditional infrastructure

To my delight, since that conversation, these trends have only accelerated—and I, for one, welcome our new platform engineering overlords to the observability table. ☺️

If you’d like to learn more about platform engineering, we’ll be running a Twitter space on ✨ October 20th ✨ at 12:00 p.m. PT. Come join us! I’ll be there along with two colleagues and we’ll be answering your questions and shedding more light on the topic.


[1] I do hear people saying that, and it used to make me fucking furious, but now I just smugly remind myself how much self-inflicted suffering they are in for. Disrespecting operational expertise is the shortest path to never again sleeping through the night.

[2] It is rather incredible how rapidly this idea has taken off. When we started talking about putting developers on call for their code in 2016, people got seriously angry with us. Before that, the only Twitter mention I could find of putting devs on call was one by (of course) Adrian Cockcroft, but by 2019-2020 it had stopped being controversial and soon became common wisdom.

[3] I actually wrote one of those myself: DevOps vs SRE: Delayed Coverage of the Dumbest War. LMAO. I think Liz had the final word on this back in … 2017? 2018? … when she said something like class SRE implements DevOps. And yes, DevOps is a philosophy or a methodology and not a job title, etc.


The Hierarchy Is Bullshit (And Bad For Business)

My friend Molly has had an impressive career. She got a job as a software engineer after graduating from college, and after kicking ass for a year or so she was offered a promotion to management, which she accepted with relish. Molly was smart, driven, and fiercely ambitious, so she swiftly clambered up the ranks to hold director, VP, and other shiny leadership roles. It took two decades, an IPO and a vicious case of burnout before she allowed herself to admit how much she hated her work, and how desperately she envied (guess who??) the software engineers she worked alongside. Turns out, all she ever really wanted to do was write code every day. And now, to her dismay, it felt too late.

Why did it take Molly so long to realize what made her happy? I personally blame the fucking hierarchy.

The Hierarchy Lie

The “Big Lie” of hierarchy is that your organizational structure is a vertical tree from the CEO on down, where higher up is always better.

Of course any new grad is going to feel that way, on the heels of 15-20 years spent going through school year by year, grade by grade, measuring success via good grades and teacher approval. The early years of professional life are a similar blend of hard work, leveling up and basic skills acquisition. (They got Molly hooked on the leveling treadmill before she even had a chance to become a real adult, in other words. 😍)

But by the time you are fully baked as a senior contributor, maybe 7-8 years in, your relationship to levels and ladders should undergo a dramatic shift. At some point you have to learn to tune in to your own inner compass. What draws you in to your work? What fuels your growth and success?

Being an adult means not measuring yourself entirely on other people’s definition of success. Personal growth might come in the guise of a big promotion, but it also might look like a new job, a different role, a swing to management or back, becoming well-known as a subject matter expert, mentoring others, running an affinity group, picking up new skill sets, starting a company, trying your hand at consulting, speaking at conferences, taking a sabbatical, having a family, working part time, etc. No one gets to define that but you.

You have a thirty- or forty-year adult life and career in front of you. What the hell are you going to do with all that time and space??

Your career is not one mad sprint to the finish line

Literally nobody’s career looks like a straight line, going up, up up and to the right, from intern to CEO (to a coffin).

One of the most exhausting things about working at Facebook was the way engineering levels felt like a hamster wheel, where every single quarter you were expected to go go go go go, do more do more, scrape up ever more of your mortal soul to pour in more than you could last quarter — and the quarter before that, and before that, in ever-escalating intensity.

It was fucking exhausting, yo. Life does not work that way. Shit gets hilly.

The strategy for a fulfilling, lifelong career in tech is not to up the ante every interval. Nor is it to amass more and more power over others until you explode. Instead:

  1. Train yourself to love the feeling of constantly learning and pushing your boundaries. Feeling comfortable is the system blinking orange, and it should make you uneasy.
  2. Follow your nose into work that lights you up in the morning, work you can’t stop thinking about. If you’re bored, do something else.
  3. Say yes to opportunities!! Intensity is nothing to be afraid of. Instead of trying to cap your speed or your growth, learn to alternate it with recovery periods.
  4. If you aren’t sure what to do, make the choice that preserves or expands future optionality. Remember: Most startups fail. Will you be okay with your choices if (& when) this one does too?

Why do people climb the ladder? “Because it’s there.” And when they don’t have any other animating goals, the ladder fills a vacuum.

But if you never make the leap from externally-motivated to intrinsically-motivated, it will eventually become a serious risk factor for your career. Without an inner compass (and a renewable source of joy), you will struggle to locate and connect with the work that gives your life meaning. You will risk burnout, apathy and a serious lack of fucks given.

The times I have come closest to burnout or flaming out have never been when I was working the hardest, but when I cared the least. Or when I felt the least needed.📈📉💔

A disturbing number of companies would rather feel in control than unclench and perform better

But hey! Lack of inner drive isn’t the ONLY thing that drives people to climb the ladder. Plenty of companies fuck this up too, all on their lonesome. Let’s talk about more of the ways that companies mess up the workplace! Like by disempowering the people doing the work and giving all the power to managers, thereby forcing anyone who wants a say in their own job to become one.

The way we talk about work is riddled with hierarchical, authoritarian phrases: “She was my superior”, “My boss made me do it”, “I got promoted into management”, and so on.

There are plenty of industries where line workers are still disempowered cogs and power structures are hierarchical and absolute (like flipping burgers at McDonalds, or factory line work). There are even software companies still trying to make it work in command-and-control mode, to whom engineers are interchangeable monkeys that ship story points and close JIRA tasks.

But if there’s one thing we know, it’s that for industries that are fueled by creativity and innovation, command-and-control leadership is poison. It stifles innovation, it saps initiative, it siphons away creativity and motivation and caring.

Studies also show that the more visible someone’s power is, the less likely anyone is to give them honest feedback.[2]

Companies that don’t learn this lesson are unlikely to win over the long run. Engineering is a deeply creative occupation, and authoritarian environments are toxic for creativity and people’s willingness to share information.

Hierarchy is just a data structure

The basic function of a hierarchy is to help us make sense of the world, simplify information, and make decisions. Hierarchy lets us break down enormous projects — like “let’s build a rocket!”, or “let’s invade the moon!” — into millions of bite size decisions and tasks, and this is how progress gets made.

A certain amount of authority is invested into the hierarchy model. If you are responsible for delivering a unit of work, the company needs to make sure you have enough resources and decision-making ability to do so. This is what we think of as the formal power structure [1], and there is nothing wrong with that. It’s what makes the system work.

The problem starts when we stop thinking of hierarchy as a neutral data structure — a utilitarian device for organizing groups and making decisions — and start projecting all kinds of social status and dominance onto it.

A sensitivity to social dominance is wired deep, deep into our little monkey brains. It’s what tells us we deserve more power, leverage, pride, influence, and autonomy — and simply have more value — than those below us. It’s what tells us those above us are better, stronger and more deserving than we are, and that we owe them our respect and deference.

It also tells us “if you lose status, YOU MIGHT DIE” 😱😱😱 which is why we may react to a perceived loss of status with a sting that seems astonishingly extreme and overwrought, even to ourselves, yet somehow impossible to shrug off.

Hierarchies tend to get mixed up with social dominance

In general, it is better to pursue roles and growth based on the affirmative (what it is you want to learn, grow or do more of) than the negative (what you want to avoid, evade or stop doing). Your motivation systems don’t kick into gear when you are feeling “lack of pain” — the system doesn’t work that way. They kick in when you get interested.

And if you are sick of doing something or being treated a certain way, chances are everyone else will hate it, too. Who wants to work at a company where all the shit rolls downhill?

Hierarchies have stuck around for one very good reason: because they work. Hierarchies are simple, intuitive, and allow large numbers of people to collaborate with low cognitive overhead. Unfortunately, most hierarchies become entwined with status and dominance markers, which can bring enormous downsides. At their worst, they can suck the literal life out of work, reducing us all to glum little cogs obeying orders.

We aren’t getting rid of hierarchy anytime soon. But we can use culture and ritual to gently untangle hierarchies from dominance, and we can choose to interpret formal power as a service function instead of a dictatorship. This frees people up to choose their work based on what makes them feel fulfilled, instead of their perceived status. (Also helpful? Flatter pay bands. 😛)

Good managers do not dictate and demand, they nurture, develop, and inspire. The most important roles in the company aren’t held by managers; they are all the little leaf nodes  busily building the product, supporting users, identifying markets, writing copy, etc. The people doing the work are why we exist as a company; all the rest is, with considerable due respect, overhead.

How to drain your hierarchy of social dominance

When it comes to hierarchy and team structure, there are the functional, organizational aspects (mostly good) and the social dominance parts (mostly bad). With that in mind, there are plenty of smaller things we can do as a team to remind people that we are equal colleagues, simply with different roles.

  • Be conscious of the language you use. Does it reinforce dominance and hierarchy? (Step one: stop calling management “a promotion”🥰)
  • De-emphasize trappings of power. The more you refer to someone’s formal power, the less likely anyone is to give them critical feedback or question them.
  • Push back against common but unhelpful practices, like “a manager should always make more money than the people who report to them.” Really? Why??
  • Are there opportunities for career advancement as an IC, or only as a manager? Everyone should have the ability to advance in their career.
  • Do your own dishes, everyone.
  • Practice visualizing the org chart upside down, where managers and execs support their teams from below rather than topping them from above. (I was going to write a whole post about this, then discovered other people have been doing that for the past decade. 🤣)

And then there is the big(ger) thing we can (and must!) do, in order to 1) make people go into management for the right reasons, 2) help senior IC roles remain attractive to highly skilled creative and technical contributors, and 3) encourage everybody to make career decisions based on curiosity, growth, and what’s best for the business, instead of turf and power grabs. Which is:

Practice transparency, from top to bottom

Share authority, decision-making and power

Technical contributors own technical decisions

Most people who go into management don’t do it out of a burning desire to write performance reviews. They do it because they are fed the fuck up with being out of the loop, or not having a say in decisions over their own work. All they want is to be in the room where it happens, and management tends to be the only way you get an invite.

EVERY company says they believe in transparency, but hardly any of them actually are, by my count. Transparency doesn’t mean flooding people with every trivial detail, or freaking them out with constant fire drills. It does mean being actively forthcoming about important questions and matters that are happening or on the horizon … often before you are fully comfortable doing so. Honestly, if you never feel any discomfort about your level of transparency, you probably aren’t transparent enough.

People do better work with more context! You’re equipping them with information to better understand the business problems and technical objectives, and thereby unleashing them and their creativity to help solve them. You’re also opening yourself up to questioning and sanity checks — which may feel uncomfortable, but 🌞sunlight is sanitizing🌞 — it is worth it.

Some practical tips for transparency

At Honeycomb, we present the full board deck in our all-hands after every board meeting, and take questions. When we’re facing financial uncertainty, we say so, along with our working plan for dealing with it. We also do org-level updates in all-hands, once per quarter per org. Each org presents a snapshot to the company of how they are doing, but we ask that no more than two-thirds of the presentation be about their successes and triumphs, and at least one-third be about their failures and misses. Normalize talking about failure.

Being transparent isn’t about putting everyone on blast; it’s about cultivating a habit of awareness about what might be relevant to other people. It’s about building systems of feedback, updates and open questioning into your culture. This can be scary, so it’s also about training yourselves as a team to handle hard news without overreacting or shooting the messenger. If you always tell people what they want to hear, they’ll never trust you. You can’t trust someone’s ‘yes’ until you hear their ‘no’.

Transparency is always a balance between information and distraction, but I think these are healthy internal rules of thumb for management:

  1. If anyone has further questions or wants to know more details than what was shared, they are free to ask any manager or exec, who will willingly answer more fully, up to the boundaries of privacy or legal reasons. As employees, they have a right to know about the business they are part of. A right — not a privilege, which can be revoked on a whim.
  2. When making internal decisions about e.g. salary bands, individual exceptions to formal policy, etc., ask each other … if this decision were to leak, could we justify our reasoning with our heads held high? If you would feel ashamed, or if you really don’t want people to find out about it, it’s probably the wrong decision.

Some practical tips for distributing power

Power flows to managers by default, just like water flows downhill. Managers have to actively push back on this tendency by explicitly allocating powers and responsibilities to tech leads and engineers. Don’t hoard information, share context generously, and make sure you know when they would want to tap into a discussion. Your job is not to “shield” them from the rest of the org; your job is to help them determine where they can add outsize value, and include them. Only if they trust you to loop them in will they feel free to go heads down and focus.

Wrap your senior ICs into planning and other leadership activities. Decisions about sociotechnical processes (code reviews, escalation points, SLI/SLOs, ownership etc) are usually better owned by staff+ engineers than anyone on the management track. Invite a couple of your seniormost engineers to join calibrations — they bring a valuable perspective to performance discussions that managers lack.

Demystify management. Blur the lines between people managers and engineers; delegate ownership and accountability for some important projects to ICs. Ask every engineer about their career interests, and if management is on the list, find opportunities for them to practice and improve at managerial skills — mentoring, interviewing, onboarding, etc.

Adults don’t like being told what to do

People do phenomenal work when they want to do it, when they are creatively and emotionally engaged at the level of optimum challenge, and when they know their work matters. That’s where you’ll find your state of flow. That is where you’ll do your best work, which is also the best way to get promoted and make durable advances in your career.

Not, ironically, by chasing levels and titles for their own sake. ☺️

People want to be challenged. They want you to ask them to step up and take responsibility for something hard. They want to be needed, and they want to have agency in the doing of it. Just like you do.

Oh yeah, back to Molly …

Molly, who I mentioned at the beginning, joined Honeycomb five years ago as a customer success exec. After realizing she wanted to go back to engineering, she switched to working our support desk to build up her technical chops while she practiced writing code on the side. She has now been working as a software engineer on the product team for over two years, and she is ✨rocking it.✨ It is NEVER too late. 🙌

<3 charity

p.s. Molly also says, “don’t waste time at bad companies, whether you’re climbing the ladder or not!” 🥂

 

[1] Formal power is only one kind of power, and in some ways it is the weakest, because it doesn’t belong to you. It belongs to the company and is only loaned out for you to wield on its behalf. (You don’t carry the innate ability to fire people along with you after you stop being an engineering manager, for example.) Formal powers are limited, enumerated, and functional. You don’t get to use them for any reason other than furthering the goals of the org, or else it is literally an abuse of power.

Formal power is fascinating in another way, too: which is that your formal power is seen as legitimate only if you ~basically always wield it in the ways everyone already expects you to. You can make a surprising call only so often; you can straight up overrule the wishes of your constituents extremely rarely. If you use your formal power to do things that people disagree with or don’t support, without taking the time to persuade them or create real consensus, you will squander your credibility and good faith unbelievably fast.

[2] I am not going to bother rustling up lots of links and citations, because I expect most of this falls into the voluminous category of “shit you already knew”. But if any of it sounds surprising to you, here are some classic reference works:

Flow, by Mihaly Csikszentmihalyi
Drive, by Dan Pink
The Culture Code: Secrets of Successful Groups, by Daniel Coyle
A Lapsed Anarchist’s Guide to Being a Better Leader, by Ari Weinzweig

[3] The scientific literature suggests that dominating instincts tend to emerge in more overtly hostile environments. Make of that what you will, I guess.

 

Some other writing I have done on this topic, or topics adjacent …

The Engineer/Manager Pendulum
The Pendulum or the Ladder
If Management Isn’t a Promotion, then Engineering isn’t a Demotion
Twin Anxieties of the Engineer Manager Pendulum
Things to Know About Engineering Levels
Advice for Engineering Managers who want to Climb the Ladder
On Engineers and Influence
Is There a Path Back from CTO to Engineer?


Rituals for Engineering Teams

Last weekend I happened to pick up a book called “Rituals For Work: 50 Ways To Create Engagement, Shared Purpose, And A Culture That Can Adapt To Change.” It’s a super quick read, more comic book than textbook, but I liked it.

It got me thinking about the many rituals I have initiated and/or participated in over the course of my career. Of course, I never thought of them as such — I thought of them as “having fun at work” 🙃 — but now I realize these rituals disproportionately contribute to my favorite moments and the most precious memories of my career.

Rituals (a definition): Actions that a person or group does repeatedly, following a similar pattern or script, in which they’ve imbued symbolism and meaning.

I think it is extremely worth reading the first 27 pages of the book — the Introduction and Part One. To briefly sum up the first couple chapters: the power of creative rituals comes from their ability to link the physical with the psychological and emotional, all with the benefit of “regulation” and intentionality. Physically going through the process of a ritual helps people feel satisfied and in control, with better emotional regulation and the ability to act in a steadier and more focused way. Rituals also powerfully increase people’s sense of belonging, giving them a stable feeling of social connection. (p. 5-6)

The thing that grabbed me here is that rituals create a sense of belonging. You show that you belong to the group by participating in the ritual. You feel like you belong to the group by participating in the ritual. This is powerful shit!

It seems especially relevant these days when so many of us are atomized and physically separated from our teammates. That ineffable sense of belonging can make all the difference between a job that you do and a role that feeds your soul. Rituals are a way to create that sense of belonging. Hot damn.

So I thought I’d write up some of the rituals for engineering teams I remember from jobs past. I would love to hear about your favorite rituals, or your experience with them (good or bad). Tell me your stories at @mipsytipsy. 🙃

Rituals at Linden Lab

Feature Fish Freeze

At Linden Lab, in the ancient era of SVN, we had something called the “Feature Fish”. It was a rubber fish that we kept in the freezer, frozen in a block of ice. We would periodically cut a branch for testing and deployment and call a feature freeze. Merging code into the branch was painful and time-consuming, so if you wanted to get a feature in after the code freeze, you had to first take the fish out of the freezer and unfreeze it.

This took a while, so you would have to sit there and consider your sins as it slowly thawed. Subtext: Do you really need to break code freeze?

Stuffy the Code Reviewer

You were supposed to pair with another engineer for code review. In your commit message, you had to include the name of your reviewer or your merge would be rejected. But the template would also accept the name “Stuffy”, to confess that your only reviewer had been…Stuffy, the stuffed animal.

However, if your review partner was Stuffy, you would have to narrate the full explanation of Stuffy’s code review (i.e., what questions Stuffy asked, what changes he suggested, and what he thought of your code) at the next engineering meeting. Out loud.

Shrek Ears

We had a matted green felt headband with ogre ears on it, called the Shrek Ears. The first time an engineer broke production, they would put on the Ears for a day. This might sound unpleasant, like a dunce cap, but no — it was a rite of passage. It was a badge of honor! Everyone breaks production eventually, if they’re working on something meaningful.

If you were wearing the Shrek Ears, people would stop you throughout the day and excitedly ask what happened, and reminisce about the first time they broke production. It became a way to 1) help new engineers meet lots of their teammates, 2) socialize lots of production wisdom and risk factors, and 3) normalize the fact that yes, things break sometimes, and it’s okay — nobody is going to yell at you. ☺️

This is probably the number one ritual that everybody remembers about Linden Lab. “Congratulations on breaking production — you’re really one of us now!”

Vorpal Bunny


We had a stuffed Vorpal Bunny, duct taped to a 3″ high speaker stand, and the operations engineer on call would put the bunny on their desk so people knew who it was safe to interrupt with questions or problems.

At some point we lost the bunny (and added more offices), but it lingered on in company lore since the engineers kept on changing their IRC nick to “$name-bunny” when they went on call.

There was also a monstrous, 4-foot-long stuffed rainbow trout that was the source of endless IRC bot humor… I am just now noticing what a large number of Linden memories involve stuffed animals. Perhaps not surprising, given how many furries were on our platform ☺️

Rituals at Parse

The Tiara of Technical Debt

Whenever an engineer really took one for the team and dove headfirst into a spaghetti mess of tech debt, we would award them the “Tiara of Technical Debt” at the weekly all hands. (It was a very sparkly rhinestone wedding tiara, and every engineer looked simply gorgeous in it.)

Examples included refactoring our golang rewrite code to support injection, converting our entire Jenkins fleet from AWS instances to containers, and writing a new log parser for the gnarliest logs anyone had ever seen (for the MongoDB pluggable storage engine update).

Bonfire of the Unicorns

We spent nearly 2.5 years rewriting our entire ruby/rails API codebase to golang. Then there was an extremely long tail of getting rid of everything that used the ruby unicorn HTTP server, endpoint by endpoint, site by site, service by service.

When we finally spun down the last unicorn workers, I brought in a bunch of rainbow unicorn paper sculptures and a jug of lighter fluid, and we ceremonially set fire to them in the Facebook courtyard, while many of the engineers in attendance gave their own (short but profane) eulogies.

Mission Accomplished

This one requires a bit of backstory.

For two solid years after the acquisition, Facebook leadership kept pressuring us to move off of AWS and onto FB infra. We kept saying “no, this is a bad idea; you have a flat network, and we allow developers all over the world to upload and execute random snippets of javascript,” and “no, this isn’t cost effective, because we run large multi-terabyte MongoDB replica sets by RAIDing together multiple EBS volumes, and you only have 2.5TB FusionIO (for extremely high-perf mysql/RocksDB) and 40 TB spinning rust volumes (for Hadoop), and also it’s impossible to shrink or slice up replsets”, and so forth. But they were adamant. “You don’t understand. We’re Facebook. We can do anything.” (Literal quote)

Finally we caved and got on board. We were excited! I announced the migration and started providing biweekly updates to the infra leadership groups. Four months later, when the migration was half done, I got a ping from the same exact members of Facebook leadership:

“What are you doing?!?”
“Migrating!”
“You can’t do that, there are security issues!”
“No it’s fine, we have a fix for it.”
“There are hardware issues!”
“No it’s cool, we got it.”
“You can’t do this!!!”

ANYWAY. To make an EXTREMELY long and infuriating story short, they pulled the plug and canned the whole project. So I printed up a ten foot long “Mission Accomplished” banner (courtesy of George W Bush on the aircraft carrier), used Zuck’s credit card to buy $800 of top-shelf whiskey delivered straight to my desk (and cupcakes), and we threw an angry, ranty party until we all got it out of our systems.

Blue Hair

I honestly don’t remember what this one was about, but I have extensive photographic evidence to prove that I shaved the heads of and/or dyed the hair blue of at least seven members of engineering. I wish I could remember why! but all I remember is that it was fucking hilarious.

In Conclusion

Coincidentally (or not), I have no memories of participating in any rituals at the jobs I didn’t like, only the jobs I loved. Huh.

One thing that stands out in my mind is that all the fun rituals tend to come bottom-up. A ritual that comes from your VP can run the risk of feeling like forced fun, in a way it doesn’t if it’s coming from your peer or even your manager. I actually had the MOST fun with this shit as a line manager, because 1) I had budget and 2) it was my job to care about teaminess.

There are other rituals that it does make sense for executives to create, but they are less about hilarious fun and more about reinforcing values. Amazon’s infamous door desks, for example, are basically just a ritual to remind people to be frugal.

Rituals tend to accrue mutations and layers of meaning as time goes on. Great rituals often make no sense to anybody who isn’t in the know — that’s part of the magic of belonging. 🥰

Now, go tell me about yours!

charity


Live Your Best Life With Structured Events

If you’re like most of us, you learned to debug as a baby engineer by way of printf(3). By the time you were shipping code to production you had probably learned to instrument your code with a real metrics library. Maybe a tenth of us learned to use gdb and still step through functions on the regular. (I said maybe.)

Printing stuff to stdout is still the Swiss Army knife of tools. Always there when you reach for it, usually helps more than it does harm. (I said usually.)

And then! In case you’ve been living under a rock, we recently went and blew up ye aulde monolythe, and in the process we … lost most of our single-process tools and techniques for debugging. Forget gdb; even printf doesn’t work when you’re hopping the network between functions.

If your tool set no longer works for you, friend, it’s time to go all in. Maybe what you wanted was a faster horse, but it’s time for a car, and the sooner you turn in your oats for gas cans and a spare tire, the better.

Exercising Good Technical Judgment (When You Don’t Have Any)

If you’re stuck trying to debug modern problems with pre-modern tooling, the first thing to do is stop digging the hole. Stop pushing good data after bad into formats and stores that aren’t going to help you answer the right questions.

In brief: if you aren’t rolling out a solution based on arbitrarily wide, structured raw events that are unique and ordered and trace-aware and without any aggregation at write time, you are going to regret it. (If you aren’t using OpenTelemetry, you are going to regret that, too.)

So just make the leap as soon as possible.

But let’s rewind a bit.  Let’s start with observability.

 

Observability: an introduction

Observability is not a new word or concept, but the definition of observability as a specific technical term applied to software engineering is relatively new — about four years old. Before that, if you heard the term in softwareland it was only as a generic synonym for telemetry (“there are three pillars of observability,” in one annoying formulation) or team names (Twitter, for example, has long had an “observability team”).

The term itself originates with control theory:

“In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals. The concept of observability was introduced by Hungarian-American engineer Rudolf E. Kálmán for linear dynamic systems.[1][2]”

But when applied to a software context, observability refers to how well you can understand and reason about your systems, just by interrogating them and inspecting their outputs with your tools. How well can you understand the inside of the system from the outside?

Achieving this relies on your ability to ask brand new questions, questions you have never encountered and never anticipated — without shipping new code. Shipping new code is cheating, because it means that you knew in advance what the problem was in order to instrument it.

But what about monitoring?

Monitoring has a long and robust history, but it has always been about watching your systems for failures you can define and expect. Monitoring is for known-unknowns: setting thresholds and running checks against the system. Observability is about the unknown-unknowns, which requires a fundamentally different mindset and toolchain.

“Monitoring is the action of observing and checking the behavior and outputs of a system and its components over time.” — @grepory, in his talk “Monitoring is Dead“.

Monitoring is a third-person perspective on your software. It’s not software explaining itself from the inside out, it’s one piece of software checking up on another.

Observability is for understanding complex, ephemeral, dynamic systems (not for debugging code)

You don’t use observability for stepping through functions; it’s not a debugger.  Observability is for swiftly identifying where in your system the error or problem is coming from, so you can debug it — by reproducing it, or seeing what it has in common with other erroring requests.  You can think of observability as being like B.I. (business intelligence) tooling for software applications, in the way you engage in a lot of exploratory, open-ended data sifting to detect novel patterns and behaviors.

Observability is often about swiftly isolating or tracking down the problem in your large, sprawling, far-flung, dynamic system, because the hard part of distributed systems is rarely debugging the code; it’s figuring out where the code you need to debug is.

The need for observability is often associated with microservices adoption, because they are prohibitively difficult to debug without service-level, event-oriented tooling — the kind you can get from Honeycomb and Lightstep … and soon, I hope, many other vendors.

Events are the building blocks of observability

Ergh, another overloaded data term. What even is an “event”?

An observability “event” is a hop in the lifecycle of an end-to-end request. If a request executes code on three services separated by network hops before returning to the user, that request generated three observability “events,” each packed with context and details about that code running in that environment. These are also sometimes called “canonical log lines.” If you have implemented tracing, each event may be a span in your trace.

If request ID #A897BEDC hits your edge, then your API service, then four more internal services, and twice connects to a db and runs a query, then request ID #A897BEDC generated 8 observability events … assuming you are in fact gathering observability data from the edge, the API, the internal services and the databases.

This is an important caveat. We only gather observability events from services that we can and do introspect. If it’s a black box to us, that hop cannot generate an observability event. So if request ID #A897BEDC also performed 20 cache lookups and called out to 8 external HTTP services and 2 managed databases, those 30 hops do not generate observability events (assuming you haven’t instrumented the memcache service and have no instrumentation from those external services/dbs). You get one event per request, per instrumented service hop.

(I also wrote about logs vs structured events here.)
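If it helps to see that in code: here is a minimal sketch of a single service hop emitting a single wide event, using the OpenTelemetry Go API (one span per hop, decorated with attributes). Tracer provider and exporter setup are omitted, and the attribute names are only examples, not a required schema.

// Minimal sketch: one span (i.e., one wide event) for this service hop.
// Without a configured TracerProvider this is a no-op, which is fine for
// illustration; real setup (exporter, sampling, propagation) is omitted.
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

func handleSearch(ctx context.Context, userID string) {
	// One hop, one event: everything this service learns about the request
	// gets attached to this single span.
	ctx, span := otel.Tracer("api").Start(ctx, "/api/v2/search")
	defer span.End()

	span.SetAttributes(
		attribute.String("user_id", userID),
		attribute.String("platform", "android"),
		attribute.Int("status_code", 200),
	)

	// Downstream calls made with this ctx become their own spans (their own
	// events) in the same trace.
	_ = ctx
}

func main() {
	handleSearch(context.Background(), "344310")
}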

Observability is a first-person narrative.

We care primarily about self-reported status from the code as it executes the request path.

Instrumentation is your eyes and ears, explaining the software and its environment from the perspective of your code. Monitoring, on the other hand, is traditionally a third-person narrative — it’s one piece of software checking up on another piece of software, with no internal knowledge of its hopes and dreams.

First-person reports have the best potential for telling a reliable narrative. And more importantly, they map directly to user experience in a way that third-party monitoring does not and cannot.

Events … must be structured.

First, structure your goddamn data.  You’re a computer scientist, you’ve got no business using text search to plow through terabytes of text.

Events …  are not just structured logs.

Now, part of the reason people seem to think structured data is cost-prohibitive is that they’re doing it wrong.  They’re still thinking about these like log lines.  And while you can look at events like they’re just really wide structured log lines that aren’t flushed to disk, here’s why you shouldn’t: logs have decades of abhorrent associations and absolutely ghastly practices.

Instead of bundling up and passing along one neat little pile of context, they’re spewing log lines inside loops in their code and DDoS’ing their own logging clusters. They’re shitting out “log lines” with hardly any dimensions so they’re information-sparse and just straight up wasting the writes. And then to compensate for the sparseness and repetitiveness they just start logging the same exact nouns tens or hundreds of times over the course of the request, just so they can correlate or reconstruct some lousy request that they never should have blown up in the first place!

But they keep hearing they should be structuring their logs, so they pile structure onto their horrendous little strings, which pads every log line by a few bytes, so their bill goes up but they aren’t getting any benefit! Just paying more! What the hell, structuring is bullshit!

Kittens. You need a fundamentally different approach to reap the considerable benefits of structuring your data.

But the difference between strings and structured data is ~basically the difference between grep and all of computer science. 😛

Events … must be arbitrarily wide and dense with context.

So the most effective way to structure your instrumentation, to get the absolute most bang for your buck, is to emit a single arbitrarily wide event per request per service hop. At Honeycomb, the maturely instrumented datasets that we see are often 200-500 dimensions wide.  Here’s an event that’s just 20 dimensions wide:

{
   "timestamp":"2018-11-20 19:11:56.910",
   "az":"us-west-1",
   "build_id":"3150",
   "customer_id":"2310",
   "durationMs":167,
   "endpoint":"/api/v2/search",
   "endpoint_shape":"/api/v2/search",
   "fraud_dur":131,
   "hostname":"app14",
   "id":"f46691dfeda9ede4",
   "mysql_dur":"",
   "name":"/api/v2/search",
   "parent_id":"",
   "platform":"android",
   "query":"",
   "serviceName":"api",
   "status_code":200,
   "traceId":"f46691dfeda9ede4",
   "user_id":"344310",
   "error_rate":0,
   "is_root":"true"
}

So a well-instrumented service should have hundreds of these dimensions, all bundled around the context of each request. And yet — and here’s why events blow the pants off of metrics — even with hundreds of dimensions, it’s still just one write. Adding more dimensions to your event is effectively free, it’s still one write plus a few more bits.

Compare this to metrics-based systems, where you are often in the position of trying to predict whether a metric will be valuable enough to justify the extra write, because every single metric or tag you add contributes linearly to write amplification. Ever gotten billed tens of thousands of dollars for your custom metrics, or had to prune your list of useful custom metrics down to something affordable? (“BUT THOSE ARE THE ONLY USEFUL ONES!”, as every ops team wails)

Events … must pass along the blob of context as the request executes

As you can imagine, it can be a pain in the ass to keep passing this blob of information along the life of the request as it hits many services and databases. So at Honeycomb we do all the annoying parts for you with our integrations. You just install the go pkg or ruby gem or whatever, and under the hood we:

  1. initialize an empty debug event when the request enters that service
  2. prepopulate the empty debug event with any and all interesting information that we already know or can guess: language type, version, environment, etc.
  3. create a framework so you can just stuff any other details in there as easily as if you were printing it out to stdout
  4. pass the event along and maintain its state until you are ready to error or exit
  5. write the extremely wide event out to honeycomb

Easy!
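
To make those five steps concrete, here is a minimal sketch of the pattern in Go. This is not our actual beeline code, and the Event and Middleware names are made up for the example; it just illustrates the shape of the thing: one wide event per request per service hop, prepopulated on entry, enriched along the way, and written out exactly once at the end.

package main

import (
	"context"
	"encoding/json"
	"log"
	"net/http"
	"os"
	"runtime"
	"time"
)

type ctxKey struct{}

// Event is a stand-in for the wide, per-request debug event an integration manages for you.
type Event map[string]interface{}

// FromContext fetches the request's event so handlers can stuff fields in
// as easily as printing to stdout.
func FromContext(ctx context.Context) Event {
	if ev, ok := ctx.Value(ctxKey{}).(Event); ok {
		return ev
	}
	return Event{}
}

// Middleware creates one event per request, prepopulates it, carries it
// through the request, and writes it out once at the end.
func Middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		host, _ := os.Hostname()

		// steps 1 and 2: initialize the event and prepopulate it with
		// everything we already know or can guess
		ev := Event{
			"timestamp":  start.UTC().Format(time.RFC3339Nano),
			"hostname":   host,
			"go_version": runtime.Version(),
			"endpoint":   r.URL.Path,
			"method":     r.Method,
		}

		// steps 3 and 4: pass the event along in the context so any code in
		// the request path can add fields to it
		next.ServeHTTP(w, r.WithContext(context.WithValue(r.Context(), ctxKey{}, ev)))

		// step 5: write the arbitrarily wide event out, one write, however many fields
		ev["durationMs"] = time.Since(start).Milliseconds()
		blob, _ := json.Marshal(ev)
		log.Println(string(blob)) // a real integration would ship this to Honeycomb instead
	})
}

func main() {
	http.Handle("/api/v2/search", Middleware(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ev := FromContext(r.Context())
		ev["customer_id"] = "2310" // anything you know about the request: throw it on the pile
		ev["platform"] = "android"
		w.WriteHeader(http.StatusOK)
	})))
	log.Fatal(http.ListenAndServe(":8080", nil))
}

A real integration does considerably more (tracing, sampling, batching), but the core of the approach described above is just this: gather everything into one blob, write it once.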

(Check out this killer talk from @lyddonb on … well everything you need to know about life, love and distributed systems is in here, but around the 12:00 mark he describes why this approach is mandatory. WATCH IT. https://www.youtube.com/watch?v=xy3w2hGijhE&feature=youtu.be)

Events … should collect context like sticky buns collect dust

Other stuff you’ll want to track in these structured blobs includes:

  1. Metadata like src, dst headers
  2. The timing stats and contents of every network call (our beelines wrap all outgoing http calls and db queries automatically)
  3. Every raw db query, normalized query family, execution time etc
  4. Infra details like AZ, instance type, provider
  5. Language/environment context like $lang version, build flags, $ENV variables
  6. Any and all unique identifying bits you can get your grubby little paws on — request ID, shopping cart ID, user ID, transaction ID, any other ID … these are always the highest value data for debugging.
  7. Any other useful application context.  Service name, build id, ordering info, error rates, cache hit rate, counters, whatever.
  8. Possibly the system resource state at this point in time.  e.g. values from /proc/net/ipv4 stats

Capture all of it. Anything that ever occurs to you (“this MIGHT be handy someday”) — don’t even hesitate, just throw it on the pile. Collect it up in one rich fat structured blob.

Events … must be unique, ordered, and traceable

You need a unique request ID, and you need to propagate it through your stack in some way that preserves sequence. Once you have that, traces are just a beautiful visualization layer on top of your shiny event data.
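
As a small illustration of what that propagation can look like, here is a sketch in Go. The header names are assumptions made up for this example (real tracing systems standardize on headers like the W3C trace context); the idea is simply that the edge mints a trace ID if one does not exist, and every outbound call carries that ID plus the current hop's span ID as the parent.

package propagation

import (
	"crypto/rand"
	"encoding/hex"
	"net/http"
)

// newID returns a random hex ID. Real tracers have their own ID formats and lengths.
func newID() string {
	b := make([]byte, 8)
	rand.Read(b) // error ignored for brevity in this sketch
	return hex.EncodeToString(b)
}

// Propagate carries the trace ID from an inbound request onto an outbound one,
// and records the current hop's span ID as the parent so events can later be
// stitched into a trace. The header names are illustrative, not a standard.
func Propagate(inbound, outbound *http.Request, currentSpanID string) {
	traceID := inbound.Header.Get("X-Trace-Id")
	if traceID == "" {
		traceID = newID() // no inbound trace ID, so this hop is the root of the trace
	}
	outbound.Header.Set("X-Trace-Id", traceID)
	outbound.Header.Set("X-Parent-Span-Id", currentSpanID)
}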

Events … must be stored raw.

Because observability means you need to be able to ask any arbitrary new question of your system without shipping new code, and aggregation is a one-way trip. Once you have aggregated your data and discarded the raw requests, you have destroyed your ability to ask new questions of that data forever. For Ever.

Aggregation is a one-way trip.  You can always, always derive your pretty metrics and dashboards and aggregates from structured events, and you can never go in reverse. Same for traces, same for logs. The structured event is the gold standard. Invest in it now, save your ass in the future.
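
To show what that one-way direction looks like in practice, here is a small sketch in Go that derives a few aggregates (a count, an error rate, a crude p95) per endpoint from raw events. The trimmed-down event shape is invented for the example. You can always compute numbers like these from stored events; nothing in the output would ever let you reconstruct the events.

package main

import (
	"fmt"
	"sort"
)

// Event is a raw, per-request structured event, trimmed down for the example.
type Event struct {
	Endpoint   string
	StatusCode int
	DurationMs float64
}

// aggregate derives per-endpoint rollups from raw events. Going the other way,
// from these rollups back to the individual events, is impossible.
func aggregate(events []Event) {
	byEndpoint := map[string][]Event{}
	for _, ev := range events {
		byEndpoint[ev.Endpoint] = append(byEndpoint[ev.Endpoint], ev)
	}
	for endpoint, evs := range byEndpoint {
		durations := make([]float64, 0, len(evs))
		errors := 0
		for _, ev := range evs {
			durations = append(durations, ev.DurationMs)
			if ev.StatusCode >= 500 {
				errors++
			}
		}
		sort.Float64s(durations)
		p95 := durations[int(float64(len(durations)-1)*0.95)] // crude nearest-rank percentile
		fmt.Printf("%s: count=%d error_rate=%.3f p95=%.0fms\n",
			endpoint, len(evs), float64(errors)/float64(len(evs)), p95)
	}
}

func main() {
	aggregate([]Event{
		{Endpoint: "/api/v2/search", StatusCode: 200, DurationMs: 167},
		{Endpoint: "/api/v2/search", StatusCode: 500, DurationMs: 1240},
		{Endpoint: "/api/v2/search", StatusCode: 200, DurationMs: 98},
	})
}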

It’s only observability if you can ask new questions. And that means storing raw events.

Events … are richer than metrics

There’s always tradeoffs when it comes to data. Metrics choose to sacrifice context and connective tissue, and sometimes high cardinality support, which you need to correlate anomalies or track down outliers. They have a very small, efficient data format, but they sacrifice everything else by discarding all but the counter, gauge, etc.

A metric looks like this, by the way.

{ metric: "db.query.time", value: 0.502, tags: Array(), type: set }

That’s it. It’s just a name, a number and maybe some tags. You can’t dig into the event and see what else was happening when that query was strangely slow. You can never get that information back after discarding it at write time.

But because they’re so cheap, you can keep every metric for every request! Maybe. (Sometimes.) More often, what happens is they aggregate at write time. So you never actually get a value written out for an individual event; it smushes together everything that happens in the one-second interval and calculates some aggregate values to write out. And that’s all you can ever get back to.

With events, and their relative explosion of richness, we sacrifice our ability to store every single observability event about every request. At FB, every request generated hundreds of observability events as it made its way through the stack. Nobody, NOBODY is going to pay for an o11y stack that is hundreds of times as large as production. The solution to that problem is sampling.

Events … should be sampled.

But not dumb, blunt sampling on the server side. Control it on the client side.

Then sample heavily for events that are known to be common and useless, but keep the events that have interesting signal. For example: health checks that return 200 OK usually represent a significant chunk of your traffic and are basically useless, while 500s are almost always interesting. So are all requests to /login or /payment endpoints, so keep all of them. For database traffic: SELECTs for health checks are useless, DELETEs and all other mutations are rare but you should keep all of them. Etc.
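
Here is a rough sketch of what that kind of client-side sampling decision can look like in Go. The routes and rates are made up for illustration; the important parts are that the decision is made per request, with full knowledge of the request, and that the chosen rate gets recorded on the event so the backend can weight counts back up.

package sampling

import "math/rand"

// SampleRate returns "keep 1 in N" for a request, decided on the client side
// where we know everything about the request. Errors, mutations, and
// high-value endpoints are always kept; boring, plentiful traffic is sampled
// heavily. The specific routes and rates here are illustrative only.
func SampleRate(endpoint, dbVerb string, statusCode int) int {
	switch {
	case statusCode >= 500:
		return 1 // errors are almost always interesting: keep them all
	case endpoint == "/login" || endpoint == "/payment":
		return 1 // high-value endpoints: keep them all
	case dbVerb != "" && dbVerb != "SELECT":
		return 1 // DELETEs and other mutations are rare: keep them all
	case endpoint == "/healthz" && statusCode == 200:
		return 100 // healthy health checks are plentiful and useless
	default:
		return 10
	}
}

// ShouldKeep rolls the dice. The event should also record its sample rate
// (e.g. a sample_rate field) so totals can be multiplied back up at read time.
func ShouldKeep(rate int) bool {
	return rand.Intn(rate) == 0
}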

You don’t need to treat your observability metadata with the same care as you treat your billing data. That’s just dumb.

… To be continued.

I hope it’s now blazingly obvious why observability requires — REQUIRES — that you have access to raw structured events with no pre-aggregation or write-time rollups. Metrics don’t count. Just traces don’t count. Unstructured logs sure as fuck don’t count.

Structured, arbitrarily wide events, with dynamic sampling of the boring parts to control costs. There is no substitute.

For more about the technical requirements for observability, read this, this, or this.

**The deep fine print: it’s one observability event per request per service hop … because we gather observability detail organized by request id.  Databases may be different.  For example, with MongoDB or MySQL, we can’t instrument them to talk to honeycomb directly, so we gather information about their internal perspective by 1) tailing the slow query log (and turning it up to log all queries if perf allows), 2) streaming tcp over the wire and reconstructing transactions, 3) connecting to the mysql port as root every couple seconds from cron, then dumping all mysql stats and streaming them into honeycomb as an event.  SO.  Database traffic is not organized around connection length or unique request id, it is organized around transaction id or query id.  Therefore it generates one observability event per query or transaction.
In other words: if your request hit the edge, API, four internal services, two databases … but ran 1 query on one db and 10 queries on the second db … you would generate a total of *17 observability events* for this request.
For more on observability for databases and other black boxes, try this blog post.

Giving Good Feedback: Consider the Ratio

Consider the ratio.

You work with someone great. If someone asked, you’d say they are brilliant, inspired and dedicated. They care deeply about their work, they are timely and reliable (for the most part), and their emojis and dry sense of humor brighten your day. Your work depends on theirs, and you are working together on a neat project which is generating lots of excitement at demo days. You would miss them terribly if they left.

But today you are annoyed. They either didn’t hear or forgot your feedback from the last design review, which means you have to redo some components you thought were finished. It’s a considerable amount of work, and this isn’t the first (or second) time, either. You want to tell them so and try to debug this so it doesn’t keep happening.

So far, so good. Giving feedback like this can be hard, especially if they are senior to you. But do they understand the totality of how you see them? Or was the last time they heard from you the last time they fucked up? Out of the last ten times you gave them feedback, how many were complaining or asking for changes? Does that feedback ratio accurately represent your perception of their value?

This doesn’t mean you have to run around saying “you’re amazing!” all the time, but do be mindful of how other people think you perceive them. I can pretty much guarantee that none of the people you love working with realize just how much you value them, but they are acutely aware of all the ways they fall short or fail you. Here are some ways to correct that imbalance a bit:

  • Don’t be vague. Do be specific. If you just run around saying “You’re awesome!” to people, they will tune you out. Do try to notice and reflect some of the things that make working with them a joy. Like, “I learned so much about mysql indexes pairing with you today, thank you”, or “Last week in our practice session you suggested approaching it this way, and it was so helpful in my situation”, or “I really admire the way you can talk extemporaneously about $topic, and I LOVE knowing I can rely on you with zero prep”. This is harder, and it absolutely takes more work on your part, but it lands. And sticks.
  • Use the Situation, Behavior, Impact framework…but for praise. The SBI framework is designed for delivering hard feedback, but it works just as well for delivering kudos. Use it to give great praise that isn’t generic and does let people know what they’re doing right/what they mean to you. “In the last team meeting, your overview of the messaging framework was super eye-opening for me. I learned more than ever before about not just our pyramid, but how messaging frameworks in general are used. I understand its impact on my role better now than I have in seven years of product marketing!”
  • Ground critique in your overall reaction. Let’s say someone just presented an idea that you think is super interesting and potentially very high value, but you have questions about its impact on marginalized groups. Do they know you think it is interesting and high value, when you launch into your critique? No they do not. If all they hear is several rounds of criticism, they may very well give up and cancel the whole thing, thinking everyone hated it. Something as simple as starting with “I LOVE this idea. Have you thought about —”, or “This is really interesting, but I’m curious…” can be enough to convey a less discouraging, more accurate sense of your perspective.
  • Don’t hold out for the “wow” moments. Sometimes even sharing what you see as a neutral description of someone’s work can be mind-blowing and affirming. Most people don’t realize how much they are just noticed, full stop. It is flattering to be noticed or have the things you said remembered. Being seen can be enough. (h/t @eanakashima)
  • Don’t contribute to a pile-on. Feedback is asymmetric — you can only give feedback as one person at a time (you!), but the recipient might be grappling with negative feedback from many, many people. In that context, anything critical you say is likely to feel like one more rock in a public stoning. Or (somewhat less dramatically), if someone asks for feedback and receives a wave of criticism, they may feel deflated and defeated and drop the entire idea. If that isn’t the outcome you want, try to bring some positive balance to the discussion instead of piling on.
  • Give feedback to grow on. Pure positivity can sound cloying and be easy to discount. If you’re just praising me, I’m learning nothing from it. We’re not talking about a shit sandwich here, but the best compliments are the ones you learn something from. “That was GREAT. It might be even better if…” Relatedly, some people find it hard to believe purely positive feedback, but if you give feedback that shows you understand their work and what they did less well, you gain credibility and they will believe the praise. (h/t @inert_wall).

Hard conversations and corrective feedback are absolutely necessary at times. But even poorly-delivered critiques can be dealt with in the context of a good relationship, when the person knows how much you value them, and even the most delicately delivered criticism can be hard to hear from someone when all you ever seem to hear from them is how much you suck.

Engineers can be the worst at this, because we tend to show our interest by eagerly engaging with an idea or piece of work … by picking it apart, and chattering about all the ways it could be better. 🙃 I generally think this is an awesome way to show love, but we could stand to be clearer about the affection part, and not let the perfect be the enemy of the good. So please consider the ratio of critique vs affirmation when giving feedback.

And there’s no reason to save all the nice words and praise and gratitude for someone’s funeral (or when they leave the company ☺️).

Questionable Advice: Is there a path back from CTO to engineer?

I received this question in the comments section of my piece on The Twin Anxieties of the Engineer/Manager Pendulum, and figured I might as well answer it. It definitely isn’t a question I’ve spent a lot of time thinking about or anything. 🥰

As a former CTO coming off a sabbatical and wanting to go back to engineering, it’s good to hear that this can be done. Having had coding, architecting, and scaling skills before getting pushed to more lead role and looking to get back to working after the sabbatical, what would the roadmap look like to achieve this? Is it still possible having been away for a few years? What would be a good role to target for re-entry: principal/staff engineer? architect? — Mark

Personally? If I were you, I would return to engineering as a regular old software engineer, writing and shipping code every day in the trenches with (this cannot be emphasized enough) a really, really good team.

Your rustiest skill sets are always going to be the most tactical ones — writing software, reviewing code, reproducing bugs, understanding a new production system.

As a former CTO, you have many other skill sets to pull from — management, strategy, architecture, coaching, staffing, fundraising, etc. These skills are valuable. But they don’t degrade the way hands-on development does. You’ll still remember how to write a performance review two (or twenty) years from now, but writing code is like speaking a language: you use it or lose it. And just like with a language, the best way to freshen up is full immersion.

It’s not just about refreshing your technical chops, it’s also about re-acclimating yourself to the rhythms of hours, days, and weeks spent in focus mode, building and creating.

Think back to the time you first moved from engineering into a management role. Do you remember how exhausting and intrusive it was at first, having meeting after meeting after meeting on your calendar? You had to context switch twenty times a day — you were context switching constantly. You had to walk into room after room after room full of people and their words and emotions. By the end of the day you would be drained dry (and the days felt so long).

As an engineer, you spent your days in stretches of deep focus and concentration, punctuated by the occasional meal, meeting or interruption. But as a manager, your life is nothing but interruptions (and time boxes, and time-boxed interruptions). It took time for you to adjust to manager life and learn where to seek out new dopamine hits. And it’s going to take time for you to adjust back.

How much time? About six months, at least for me. I don’t think it’s being overly dramatic to say that you have to allow enough time to become a different version of yourself. You can’t just change personas and entire ways of being like you change your clothing. The process is more like…a snake shedding its skin, or a caterpillar turning into a butterfly. Don’t rush the process.

And don’t just pick up where you left off as an engineer. This is a beautiful opportunity for you to enjoy the terrible frustration of beginner eyes. ☺️ Learn a new language, learn a new framework, learn a new way of packaging and deploying your code. Freshen up your toolchain. Try a new editor. Catch up on some new testing or validation ideas that have developed or gone mainstream since you were last in the coal mines. (Take modern observability for a test drive? 😉)

Shit moves fast. A lot will have changed.

If you try to pick up where you left off as an expert, it’s really going to suck as you stumble through the motions that used to feel effortless and automatic. But if you start with something new, the friction of learning will feel ordinary and predictable instead. And pretty soon you’ll feel the engine start to kick in: the accelerated learning curve you’ll remember from learning a new technical skill for the 50,000th time.

Because it’s not just about refreshing your technical skills and your daily cadence, either… it’s also about reconnecting with your curiosity, and your attachment to (and love for) technology.

And you better fucking love it, if you plan to inflict the world of agonies that is software development on yourself day after day. 😭 So you have to reconnect with that dopamine drip you get from learning things, fixing shit, and solving problems. And you have to reconnect with the emotional intensity of shipping code that will impact people’s lives — for better or for worse — and of being personally responsible for that code in production. Of knowing viscerally what it’s like to ship a diff that brings production down, or wakes up your coworker in the middle of the night, or corrupts user data.

So yes. It is absolutely possible to return to engineering after a few years away. And yeah, you could come back as a principal or staff engineer. Someone will definitely hire you. However, I suggest otherwise. I suggest you come back as a senior engineer, writing software every day, for a good 6-9 months.

Your grounding in the technical challenges and solution space will be much deeper and richer if you come back hands on than if you came in at a higher level, detached from the rhythms of daily development. Working closer to production and closer to users will give you the chance to rebuild the intense empathy and connectedness to your work and team that tends to seep away the higher you go up the food chain. You’ll be more confident in yourself as a technologist if you honor your need to relearn and rebuild. And you will earn much more respect from your fellow engineers this way. Engineers respect people who respect what they do.

It’s better than jumping straight into the role of a staff+ engineer and trying to refresh your tactical/technical skills on the side. And you’ll be an infinitely more effective staff+ engineer once you’ve done the refreshing.

But if it feels like a demotion, or it’s too hard to swallow, or if the politics of promotions at this company make you leery: compromise by getting yourself hired as a staff or principal engineer, while being clear with your hiring manager that you plan to spend the first 6+ months slinging diffs. They should be fine with it (delighted, really) since a) ANY staff+ hire is an investment for the long run, b) this is a great way to onboard any staff+ engineer, and c) I don’t believe anybody can actually have staff+ level impact during their first 6-12 months at a company, since so much of the job has to do with people, process, technical context, systems history, etc., which accrues over time.

Have fun, and do report back! Tell us how it goes!

charity.

P.S.: You don’t say how long it’s been, but I’m operating under the assumption that it’s been 5-10 years since you last worked as an engineer.

P.P.S.: 🚨unsolicited opinion alert🚨 I would personally avoid any role that includes “Architect” in its title (except solutions architects). To me, “software architect” rings of “someone who can no longer write code or perform as a software engineer, who has probably been at the same company for so long that their skills and knowledge now consist entirely of minutiae about that particular company’s technology. likely to be useless and/or helpless at any other company.” I say this with all due apologies to my architect friends, every one of whom is technically dazzling, operationally up-to-date, and has beautiful hair.💆 🥰

 

 

Questionable Advice: Is there a path back from CTO to engineer?

Why On-Call Pain Is A Sociotechnical Problem

Cross-posted from leaddev.com

Most people hate being on call, because most on-call rotations are terrible.

Pager bombs, flappy alerts, false positives going off night and day, sleepless nights… Who can blame them? Small wonder that so many people develop a Pavlovian response to the sound of their Pagerduty ringtone. Alert goes off; adrenaline soars.

Conventional wisdom tells us that being on call means you put your whole life on hold, then spend all week lurching between firefighting and false alarms as you get progressively more sleep-deprived. It sucks, but that’s just what you get when you own your code in production. Right?

Noooooo. Wrong wrongy wrong wrong. Being on call should not be a constant cycle of things breaking down and firefighting, or alerts going off at all hours. This is not ‘normal.’ These are telltale signs of a fragile system and lack of alert discipline.

If on-call is a constant source of pain at your organization, that is a PROBLEM. It’s a five-alarm fire. You should drop what you’re doing and fix it with urgency.

An eternally miserable on-call rotation is a violation of the pact we make to support these systems:

  1. It is engineering’s job to own their code in production.
  2. It is management’s job to make sure it doesn’t suck.

This is a two-way handshake. If management isn’t holding up their end, if they don’t allocate enough time to fix the underlying problems – if they run a feature factory that never stops to refactor or invest in reliability work – then on-call will never get better, and you should leave.

On-call rotations are sociotechnical systems

On-call rotations are a classic example of a sociotechnical problem. A sociotechnical system consists of three elements: in this case that’s your production system, the people who operate it, and the tools they use to enact change on it.

You cannot solve sociotechnical problems with purely people solutions or with purely technical solutions. You need to use both.

The technical problems are usually easier to diagnose. You need to automate failovers, instrument your code, build and test repairing code, audit your indexes, etc. The social problems can be trickier to spot, but here’s a tip: they usually manifest as organizational problems.

Some engineers spend their entire career actively avoiding roles where they would have to be on call. Other engineers cling to the safety buffer of ops teams on call for their code, so that only manual escalations reach them.

Responsibility for your code is increasingly non-optional

This is becoming a harder line to hold, as the consensus has shifted decisively towards engineers owning their own code in production. Our systems are becoming exponentially more complex, and feedback loops are tightening. The people best equipped to own software in production are the people who built it. And in order to own it effectively, they need to close the loop by receiving the signal when something breaks.

But the point is not to invite software engineers into the same circle of hell that ops engineers have traditionally inhabited. This isn’t an act of vengeance. The point is that tightening these feedback loops is how we make systems better. Being on call shouldn’t have to destroy your social life or your sleep schedule.

Yes, engineering owns their software. But ensuring that engineering’s time is respected and their rest time valued is on management. It’s management’s job to make sure time is allocated to fixing recurring or known issues – and that they don’t kick the proverbial can down the road to later turn into tech debt. If reliability or productivity is suffering, managers need to reassign engineering cycles away from feature work. Managers’ performance should be evaluated by the four DORA metrics, as well as a fifth: how often is their team alerted outside of working hours?

It’s reasonable to be woken up two to three times a year when on call. But more than that is not okay. It’s management’s responsibility to ensure enough resources are dedicated to maintaining system stability, and they should be held accountable – not the on-call engineers.

Humans doing human things

We all have lives outside of work – families, doctor appointments, dentist visits, and so on. Instead of being surprised when things come up, we can predict the ways people’s lives will conflict with on-call duty and come up with ways to ease the burden.

  • Kids. I would never ask a new parent to be on call. Being woken up by ONE instrument of chaos is all anyone should ever have to cope with at any given time.
  • Sleepy brain. People are never going to be at their best when they are woken up in the middle of the night. We should make sure alert text, documentation, and steps are all clear, simple, and otherwise tuned for 2 a.m. brain fog.
  • Getting sleep. Sometimes people struggle getting back to sleep, or they were up all night dealing with something. Establish that 1) no one is EVER to be on call two nights in a row after a bad night, and 2) they are entitled to sleep in, come in late, leave early – whatever works best to help them catch up on their sleep.
  • Anxiety. I’ve managed people before who had high anxiety about being on call. They were perfectly willing, but it didn’t matter how quiet the pager was – their anxiety knowing it was on made it impossible to sleep. We tried it for a while, and it wasn’t getting better, so we ultimately found other ways for them to pull their weight.

If someone is absolutely unable to participate in on-call rotations, well, it happens. If it’s a temporary situation, you might want to let it go. But if it’s a permanent thing, like in the ‘anxiety’ example above, the team should address this by finding other ways for that person to do their share of maintenance work.

For example, they could be in charge of failed builds or maintain the dev environment. What matters is that 1) the team as a whole feels like it’s a fair distribution of labor, and 2) there are enough people left in the on-call rotation that no one is overly burdened.

Technical stumbling blocks

  • Un-owned code. Everything you support, and every alert that can fire, should have a team that owns it.
  • Conversely, you may have architectural issues that make it impossible to isolate and alert only the owners. If you have ten different on-call rotations for various areas of the code base, but any time the database gets slow all ten of you get paged, this is a bad situation.
  • SLOs. As you scale up, there will come a point where you can no longer alert on individual services or symptoms. They will simply drown you. At this point, you need to migrate your alerts over to Service Level Objectives. SLOs align your engineering pain directly with user pain. (A rough sketch of the burn-rate arithmetic behind SLO alerts follows this list.)
  • Paging too early. Ah, this always sounds like such a great idea. ‘Wouldn’t it be great if we could catch it and alert someone before the users are impacted?’ But it’s not. It’s a recipe for flappy alerts and aggravation. Alert when users are impacted, not before.
  • Two lanes. You need two types of alerts: ‘WAKE ME UP’ and ‘Deal with this later.’ No more, no less. Keep the list of ‘wake me up’ alerts as short, crisp, and carefully curated as possible. Put everything that needs to be dealt with ‘soon’ in the second lane, and have your on-call engineer sweep through it at the start of the day and the end of the day. If it doesn’t need to be acted upon in the next day, it probably shouldn’t be an alert.
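
For the SLO bullet above, here is a rough sketch of the burn-rate arithmetic these alerts usually reduce to. The specific numbers (a 99.9% target, paging above roughly 14x burn over an hour) are common conventions rather than requirements, and this is an illustration, not any particular vendor's implementation.

package main

import "fmt"

// burnRate reports how fast a window of traffic is consuming the error budget.
// A burn rate of 1.0 means exactly on budget; sustained rates far above 1 over
// a short window are the ones worth waking someone up for.
func burnRate(errored, total int, sloTarget float64) float64 {
	if total == 0 {
		return 0
	}
	observedErrorRate := float64(errored) / float64(total)
	budget := 1.0 - sloTarget // e.g. 0.001 for a 99.9% SLO
	return observedErrorRate / budget
}

func main() {
	// 120 failed requests out of 50,000 in the last hour, against a 99.9% SLO
	br := burnRate(120, 50000, 0.999)
	fmt.Printf("burn rate: %.1f\n", br) // ~2.4: elevated, probably a "deal with this later" alert
	if br > 14 {
		fmt.Println("WAKE ME UP")
	}
}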

On-call problems are often organizational problems

Sometimes people don’t want to be on call, and it’s not due to life events. This is a bit trickier to address because they are actually the result of organizational problems that present themselves as on-call problems. For example:

  • Tribal knowledge, or the ‘bus factor.’ You’re the debugger of last resort because you’ve been responsible for a mission-critical component of the system from the very beginning. The team tried training new people, but you still get called every time something goes wrong, and it’s not clear if the issue would be fixed if you weren’t available (or how long it would take them if they did).
  • Individual ownership vs. team ownership. Software is owned by teams, not by individuals. In an ideal world, this means everyone on the team is capable of debugging and maintaining all the systems they collectively own. In the real world, this means everything is at least understood by more than one engineer.
  • Too little – or too much – coverage. If you only have three to four people in the rotation, that’s too much of your life spent lugging around a laptop. Tossing all 20-30 engineers into a single rotation is also the wrong way to go; engineers won’t be on call often enough to stay familiar with the systems. The ideal on-call rotation has seven to eight people; five people is a bare minimum. With eight people, you are on call for a highly sustainable one week out of every two months.
  • Lower the barriers to asking for help, swapping times, covering for each other, etc. When someone asks for help with their on-call shift, thank them for asking. If the on-call shift isn’t that arduous, it’s really no big deal to back someone up for the duration of a movie.
  • Appointing primary/secondary on-call engineers can be really helpful here. Only the primary needs to get alerted and lug their laptop around, but they have a designated point person to tag if they need to run to the grocery store, drive through the boonies, or otherwise go offline for a while.
  • Put managers on call. I’m not generally a fan of putting managers in the rotation, but they really are the ideal backup situation. Especially when it comes to picking up the pager the day after someone has had a rough night. This serves multiple purposes: it helps keep the manager fresh, it exposes them to the reality of what on-call is currently like, and their time doesn’t have to be swapped for someone else’s.

The next time someone doesn’t want to be on call, it may be time to take a closer look at your organization as a whole to see whether the problem really is resource allocation, risk mitigation, or something else.

Making on-call costs tangible

On the topic of paying people more to be on call: there are loads of opinions here – it’s a very fraught topic. I generally come down on the side of ‘no, it’s part of the job,’ just like it is for doctors. With one big exception.

If you’re having a hard time getting upper management to understand the value of spending engineering cycles on the infrastructure and reliability work that needs to be done, instead of just cranking features… by all means, pay people for being on call.

Pay them for every event they have to respond to.

Pay them well.

Pay them so goddamn well the finance team starts squawking about the need to pay down that reliability debt.

If that’s the only way you can make it real for them, well, use the tools you’ve got. Engineers should never have to quietly suffer the pain of flaky software and unhappy users alone. Give management pain too until they take their jobs seriously enough to see that reliability issues get fixed.

Advice for Engineering Managers Who Want to Climb the Ladder

We have been interviewing and hiring a pile of engineering directors at Honeycomb lately. In so doing, I’ve had some fascinating conversations with engineering managers who have been trying unsuccessfully to make the leap to director.

Here is a roundup of some of the ideas and advice I shared with them, and the original twitter thread that spawned this post.

What is an engineering director?

Given all the title inflation and general inconsistency out there, it seems worth describing what I have in mind when I say or hear “Engineering Director.”

In a traditional org chart, an engineering manager usually manages about 5-8 engineers, an engineering director manages 2-5 engineering managers, and a VP of engineering manages the directors. (At big companies, you may see managers and directors reporting to other managers and directors, and/or you may find a bunch of ‘title padding’ roles like Senior Manager, Senior Director etc.)

In smaller companies, it’s common to have a “Head of Engineering” (this is an appropriately weaselly title that commands just the right amount of respect while leaving plenty of space to hire additional people below or above them). Or all of engineering might roll up to a director or VP or CTO. It varies a lot.

When it comes to the work a director is expected to do, though, there’s a fair bit of consistency: we expect managers to manage ICs, and directors to manage managers.  Directors sit between the line managers and the strategic leadership roles. (More on this later.)

So if you’re an engineering manager, and you want to try being a director, the first thing you’ll want to understand is this: it is generally better to get there by being promoted than by getting offered a director title at a different company.

How to level up

Lots of engineers get tapped by their management to become managers, but not many become directors without a conversation and some intentional growth first. This means that for many of you, trying to become a director may be the first time you have ever consciously solicited a role outside the interview process. This can bring up feelings of awkwardness, even shamefulness or inappropriateness. You’ll just have to push through those.

If you ever want a job in upper leadership, you are going to have to learn how to shamelessly state your career goals. We want people in senior leadership who want to be there and are honing their skills in anticipation of an opportunity. Not “oops, I accidentally a VP.”

It is better to get promoted than hired up a level

There are a few reasons for this. It’s usually easier to get promoted than to get hired straight into a job you’ve never held before (at an org with high standards), and it also tends to be more sustainable and more likely to succeed when you are promoted from within. Being a director is NOT just being a super-duper manager, it’s a different role and function entirely.

A lot of your ability to be successful as a director (or any kind of manager) comes from knowing the landscape, the product and the people, and having good relationships internally. When you are internally promoted, you already know the company and the people, so you get a leg up towards being successful. Whereas if you’ve just joined the company and are trying to learn the tech, the people, the relationships, and how the company works, on top of trying to perform a new role for the first time … well, that is a lot to take on at once.

There are exceptions, sure! Oodles of them[1]. But I would frankly look sideways at a place that wanted to hire me as a director if I haven’t been one, or hadn’t at least managed managers before. It’s at least a yellow flag. It tells me they are probably either a) very desperate or b) very sloppy with handing out titles.

If you want a promotion…

The obvious first step involves asking your manager, “what is the skill gap for me between the job I am doing right now and a director role?” Unlike in the movies, promotions don’t usually get surprise-dropped on people’s heads; people are usually cultivated for them. Registering your interest makes it more likely they will consider you, or help you develop skills in that direction as time moves on.

If you have a good manager who believes in you, and the opportunities exist at your company, that might even be all you have to do.(!)

And if so, lucky you. But as for the remaining ~80-90% of us (ha!) … well, we’ll need a bit more hustle.

Take inventory of your opportunities

Lots of companies aren’t large enough to need directors, or growing fast enough to create new opportunities. This can actually be the most challenging part of the equation, because there are generally a lot more managers who want to be directors than there are available openings.

If you do need to find a new job to reach your career goals, I would target fast-growing companies with at least 100 engineers. If you’re evaluating prospective employers based on your chance of advancement, consider the following:

  • Ask about their policies on internal vs external hires. Do they give preference to existing employees? How do they decide when to recruit vs grow from within?
  • Ask about the last time that someone was promoted into a similar role.
  • Tell the recruiter and hiring manager about your career goals. Don’t be shy. “My next career goal is to gain some experience managing managers” is fine. (That shouldn’t be the only reason you’re interested, of course.)
  • Size up the playing field. Is there oxygen at that level? Or are there a dozen other managers senior to you lined up for the same shot?

There are no sure bets. But you can do a lot to put yourself in the right place at the right time, signal your interest, and be prepared for the opportunity when it strikes.

A director is not a ‘super-senior manager’

A director is not just a manager on steroids: it is an entirely different job. It helps to have been a good manager before becoming a director, because many management skills will translate, but others will be entirely new to you. Expect this.

How being a good director is different from being a good manager

Let’s look at some of the ways that being a good engineering manager is different than being a good director.

  1. You can be a great EM, beloved by your team, without giving much thought to managing out or up. Directors cannot. If anything, it’s the opposite. You may get away with not coddling your EMs, but you must pull your weight for your peers and upper management.
  2. You can have a bit of a reputation for being stubborn or difficult as an EM, and that can be just fine. But having such a rep will probably sabotage your attempt at being promoted to director.
  3. You can be a powerful technical EM who sometimes jumps in to train engineers, be on call, or course correct technical and architectural decisions. This can even burnish your value and reputation as an EM. But this would all be a solid knock against you as a director.

Managers can get away with being opinionated and attached to technology, to some extent, while directors absolutely must balance lots of different stakeholders to achieve healthy business outcomes.

This difference of perspective is why managers will sometimes sniff about directors having sold out, or being “all about politics.”

(Blaming something on “politics” is usually a way of accidentally confessing that you don’t actually understand the constraints someone is operating under, IMO.)

A director’s job is running the business

Here’s the key fact: ✨directors run the business✨.

Managers should be focused on high-performing engineering teams. VPs should be focused on strategy and the longer term. Directors are the execution machines that knit technology with business objectives. (I like this piece, although the lede is a little buried.)

Directors run the business. They are accountable for results. You can’t be bopping in and writing or reviewing code, or tossing off technical opinions. That’s not your job anymore.

Managing managers is a whole new skill set

The distance between managing engineers and managing managers is nearly as vast as the gulf between being an engineer and being a manager.

But it’s sneakier, because you don’t feel out of your depth as much as you did when you became a manager. 😁

As a manager, each of us instinctively draws on our own unique blend of strength and charisma — whatever it is that makes people look up to you and willing to accept your influence. Most of us can’t explain how we do it, because we run on instinct.

But as a director, you have to figure it out. Because you need to be able to debug it when the magic breaks down. You need to help your managers influence and lead using *their* unique strengths. What works for you won’t work for them. You have to learn how to unpack different leadership styles and support them in the way they need.

If you’re working towards a director role:

There are lots of areas where you can improve your director skills and increase your chances of being viewed as director material without any help whatsoever from your manager.

You ✨can not✨ be a blocker

Directors run the business … so you CANNOT be seen as a blocker. People must come to you of their own accord to get shit done and break through the blockers.

If they are going to other people for advice on how to break through YOU, you are not a good candidate for director. Figure out how to fix this before you do anything else.

Demonstrate impact beyond your team(s)

Another way to make yourself an attractive prospect for director is to work on systemic problems, driving impact at the org or company level. You could:

  • work to substantially increase the diversity of your teams or your candidate pipeline, and offer to work with recruiting and other managers to help them do the same (becoming BFFs with recruiting is often a canny move)
  • drive some cross-platform initiative to consolidate dozens of snowflake deploy processes and significantly reduce CI/CD build/deploy times, set an internal SLO for artifact build times, or successfully champion auto-deployment
  • champion an internal tools team with a mandate to increase developer productivity, and quantify the hell out of it
  • lead a revamp of the new hire onboarding process. Or add training and structure to the interview process and set an SLO of responding to every candidate within one week

I dunno — it all depends on what’s broken at your company. 🙃 Identify something causing widespread pain and frustration at the organizational level and fix it. 

Managing ‘up’ is not a ‘nice-to-have’…

If there’s a problem, make sure you are the one to bring it to your manager (and swiftly), along with “Here is the context, here’s where I went wrong, and this is what I’m planning to do about it.” No surprises.

At this point in your career, you should have mastered the art of not being a giant pain in the ass to your manager. Nobody wants a high-maintenance director. Do you reliably make problems go away, or do they boomerang back five times worse after you “fix” them?

…Neither is managing ‘out’

Managing “out” is important too. (Not “managing out”, which means terminating people from the company, but managing “out” as in horizontally, meaning your relationship with your peers.)

What do your peers think of you? Do you invest in those relationships? Do they see you as an ally and a source of wise counsel, or a source of chaos, gossip and instability, or a competitor with turf to protect? If you’re the manager that other managers seek out for a peer check, you might be a good candidate for director.

psst … People are watching you

One of the most uncomfortable things to internalize if you climb the ladder is how much people will make snap judgments about you based on the tiniest fragments of information, and how those judgments may forever color the way they think of you or interact with you.

First impressions might be made by ten minutes together on the same zoom call…a few overheard fragments of people talking about you…even the expressions on your face as they pass you in the hallway. People will extrapolate a lot from a very little, and changing their impression of you later is hard work.

(Yes it’s frustrating, but you can’t really get upset about it, because you and I do it too. It’s part of being human. )

Because of that, you really do have to guard against being too cranky, too tired, or out of spoons. People WILL take it personally. It WILL come back to hurt you.

Remember, you don’t hear most feedback. If you visibly disagree with someone, assume 10x as many silently agree with them. If one person gives you a piece of hard feedback, assume 10x as many will never tell you. Be grateful. The more power you are perceived to have, the less feedback you will ever hear.

Pro tip

You can infer a surprising amount about how good a director candidate may be at their job, simply by listening closely to how they talk about their colleagues. Do they complain about being misunderstood or mistreated, do they minimize the difficulty or quality of others’ work, do they humblebrag, or do they take full responsibility for outcomes? And does their empathy fully extend to their peers in other departments, like sales and marketing?

Does it sound like they enjoy their work, and look forward to beginning it every day? Does it sound like they are all in the same little tugboat, all pulling in the same direction, or is there a baseline disconnect and lack of trust?

In conclusion…

Be approachable, be a drama dampener, project warmth. Control your calendar and carve out regular focus time. Guard your energy — never run your engine under 30%, and always leave something in the tank.

There are a lot more great responses and advice in the replies to my thread, btw. Go check them out if you’re interested … and if you have something to say, contribute! ☺️

charity

Footnotes:

[1] Occasionally, it may work out to your benefit to jump into a new, higher title at a new company. This can happen when someone is already well qualified for the higher role, but is finding it difficult to get promoted (possibly due to insufficient opportunity or systemic biases). Just be aware that the job you were hired into is likely to be one where the titles are meaningless and/or the roles are chaotic. You may want to stay just long enough to get the title, then bounce to a healthier org.

The Truth About “MEH-TRICS”

First published on 2022-04-13 at https://www.honeycomb.io/blog/truth-about-meh-trics-metrics

A long time ago, in a galaxy far, far away, I said a lot of inflammatory things about metrics.

“Metrics are shit salad.”

“Metrics are simply nerfed dimensions.”

“Metrics suck,” “metrics are legacy,” “metrics and time series aggregates will fucking kneecap you.”

I cannot tell a lie; Twitter will testify that I’ve spent the past six years ragging on metrics. So much so that ever since we launched Honeycomb Metrics last year, our poor solution architects have been encountering skeptics in the field who repeat my quotes back to them and ask, dubiously, whether Honeycomb Metrics are any good or not, and whether we genuinely plan on investing in it or not, given our known anti-metrics sympathies.

That’s a great question. 😊

Metrics aren’t worthless; they’re just limited.

Metrics are a mature technology that’s been around for over 30 years, and they have some real advantages. They’re tiny, fast, and cheap; you can hold a bunch of them in memory as counters, summaries, and gauges. They aggregate well and take up a fixed amount of storage space. The entire monitoring industry is built on top of metrics.

When it comes to workloads like, “How heavy is the write load on my hard drive?” or “What is the temperature or fan status inside my chassis?” or “What is the traffic rate in and out of this interface on my switch?”, metrics are what you should use. In fact, pretty much any time you want to know the health of a system or component in toto, metrics are the right tool.

Because that’s what metrics do best—report statistics in aggregate, from the perspective of any system or component. They can tell you that your Ruby HTTP worker pool is 70% utilized or that your nginx webserver is returning 502s 1% of the time. What they can’t tell you is what this means for any one of your users, applications, delivery vehicles, and so forth.

Until recently, metrics-based tools or logs were the only game in town. People were trying to sell us metrics tools for observability use cases, and that’s what got my goat so badly. If you simply append “… for observability” to each of my inflammatory statements, then I stand by them completely.

“Metrics are shit salad … for observability.”

Yup, rings true.

You’re never going to make a metrics tool like Prometheus or Datadog into an observability tool. You’re just not. Observability is about unknown-unknowns, while metrics are a tool for known-unknowns.

If you need a refresher on the differences between observability and monitoring, I’ll refer you to pieces like this, this, and this. What I want to talk about here is slightly different. In a post-observability world, what is the true and proper place for metrics tooling?

Metrics and observability have different use cases.

Metrics aren’t completely useless, even if you have a robust observability presence. We still use metrics at Honeycomb to this day for certain workloads—and always will because they’re the right tool for the job.

There are two kinds of workloads, roughly speaking: your code—the code you write, review, ship, debug and maintain on a daily basis. And other people’s code—the code you have to run and use in order to support your code. Some examples of the latter might be: Linux, Docker, MySQL, Amazon RDS, Kafka, AWS Lambda, GCP gateways, memcache, CI/CD pipelines, Kubernetes, etc.

Your code is your crown jewels, the code you need to survive and succeed as a business. It changes constantly—many times per week, if not per day. You are expected to understand its inner workings intimately, and spend lots of time chasing down bugs or understanding and reproducing behavior. You care about the way it performs and interacts with each and every individual user, with changing infrastructure state, and under a variety of different load conditions.

That is why your code demands observability. In order to understand your software, you must first instrument it, in a way that collects lots of rich context and bundles it up around each event end-to-end. Then you need to stream those events into a tool that lets you slice and dice and trace and explore with support for high-cardinality and high-dimensionality data. That’s the only way you’re going to be able to correlate errors, track down outliers, and reflect each user’s experience.

But what about the rest of the software? You can’t instrument Amazon RDS, and only crazy people would instrument, rebuild, and repackage things like Kafka or Docker or nginx. The whole point of third-party software is that you DON’T USE IT until it’s stable enough to be taken more or less for granted. Sure, you roll updates, but usually on the order of months or years—not every day. You don’t need to be intimately familiar with its inner workings because you aren’t changing it every day. Those aren’t your crown jewels.

You do care about their health though, only differently. You care about whether you need to provision more capacity or not. You care about knowing how hard you’re hammering on the underlying hardware or hypervisor. That’s why metrics and monitoring are the right tools to use for third-party code. They don’t let you peer under the hood in the same way, or slice and dice in the same way, but that’s okay. You shouldn’t have to.

With third-party stuff, you don’t care about the code, you care about the health of the service. In aggregate.

(There are some kinds of in-between software, like databases, where event-level information is super useful for debugging things like slow queries and lock percentages, and you can use various black box techniques to approximate observability without instrumentation. But in general this model holds up quite well.)

In a post-observability world, what are metrics for?

I’ve often pointed out that observability is built on top of arbitrarily wide structured data blobs, and that metrics, logs, and traces can be derived from those blobs while the reverse is not true—you can’t take a bunch of metrics and reformulate a rich event.

And yes, people who have observability typically find themselves using metrics and dashboards less and less. They’re simply not as versatile or useful as events that you can slice and dice and manipulate in infinite ways. And you can derive aggregates and trends from the events you have stored.

But metrics will always be useful for understanding third-party software, from the perspective of the service, cluster, or node. They will always be the right tool for the job when it comes to software interfacing with hardware. And they can be super complementary when you are investigating your code using events and instrumentation.

If you’re an engineer writing and shipping code, you’re never not going to want to know if your change caused memory usage to triple, or CPU utilization to skyrocket, or disk usage or network throughput to saturate. That’s why we built Honeycomb Metrics as an overlay, a way to enhance or validate your understanding of the impact your code changes have had on the underlying system.

Metrics are also valuable as a bridge to the past. People have been instrumenting software for metrics for 30 years—they’re never going away completely, and not everything can or should be reinstrumented with events. Lots of people already have robust monitoring systems that slurp in millions of metrics. Nobody wants to have to redo all that work just because they’re moving to a different tool, so people tend to point their metrics firehose at Honeycomb as a way of getting started as they roll observability out into their code.
